Deduplication
HashBackup has the ability to deduplicate data to reduce backup size. If an identical block of data is seen more than once during the backup or has been seen in a previous backup, it is only saved once. This decreases backup time because of fewer writes to disk during the backup, decreases backup transmission time for offsite copy operations, and decreases the space required for storing backups. This document gives some guidance about how to effectively use dedup in your environment.
HashBackup’s dedup feature is enabled with the -D backup option or the
dedup-mem config option. The -D backup option overrides dedup-mem for
one backup. The value following -D (or dedup-mem) is the amount of
memory (RAM) you wish to dedicate to the dedup function. For example,
-D1g will use up to 1GB of RAM for dedup. The default for dedup-mem is
100M, since most computer systems will have this much RAM available.
You might want dedup disabled for some backups and enabled for others,
even within the same -c backup directory. If you have data that will
not benefit from dedup, for example, a large collection of images with
no duplicates, use -D0 to disable dedup. If data is saved with dedup
turned off, that data will never be used in the future for dedup.
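As a sketch, assuming a backup directory of /hb and data directories
/data and /photos (all hypothetical paths), these options might be
combined like this:
$ hb config -c /hb dedup-mem 1g    # allow up to 1GB of RAM for dedup on every backup
$ hb backup -c /hb /data           # dedup enabled via the dedup-mem config option
$ hb backup -c /hb -D0 /photos     # disable dedup for this backup of non-duplicate images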
Block Size
HashBackup splits files into blocks during the backup. Sometimes it
chooses a block size, sometimes it uses a variable block size, or you
can force a specific block size with the -B and -V backup options and
the block-size-ext config option. In general, dedup efficacy and
HashBackup metadata overhead are inversely related to the block size:
a smaller block size makes dedup work better, but generates more
blocks and requires more block-tracking metadata in the database. A
larger block size creates fewer blocks with less overhead, but does
not dedup as well.
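As a rough sketch of this trade-off (the paths are hypothetical, and
the size values assume -B accepts the same size suffixes as -D):
$ hb backup -c /hb -B1m /vm/images   # larger blocks: fewer blocks and less metadata, coarser dedup
$ hb backup -c /hb -B64k /home       # smaller blocks: more metadata, but finer-grained dedup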
Capacity
HashBackup can dedup about 8 million blocks of backup data with 100MB of memory. With a block size of 64K, HB can dedup one 524GB data file (8M x 64K) or 8 million files of 64K or less, using just 100MB of RAM. With a larger block size of 1MB and -D100M, HB can dedup one 8TB file (8M x 1MB), or 8 million files of 1MB or less. The current memory limit is 48GB, or about 4 billion blocks.
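These figures are simply the block count times the block size; a quick
sanity check of the 64K case:
$ echo $(( 8 * 1000000 * 64 * 1024 / 1000000000 ))   # 8M blocks x 64KiB, in GB
524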
Initially, HB does not use all of the dedup memory. Instead, it
starts with a small dedup table and doubles it as necessary until the
maximum limit is reached. Specifying -D4G, for example, will not
immediately allocate a huge 4GB block of memory. But if the memory
limit is too high for your system and HB does try to use it all, it
will cause excessive paging (swapping) and likely make your system
appear to freeze.
It is not necessary for all blocks to fit into the dedup table. HB will still do a very good job of deduping a backup even after the dedup table has reached maximum size. It is much more important that the dedup table fit completely into memory, without paging or swapping, than for the dedup table to hold all blocks that will ever be stored in the backup.
For large backups and backups of VM disk images (backing up the large
VM image from the host, not backing up inside a VM), use a reasonable
memory limit, perhaps half of your available free memory. This can be
checked with tools like top. By specifying the amount of memory to
use for dedup information, HashBackup can maintain a constant backup
rate, even for huge backups. The trade-off is that if the size
specified is not large enough to cover all blocks in the backup, dedup
effectiveness decreases. It is preferable to maintain a high backup
rate and make a best effort to dedup rather than have perfect dedup
that is extremely slow or locks the system because it uses too much
memory. Simulated backups, described below, are a useful dedup
capacity planning tool for your specific data.
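For instance, if top shows roughly 16GB of free memory (a hypothetical
figure, as is the path), a large VM-image backup might cap dedup
memory at about half of that:
$ hb backup -c /hb -D8g /vm/images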
With sharded backups, each shard runs in parallel and has its own
dedup table. To use a total of 1GB of RAM for dedup with 4 shards,
set dedup-mem to 250MB.
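For example, assuming the same hypothetical /hb backup directory and a
value format matching the 100M default:
$ hb config -c /hb dedup-mem 250m    # 4 shards x 250MB = 1GB of dedup memory in total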
Resizing
During backups, HB will automatically resize the dedup table as it
fills, up to the limit specified with -D or dedup-mem. Backup reports
the fill percentage of the dedup table when it starts, and detailed
information is also available with the stats command. If the dedup
table is full (the second number displayed by the backup command), its
size can be increased with the dedup-mem config option. The next
backup will expand the dedup table for future backups.
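For example, to review the dedup table statistics and then give the
table more room for future backups (the /hb path and 200m value are
hypothetical):
$ hb stats -c /hb                    # includes detailed dedup table information
$ hb config -c /hb dedup-mem 200m    # the next backup expands the table up to 200MB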
It’s not common, but you can shrink the dedup table by reducing either
-D or the dedup-mem config option. This does not affect already-deduped
data in previous backups. HB may automatically shrink the dedup table
if it is very underutilized. This can happen if a lot of data is backed
up, greatly expanding the dedup table, and then this data is removed
from the backup with the rm command or is expired with the retain
command.
Simulations
There are many variables to take into account when trying to optimize backups:
- how big are your files?
- how often are they modified?
- how much could they be deduped with an unlimited dedup table?
- how much can they be compressed?
- best block size to use?
Most of this is dependent on your particular data, so guidelines are
hard to give. HashBackup can be used to perform experiments to figure
out which options work best in your environment. To run an experiment,
use hb config after hb init to enable simulated backups:
$ hb init -c exp1
$ hb config -c exp1 simulated-backup true
Now when you backup with -c exp1, no backup data is actually written,
but all of the other actions happen: files are split into blocks,
compressed, deduped, etc. You can also do daily incremental backups
into this backup directory for a few days for a better simulation.
Then use hb stats -c exp1 to view statistics about your simulated
backup. Add -v for more explanation of the statistics.
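Continuing the example, a simulated run and its statistics might look
like this, with /data standing in for the data you want to test
(hypothetical path) and -D1g as one setting to experiment with:
$ hb backup -c exp1 -D1g /data
$ hb stats -c exp1 -v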
For very large filesystems, the --sample backup option can be used to
backup a percentage of the data to help plan dedup capacity.