Deduplication
HashBackup has the ability to deduplicate data. This means that if a block of data is seen more than once during the backup, or has been seen in a previous backup, it is only saved once. This can decrease backup time because of fewer writes to disk during the backup, can decrease backup transmission time for offsite copy operations, and can decrease the space required for storing backups. This document gives some guidance about how to effectively use dedup in your environment.
HashBackup’s dedup feature is enabled with the -D option or the dedup-mem config option. By default, dedup uses 100MB. The value following -D (or dedup-mem) is the amount of memory (RAM) you wish to dedicate to the dedup function. HashBackup itself will use an additional 100-120MB of memory above this. For example, -D1g will use up to 1GB of RAM for dedup. If dedup is enabled with a config command, for example hb config -c backupdir dedup-mem 1g, the -D backup option isn’t necessary.
The default for dedup is 100MB, since most computer systems will have this much RAM available. You might want dedup disabled for some backups and enabled for others, even within the same -c backup directory. If you have data that will not benefit much from dedup, for example a large collection of images with no duplicates, use -D0 to disable dedup.
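For example, a photo collection and a set of home directories could share the same -c backup directory but use different dedup settings on each backup run (the paths here are only placeholders):
$ hb backup -c backupdir -D0 /photos
$ hb backup -c backupdir -D1g /home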
If you save data with dedup turned off, that data will never be used in the future for dedup. So if you create your first backup with dedup off and then decide you really want dedup enabled, you should clear the backup and make a new initial backup with dedup enabled.
Block Size
HashBackup splits files into blocks during the backup. Sometimes it chooses a fixed block size, sometimes it uses a variable block size, or you can force a specific block size with the -B backup option. In general, dedup effectiveness and HashBackup metadata overhead both depend on block size, but in opposite directions: a smaller block size makes dedup work better, but generates more blocks and requires more block-tracking metadata, while a larger block size means fewer blocks and less overhead, but does not dedup as well.
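To experiment with a specific block size, the -B backup option can be given alongside the dedup options; the 1M value below is only an example, written in the same size-suffix style used for -D:
$ hb backup -c backupdir -B1M -D1g /data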
Capacity
As a general guideline, HashBackup can dedup 8 million blocks of backup data with 100MB of memory. With a block size of 64K, HB can dedup one 524GB data file (8M x 64K) or 8 million files of 64K or less, using just 100MB of RAM. With a larger block size of 1MB and -D100M, HB can dedup one 8TB file (8M x 1MB), or 8 million files of 1MB or less. The 32-bit version of HashBackup has a dedup memory limit of 2GB. The 64-bit version currently has a memory limit of 48GB, or about 4 billion blocks.
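As a rough sizing sketch based on these ratios (assuming the guideline of about 8 million blocks per 100MB of dedup memory):
  blocks needed = total data / block size
  dedup memory = (blocks needed / 8 million) x 100MB
For example, 2TB of data at a 64K block size is roughly 31 million blocks, suggesting around 400MB of dedup memory, e.g. -D400M.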
Initially, HB does not use all of the dedup memory. Instead, it starts with a small dedup table and doubles it as necessary until the maximum limit is reached. Specifying -D4G, for example, will not immediately allocate a huge 4GB block of memory. But if the memory limit is too high for your system and HB does try to use it all, it will cause excessive paging and likely make your system lock up.
It is not necessary for all blocks to fit into the dedup table. HB will still do a very good job of deduping a backup even after the dedup table is full. It is much more important that the dedup table fit completely into memory, without paging or swapping, than for the dedup table to hold all blocks that will ever be stored in the backup.
For large backups and backups of VM disk images (backing up the large VM image from the host, not backing up inside a VM), use a reasonable memory limit, perhaps half of your available free memory. This can be checked with tools like top. By specifying the amount of memory to use for dedup information, HashBackup can maintain a constant backup rate, even for huge backups. The trade-off is that if the size you specify is not large enough to cover all blocks in the backup, dedup effectiveness will decrease. It is preferable to maintain a high backup rate and make a best effort to dedup rather than have perfect dedup that is extremely slow or locks the system because it uses too much memory.
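On Linux, for example, the free command shows available memory before choosing a limit, and top works similarly on other systems; the 1g value below is only an illustration:
$ free -h
$ hb backup -c backupdir -D1g /data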
Resizing
During backups, HB will automatically resize the dedup table as it fills, up to the limit specified with -D or dedup-mem. The backup command reports the fill percentage of the dedup table when it starts. If the dedup table is full (the second number displayed) and you want to expand it, just increase the limit with -D or the dedup-mem config option. The next backup will expand the dedup table for future backups.
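For example, raising the limit with the config command (continuing the backupdir example above) lets future backups grow the dedup table up to 2GB:
$ hb config -c backupdir dedup-mem 2g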
It’s not common, but you can also shrink the dedup table by reducing either -D or the dedup-mem config option.
Simulations
There are many variables to take into account when trying to optimize backups:
- how big are your files?
- how often are they modified?
- how much can they be deduped with an unlimited dedup table?
- how much can they be compressed?
- what is the best block size to use?
Most of this is dependent on your particular data, so guidelines are hard to give. HashBackup can be used to perform experiments to figure out which options work best in your environment. To run an experiment, use hb config after hb init to enable simulated backups:
$ hb init -c exp1
$ hb config -c exp1 simulated-backup true
Now when you back up with -c exp1, no backup data is actually written, but all of the other actions happen: files are split into blocks, compressed, deduped, etc. You can also do daily incremental backups into this backup directory for a few days for a better simulation. Then use hb stats -c exp1 to view statistics about your simulated backup. Add -v for more explanation of the statistics.
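Continuing the example above, a simulated run and its report might look like this, where /data is only a placeholder for whatever you plan to back up:
$ hb backup -c exp1 /data
$ hb stats -c exp1 -v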