Deduplication

HashBackup has the ability to deduplicate data. This means that if a block of data is seen more than once during the backup, or has been seen in a previous backup, it is only saved once. This can decrease backup time because of fewer writes to disk during the backup, can decrease backup transmission time for offsite copy operations, and can decrease the space required for storing backups. This document gives some guidance about how to effectively use dedup in your environment.

HashBackup’s dedup feature is enabled with the -D backup option or the dedup-mem config option. The value following -D (or dedup-mem) is the amount of memory (RAM) you wish to dedicate to the dedup function; by default, dedup uses 100MB. HashBackup itself will use an additional 100-120MB of memory above this. For example, -D1g will use up to 1GB of RAM for dedup. If dedup is enabled with a config command, for example hb config -c backupdir dedup-mem 1g, the -D backup option isn’t necessary.
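
As a quick sketch, assuming a backup directory named backupdir and a source path of /home (both illustrative), the two ways of enabling dedup look like this:

# enable 1GB of dedup memory for this backup only
$ hb backup -c backupdir -D1g /home

# or set it persistently so every backup in this directory uses it without -D
$ hb config -c backupdir dedup-mem 1g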

The 100MB default was chosen because most computer systems will have this much RAM available. You might want dedup disabled for some backups and enabled for others, even within the same -c backup directory. If you have data that will not benefit much from dedup, for example a large collection of images with no duplicates, use -D0 to disable dedup.
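
For instance, reusing the illustrative backupdir, dedup could be skipped just for a backup of a photo collection (the /photos path is made up here):

# disable dedup for this backup only; other backups into backupdir can still use it
$ hb backup -c backupdir -D0 /photos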

If you save data with dedup turned off, that data will never be used for dedup in the future. So if you create your first backup with dedup off and then decide you really do want dedup, you should clear the backup and make a new initial backup with dedup enabled.
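
A minimal sketch of that restart, assuming the clear command in your version of HashBackup removes the existing backup data (check hb help to confirm) and reusing the illustrative backupdir and /home paths:

# remove the existing backup data, then start over with dedup enabled
$ hb clear -c backupdir
$ hb backup -c backupdir -D1g /home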

Block Size

HashBackup splits files into blocks during the backup. Sometimes it chooses a block size itself, sometimes it uses a variable block size, and you can also force a specific block size with the -B backup option. In general, both dedup effectiveness and HashBackup’s metadata overhead vary inversely with the block size. That is, a smaller block size makes dedup work better, but will generate more blocks and require more block-tracking metadata. A larger block size means fewer blocks and less overhead, but does not dedup as well.
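
As an illustration, assuming -B accepts a size value in the same style as -D (the exact suffix format may vary by version), a specific block size could be forced like this:

# force a 64K block size for this backup (backupdir and /data are illustrative)
$ hb backup -c backupdir -B64k /data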

Capacity

As a general guideline, HashBackup can dedup 8 million blocks of backup data with 100MB of memory. With a block size of 64K, HB can dedup one 524GB data file (8M x 64K) or 8 million files of 64K or less, using just 100MB of RAM. With a larger block size of 1MB and -D100M, HB can dedup one 8TB file (8M x 1MB), or 8 million files of 1MB or less. The 32-bit version of HashBackup has a dedup memory limit of 2GB. The 64-bit version currently has a memory limit of 48GB, or about 4 billion blocks.
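
As a rough sizing sketch based on the 8-million-blocks-per-100MB guideline, here is how the math works out for an example 2TB dataset with a 64K block size (both numbers are made up for illustration):

# blocks = data size / block size: 2TB at a 64K block size, both expressed in KB
$ echo $(( 2 * 1024 * 1024 * 1024 / 64 ))      # 33554432 blocks
# memory = (blocks / 8 million) x 100MB, about 4 x 100MB, so something like -D400m
# (or -D512m for headroom)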

Initially, HB does not use all of the dedup memory. Instead, it starts with a small dedup table and doubles it as necessary until the maximum limit is reached. Specifying -D4G, for example, will not immediately allocate a huge 4GB block of memory. However, if the memory limit is too high for your system and HB does try to use it all, it will cause excessive paging and likely make your system lock up.

It is not necessary for all blocks to fit into the dedup table. HB will still do a very good job of deduping a backup even after the dedup table is full. It is much more important that the dedup table fit completely into memory, without paging or swapping, than that it hold every block that will ever be stored in the backup.

For large backups and backups of VM disk images (backing up the large VM image from the host, not backing up inside a VM), use a reasonable memory limit, perhaps half of your available free memory; this can be checked with tools like top. By specifying the amount of memory to use for dedup information, HashBackup can maintain a constant backup rate, even for huge backups. The trade-off is that if the size you specify is not large enough to cover all blocks in the backup, dedup effectiveness will decrease. It is preferable to maintain a high backup rate with a best-effort dedup than to have perfect dedup that is extremely slow or locks up the system because it uses too much memory.
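
For example, on a Linux host with roughly 8GB free (free -h is a common way to check; top works too, as noted above), dedicating about half of that to dedup might look like this, with the VM image path being illustrative:

# check free memory, then give dedup roughly half of it
$ free -h
$ hb backup -c backupdir -D4g /vm/images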

Resizing

During backups, HB will automatically resize the dedup table as it fills, up to the limit specified with -D or dedup-mem. Backup reports the fill percentage of the dedup table when it starts. If the dedup table is full (the second number displayed) and you want to expand it, just increase the limit with -D or the dedup-mem config option. The next backup will expand the dedup table for future backups.
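
As a sketch, reusing the illustrative backupdir, the limit could be raised persistently like this (the 2g value is just an example):

# raise the dedup memory limit; the next backup grows the table as needed
$ hb config -c backupdir dedup-mem 2g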

It’s not common, but you can also shrink the dedup table by reducing either -D or the dedup-mem config option.

Simulations

There are many variables to take into account when trying to optimize backups:

  • how big are your files?

  • how often are they modified?

  • how much can they be deduped with an unlimited dedup table?

  • how much can they be compressed?

  • what is the best block size to use?

Most of this is dependent on your particular data, so guidelines are hard to give. HashBackup can be used to perform experiments to figure out which options work best in your environment. To run an experiment, use hb config after hb init to enable simulated backups:

$ hb init -c exp1
$ hb config -c exp1 simulated-backup true

Now when you back up with -c exp1, no backup data is actually written, but all of the other actions happen: files are split into blocks, compressed, deduped, etc. You can also do daily incremental backups into this backup directory for a few days for a better simulation. Then use hb stats -c exp1 to view statistics about your simulated backup. Add -v for more explanation of the statistics.
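
For example, a simulated backup and its statistics might be run like this (the /home path is illustrative):

$ hb backup -c exp1 /home
$ hb stats -c exp1 -v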