Deduplication

HashBackup can deduplicate data to reduce backup size. If an identical block of data is seen more than once during a backup, or has been seen in a previous backup, it is only saved once. This decreases backup time (fewer disk writes), shortens transmission time for offsite copy operations, and reduces the space required to store backups. This document gives some guidance on how to use dedup effectively in your environment.

HashBackup’s dedup feature is enabled with the -D backup option or the dedup-mem config option. The -D backup option overrides dedup-mem for one backup. The value following -D (or dedup-mem) is the amount of memory (RAM) you wish to dedicate to the dedup function. For example, -D1g will use up to 1GB of RAM for dedup.
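
As a sketch, dedup memory can be set for a single backup with -D or made the default with dedup-mem; backupdir and /home below are placeholders:

$ hb backup -c backupdir -D1g /home      # use up to 1GB of RAM for dedup, this backup only
$ hb config -c backupdir dedup-mem 1g    # make 1GB the default for future backups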

The default for dedup-mem is 100M, since most computer systems will have this much RAM available. You might want dedup disabled for some backups and enabled for others, even within the same -c backup directory. If you have data that will not benefit from dedup, for example a large collection of images with no duplicates, use -D0 to disable dedup. Data saved with dedup turned off will never be used as a dedup source in future backups.
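
For example, a single backup of a photo archive could skip dedup while other backups in the same directory keep it enabled (backupdir and /photos are placeholders):

$ hb backup -c backupdir -D0 /photos     # dedup disabled for this backup only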

Block Size

HashBackup splits files into blocks during the backup. It normally chooses a block size itself, sometimes fixed and sometimes variable, but a specific block size can be forced with the -B and -V backup options and the block-size-ext config option. In general, dedup efficacy and HashBackup metadata overhead are inversely related to the block size: a smaller block size makes dedup work better, but generates more blocks and requires more block-tracking metadata in the database. A larger block size creates fewer blocks with less overhead, but does not dedup as well.
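
For example, a backup of large media files might force a larger fixed block size. The 4M value here is only illustrative, and the size suffix is assumed to follow the same form as the -D examples above:

$ hb backup -c backupdir -B4m /video     # force a larger fixed block size for this backup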

Capacity

HashBackup can dedup about 8 million blocks of backup data with 100MB of memory. With a block size of 64K, HB can dedup one 524GB data file (8M x 64K) or 8 million files of 64K or less, using just 100MB of RAM. With a larger block size of 1MB and -D100M, HB can dedup one 8TB file (8M x 1MB), or 8 million files of 1MB or less. The current memory limit is 48GB, or about 4 billion blocks.
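
These ratios give a rough rule of thumb for sizing dedup memory: estimate the number of blocks by dividing the backup size by the block size, then allow about 100MB of dedup memory per 8 million blocks. For example, 2TB of data at a 256K block size is roughly 8 million blocks, so -D100M covers it; the same 2TB at a 64K block size is roughly 30 million blocks and would need around -D400M for complete coverage.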

Initially, HB does not use all of the dedup memory. Instead, it starts with a small dedup table and doubles it as necessary until the maximum limit is reached. Specifying -D4G, for example, will not immediately allocate a huge 4GB block of memory. But if the memory limit is too high for your system and HB does grow the table to use it all, it will cause excessive paging (swapping) and likely make your system appear to freeze.

It is not necessary for all blocks to fit into the dedup table. HB will still do a very good job of deduping a backup even after the dedup table has reached maximum size. It is much more important that the dedup table fit completely into memory, without paging or swapping, than for the dedup table to hold all blocks that will ever be stored in the backup.

For large backups and backups of VM disk images (backing up the large VM image from the host, not backing up inside a VM), use a reasonable memory limit, perhaps half of your free memory; this can be checked with tools like top. By specifying the amount of memory to use for dedup information, HashBackup can maintain a constant backup rate, even for huge backups. The trade-off is that if the size specified is not large enough to cover all blocks in the backup, dedup effectiveness decreases. It is preferable to maintain a high backup rate and make a best effort at dedup than to have perfect dedup that is extremely slow or locks up the system because it uses too much memory. Simulated backups, described below, are a useful dedup capacity planning tool for your specific data.

With sharded backups, each shard runs in parallel and has its own dedup table. To use a total of 1GB of RAM for dedup with 4 shards, set dedup-mem to 250MB.
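
For example, with 4 shards in a backup directory named backupdir (a placeholder), each shard gets its own 250MB table:

$ hb config -c backupdir dedup-mem 250m    # 250MB per shard, 1GB total across 4 shards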

Resizing

During backups, HB automatically resizes the dedup table as it fills, up to the limit specified with -D or dedup-mem. The backup command reports the fill percentage of the dedup table when it starts, and detailed information is also available with the stats command. If the dedup table is full (the second number displayed by the backup command), its size can be increased with the dedup-mem config option; the next backup will expand the dedup table for future backups.
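
For example, the fill level can be checked and the limit raised for future backups; backupdir and the new size are placeholders:

$ hb stats -c backupdir                    # shows dedup table usage, among other statistics
$ hb config -c backupdir dedup-mem 200m    # let the table grow during the next backup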

It’s not common, but you can shrink the dedup table by reducing either -D or the dedup-mem config option. This does not affect data already deduped in previous backups. HB may automatically shrink the dedup table if it is very underutilized. This can happen if a lot of data is backed up, greatly expanding the dedup table, and then that data is removed from the backup with the rm command or expired with the retain command.

Simulations

There are many variables to take into account when trying to optimize backups:

  • how big are your files?

  • how often are they modified?

  • how much could they be deduped with an unlimited dedup table?

  • how much can they be compressed?

  • what is the best block size to use?

Most of this is dependent on your particular data, so guidelines are hard to give. HashBackup can be used to perform experiments to figure out which options work best in your environment. To run an experiment, use hb config after hb init to enable simulated backups:

$ hb init -c exp1
$ hb config -c exp1 simulated-backup true

Now when you back up with -c exp1, no backup data is actually written, but all of the other actions happen: files are split into blocks, compressed, deduped, etc. You can also do daily incremental backups into this backup directory for a few days for a better simulation. Then use hb stats -c exp1 to view statistics about your simulated backup; add -v for more explanation of the statistics.
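
A complete experiment might look like this, with the source path, dedup memory, and block size as placeholders to vary between runs:

$ hb backup -c exp1 -D1g /home             # simulated full backup
$ hb backup -c exp1 -D1g /home             # run again on later days for incrementals
$ hb stats -c exp1 -v                      # review dedup and compression statistics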

For very large filesystems, the --sample backup option can be used to back up a percentage of the data to help plan dedup capacity.
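
As an illustration, assuming --sample takes a percentage value (the exact syntax here is an assumption):

$ hb backup -c exp1 --sample 10 /data      # simulated backup of roughly 10% of the data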