
                HashBackup Dedup Information

HashBackup has the ability to "dedup" data.  This means that if a
block of data is seen more than once during the backup, or has been
seen in a previous backup, it is only saved once.  This can decrease
backup time because of fewer writes to disk during the backup, can
decrease backup transmission time for offsite copy operations, and can
decrease the space required for storing backups.  This document gives
some guidance about how to effectively use dedup in your environment.

HashBackup's dedup feature is enabled with the -D option or the
dedup-mem config option.  By default, dedup is off.  The value
following -D (or dedup-mem) is the amount of memory (RAM) you wish to
dedicate to the dedup function.  HashBackup itself will use an
additional 100-120MB of memory above this.  For example, -D1g will use
up to 1GB of RAM for dedup.  If dedup is enabled with the config
command, for example, hb config -c backupdir dedup-mem 1g, the -D
option isn't necessary.
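For example, either of the following enables dedup with a 1GB memory
limit.  The backupdir and /home/user names below are placeholders, and
the hb backup -c DIR PATH form shown is the usual invocation:

$ hb backup -c backupdir -D1g /home/user

or, setting it once so every backup of this directory uses it:

$ hb config -c backupdir dedup-mem 1g
$ hb backup -c backupdir /home/user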

The default for dedup is off (no -D option or -D0), because enabling
it requires some thought about how much memory you have to dedicate to
dedup information.  You may leave dedup disabled for some backups and
enable it for others, even within the same -c backup directory.  If
you have data that will not benefit much from dedup, for example, a
large collection of images, it is recommended that you leave dedup
turned off (the default).

If you save data with dedup turned off, that data will never be used
in the future for dedup.  So if you create your first backup with
dedup off and then decide you really want dedup enabled, you should
clear the backup and make a new initial backup with dedup enabled.
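As a sketch, assuming the hb clear command is available in your
version to remove the existing backup data, and using placeholder
names:

$ hb clear -c backupdir
$ hb backup -c backupdir -D1g /home/user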

Dedup, block size, and metadata
-------------------------------

HashBackup splits files into blocks during the backup.  Sometimes it
chooses a fixed block size, sometimes it uses a variable block size,
and you can force a specific block size with the -B option.  In
general, dedup effectiveness and HashBackup's metadata overhead both
depend on the block size, and they pull in opposite directions: a
smaller block size makes dedup work better, but generates more blocks
and requires more block-tracking metadata, while a larger block size
means fewer blocks and less overhead, but does not dedup as well.
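As an illustration, assuming -B accepts the same size suffixes as -D
(64k, 1m, and so on):

$ hb backup -c backupdir -D1g -B64k /home/user
$ hb backup -c backupdir -D1g -B1m /home/user

The first run favors dedup at the cost of more metadata; the second
favors lower overhead at the cost of weaker dedup.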

Sizing the dedup table
----------------------

As a general guideline, HashBackup can dedup 8 million blocks of
backup data with 100MB of memory.  With a block size of 64K, HB can
dedup one 524GB data file (8M x 64K) or 8 million files of 64K or
less, using just 100MB of RAM.  With a larger block size of 1MB and
-D100M, HB can dedup one 8TB file (8M x 1MB), or 8 million files of
1MB or less.  The 32-bit versions of HashBackup have a dedup memory
limit of 2GB.  The 64-bit versions currently have a limit of 48GB.
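A rough way to estimate a starting value (this is an approximation for
planning, not an exact formula used by HB):

  blocks     ~  total data size / block size
  dedup RAM  ~  (blocks / 8 million) x 100MB

For example, 2TB of data at a 128K block size is roughly 16 million
blocks, which points to about -D200M.  The same 2TB at a 1MB block
size is roughly 2 million blocks, well within what -D100M can track.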

Initially, HB does not use all of the dedup memory.  Instead, it
starts with a small dedup table and doubles it as necessary until the
maximum limit is reached.  Specifying -D4G for example will not
immediately allocate a huge 4GB block of memory.  But, if the memory
limit is too high for your system and HB does try to use it all, it
will cause excessive paging and likely make your system "lock up".

It is not necessary for all blocks to fit into the dedup table.  HB
will still do a very good job of deduping a backup even after the
dedup table is full.  It is much more important that the dedup table
fit completely into memory, without paging or swapping, than for the
dedup table to be big enough to hold all blocks that will ever be
stored in the backup.

For large backups and backups of VM disk images (backing up the large
VM image from the host, not backing up inside a VM), use a reasonable
memory limit, perhaps half of your available free memory.  This can
be checked with tools like "top".  By specifying the amount of memory
to use for dedup information, HashBackup can maintain a constant
backup rate, even for huge backups.  The trade-off is that if the size
you specify is not large enough to cover all blocks in the backup,
dedup effectiveness will decrease.  It is preferable to maintain a
high backup rate and make a "best effort" to dedup rather than have
perfect dedup that is extremely slow or locks the system because it
uses too much memory.
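For example, on Linux the free command (top works on most systems)
shows how much memory is unused; if roughly 4GB is free, dedicating
about half of it is a reasonable starting point.  The 2g value and
path below are only illustrative:

$ free -h
$ hb backup -c backupdir -D2g /vm/images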

Resizing the dedup table
------------------------

During backups, HB will automatically resize the dedup table as it
fills, up to the limit specified with -D.  Backup reports the fill
percentage of the dedup table when it starts.  If the dedup table is
full (the second number displayed) and you want to expand the size,
just increase it with -D or the dedup-mem config option.  The next
backup will expand the dedup table for future backups.
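For example, to raise the limit from 1GB to 2GB for future backups
(the 2g value is only illustrative):

$ hb config -c backupdir dedup-mem 2g
$ hb backup -c backupdir /home/user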

It could be that many blocks have already fallen out of the dedup
table because it was at capacity.  To include previous blocks in your
newly-expanded dedup table, delete the hash.db file.  On the next
backup (with increased -D), HB will rebuild the dedup table and
include as many blocks as it can.

It's not common, but if you want to shrink the dedup table, delete
hash.db and use a smaller amount of memory with -D or dedup-mem.
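The sketch below assumes hash.db is stored in the -c backup directory;
adjust the path if yours is elsewhere.  The same steps apply whether
the goal is a larger or a smaller table; only the dedup-mem value
differs:

$ rm backupdir/hash.db
$ hb config -c backupdir dedup-mem 2g
$ hb backup -c backupdir /home/user

The next backup rebuilds hash.db at the new size.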

Simulations
-----------

There are many variables to take into account when trying to optimize
backups:

- how big are your files?
- how often are they modified?
- how much can they be deduped with an unlimited dedup table?
- how much can they be compressed?
- what block size works best?

Most of this is dependent on your particular data, so guidelines are
hard to give.  HashBackup can be used to perform experiments to figure
out which options work best in your environment.  To run an
experiment, use hb config after hb init:

$ hb init -c exp1
$ hb config -c exp1 simulated-backup true

Now when you back up with -c exp1, no backup data is actually written,
but all of the other actions happen: files are split into blocks,
compressed, deduped, etc.  You can also do daily incremental backups
into this backup directory for a few days to get a better simulation.
Then use hb stats -c exp1 to view statistics about your simulated
backup.  Add -v for more explanation of the statistics.
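Putting it together, a full experiment might look like this (the
/home/user path and 1g value are placeholders, and the hb backup
-c DIR PATH form is the usual invocation):

$ hb init -c exp1
$ hb config -c exp1 simulated-backup true
$ hb backup -c exp1 -D1g /home/user
$ hb backup -c exp1 -D1g /home/user
$ hb stats -c exp1 -v

Repeat the backup step daily (or against changed copies of your data)
for a few days before looking at the statistics.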

