Performs a backup. Advanced options are described later, but commonly-used options are:
$ hb backup [-c backupdir] [-D mem] [-v level] [-X] [-t tagname] path1 path2 ...
-c the backup is stored in the local backup directory specified with
the -c option, or in a default backup directory (described below) if
-c isn't used. If you don't want a complete local copy of the
backup, set cache-size-limit with the
config command and create a
dest.conf file in the backup directory. See
Destinations for details and examples.
-D enable dedup, saving only one copy of identical data blocks.
-D is followed by the amount of memory you want to dedicate to the dedup operation, for example:
-D1g would use 1 gigabyte of RAM,
-D100m would use 100 MB. The more memory you have available, the more duplicate data HashBackup can detect, though it is still effective even with limited memory. Dedup can also be enabled with the
dedup-mem config option. See Dedup for detailed information and recommendations.
-D0 disables dedup for this backup. Unlike most options,
-D uses a factor of 1024, so
-D100m means 100 * 1024 * 1024 bytes. Dedup is disabled by default since the best memory limit is system-dependent.
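Because -D uses a 1024 multiplier, the byte value behind a given option can be checked with ordinary shell arithmetic (a standalone illustration, not part of HashBackup):

```shell
# -D100m with HashBackup's 1024 multiplier: 100 * 1024 * 1024 bytes
dedup_mem=$((100 * 1024 * 1024))
echo "$dedup_mem"    # prints 104857600
```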
-v controls how much output is displayed; higher numbers show more:
-v0 = print version, backup directory, copy messages, error count
-v1 = print names of skipped filesystems
-v2 = print names of files backed up (this is the default)
-v3 = print names of excluded files
-X backup will not cross mount points (separate filesystems) unless
-X is used, so normally you have to explicitly list on the command line each
filesystem you want to backup. Use the
df command to see a list of
your filesystems. To see what filesystem a directory is on, use df
followed by a pathname, or use
df . to see the current directory's filesystem.
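The standard df utility (not HashBackup-specific) can report the filesystem containing a given path; the mount point is the last column of the second output line:

```shell
# Show the filesystem holding the current directory.
df -P .
# Extract just the mount point from df's output:
mount_point=$(df -P . | awk 'NR==2 {print $NF}')
echo "$mount_point"
```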
-t tagname tag the backup with information-only text that is displayed when listing backup versions.
Each backup creates a new version in the backup directory containing all files modified since the previous backup. An easy way to think of this is a series of incremental backups, stacked one on top of the other. HashBackup presents the illusion that every version is a full backup, while performing the backups at incremental speeds and using incremental backup storage space and bandwidth. While full backups are an option, there is no requirement to do them. HashBackup is designed to not need full backups and you will experience huge savings of time, disk space, and bandwidth compared to traditional backup methods. The
retain command explains how to maintain weekly or monthly snapshots - or both!
Specifying the Backup Directory
All commands accept a
-c option to specify the backup directory.
HashBackup stores your backup data and the encryption key in
this directory. If
-c is not specified on the command line, the environment variable
HASHBACKUP_DIR is checked. If this environment
variable doesn't exist, the directory
~/hashbackup is used if it exists. If
~/hashbackup doesn't exist,
/var/hashbackup is used.
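The lookup order above can be sketched as a small shell function (an illustration of the documented rules, not HashBackup's actual code):

```shell
# Mirror of the documented lookup order for the backup directory.
# $1 is the -c argument, if any.
resolve_backup_dir() {
    if [ -n "$1" ]; then                     # 1. -c on the command line
        echo "$1"
    elif [ -n "$HASHBACKUP_DIR" ]; then      # 2. HASHBACKUP_DIR env var
        echo "$HASHBACKUP_DIR"
    elif [ -d "$HOME/hashbackup" ]; then     # 3. ~/hashbackup, if it exists
        echo "$HOME/hashbackup"
    else                                     # 4. fallback
        echo /var/hashbackup
    fi
}
resolve_backup_dir /mnt/backup    # prints /mnt/backup
```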
Backup directories must first be initialized with the init
command. If you do not want a complete local copy of your backup, set the
cache-size-limit config option with the
config command to
limit the size of the local backup directory, then set up a dest.conf
file to send backup data to a destination.
HashBackup saves all of your files' properties in a database, so backups can be stored on any type of "dumb" storage: FAT, VFAT, CIFS/SMB/Samba, NFS, USB thumb drives, SSHFS, WebDAV, FTP servers, etc. All file attributes, including hard links, ACLs, and extended attributes will be accurately saved and restored, even if not supported by the storage system. An encrypted incremental backup of the backup database is automatically included with every backup.
It’s possible to backup directly to a mounted remote filesystem or USB
drive with the
-c option. However, the backup performance may be
slower than with a local filesystem, and these may not provide robust
sync and lock facilities. Sync facilities are used to ensure that the
backup database doesn’t get corrupted if your computer halts
unexpectedly during the backup process, for example, if there is a
power outage. Lock facilities ensure that two backups don’t run at
the same time, possibly overwriting each others' backup files. It’s
best to specify a local backup directory with
-c and let HashBackup
copy the backup data to mounted remote storage or USB drive by using a
Dir destination in the
dest.conf file in your backup directory.
Remote destinations are optional. They are set up in the dest.conf
file, located in the backup directory. See the
Destinations page for more detailed
info on how to configure destinations for SSH, RSYNC, and FTP servers,
Amazon S3, Google Storage, Backblaze, Gmail, IMAP/email servers, and
simple directory (Dir) destinations. Dir destinations are used to
copy your backup to USB thumb drives, NFS, WebDAV, Samba, and any
other storage that is mounted as a filesystem.
The backup command creates backup archive files in the backup
directory. After arc files fill up, they can be transmitted to one or
more remote destinations, configured with a dest.conf file.
Transmissions run in parallel with the backup process. After the
backup has completed, the backup command waits for all files to
finish transmitting.
By default, archives are kept in the local backup directory after
sending to remote destinations, making restores very fast because data
does not have to be downloaded from a remote site. Keeping a local
copy also adds redundancy and makes it easier to pack remote archives
(remove deleted or expired backup data) because archives don’t have to
be downloaded first. It is possible to delete archives from the local
backup directory as they are sent to destinations by using the
cache-size-limit config option.
If you are storing your backup directly to a USB or other removable
drive with the
-c option, and that’s the only backup copy you want,
you do not need to use the
dest.conf file at all.
The dest.conf file is only used when you want to copy your backup, usually to a remote site.
Important Security Note: your encryption key is stored in the file
key.conf in the backup directory. Usually this is a local
directory, but if you are writing your backup directly to mounted
remote storage with the
-c option, for example, Google Drive,
DropBox, etc., or to a removable USB drive, be sure to set a
passphrase when initializing your backup directory. This is done
with the -p ask option to
hb init or
hb rekey. Both the
passphrase and key are required to access your backup. If you are
writing directly to mounted remote storage that you control, such as
an NFS server, a passphrase may not be necessary. If you are using a
Dir destination to copy your backup to remote storage
or a removable drive, but your backup directory and key are on a
local drive, a passphrase is not necessary because the key is never
copied to destinations.
A handful of system files and directories, such as /sys and
hibernation files, are automatically excluded.
/var/hashbackup and the
-c backup directory itself are
also automatically excluded. An
inex.conf file is created by the
init command in the backup directory showing which files are
excluded by default.
To exclude other files, edit the
inex.conf file in the backup
directory. The format of this file is:
# comment lines
E(xclude) <pathname>

# exclude all .wav files:
Exclude *.wav

# exclude the /Cache directory and its contents:
e /Cache

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyz still works.
e /Cache/

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyzdir will save the directory itself
# since it was explicitly requested, but will not save xyzdir's
# contents.
e /Cache/*
Any abbreviation of the exclude keyword at the beginning of the line
is recognized. Tilde characters
~ are not expanded into user home directories.
There are several other ways to exclude files from the backup:
no-backup-tag can be set to a list of filenames, separated by commas. If a directory contains any of these files, only the directory itself and the tag files are saved. HB does not read or check the contents of tag files.
files with the nodump flag set are not backed up (but on Linux, this also requires setting a config option)
no-backup-ext can be set to a list of comma-separated extensions, with or without dots. Any file ending in one of the listed extensions is not backed up. This test is done without regard to uppercase or lowercase, so
avi will exclude both file.avi and file.AVI
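The case-insensitive test can be sketched in shell (an illustration only; HashBackup's real matching is internal, and the filenames are hypothetical):

```shell
# Sketch of a case-insensitive extension test like no-backup-ext avi:
# lowercase the name, then match the extension.
excluded_by_ext() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *.avi) return 0 ;;   # would be excluded from the backup
        *)     return 1 ;;   # would be backed up
    esac
}
excluded_by_ext movie.AVI && echo "movie.AVI excluded"
excluded_by_ext notes.txt || echo "notes.txt backed up"
```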
Web browser caches are difficult to backup for a number of reasons (see Browser Caches note), and should always be excluded from backups, especially on large servers that store browser caches for many users.
The backup command has a few more options that may be useful:
-B blocksize the backup command splits files into blocks of data.
It chooses a "good" block size based on the type of file being saved
and by default uses variable-sized blocks. Variable block sizes work
well for files that change in unpredictable ways, like a text file.
For files that change in fixed-size units - raw database files (not
text dumps!), VM images, etc. - you may want to force a specific,
fixed block size. Fixed-size blocks can be anywhere from 128 bytes to
2GB-1 (2,147,483,647) bytes.
-B forces a fixed block size, which usually dedups less than a
variable block size except for cases where a file is updated in
fixed chunks; then it is usually more efficient than variable blocks.
Larger block sizes are more efficient in general but dedup less; small
block sizes cause more overhead but dedup better. Sometimes the extra
dedup savings of a small block size is less than the extra metadata
needed to track more blocks. Since this is very data dependent,
experiment with your actual data to determine the best block size to use.
-B1m usually speeds up backups on a single CPU system because
it disables variable block sizes.
-B4M or higher is useful when
files do not have much common data, for example media files. If there
are identical files, backup will only save 1 copy. A huge block size
like 64M may be useful for large scientific data files that will have
little duplicate data. Using a large block size does not make small
files use more backup space. Using simulated backups (described
below) with different block sizes is very helpful for determining the
best block size. Unlike most options,
-B uses a multiplier of 1024, so
-B4K means 4096 bytes rather than 4000 bytes.
IMPORTANT: very large block sizes will use a lot of RAM. A good
estimate is 8x block size with a couple of CPUs, or more if there are
more CPUs. Using
-p0 disables multi-threading and will use less RAM
and CPU but will also be slower.
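The rule of thumb above can be made concrete with shell arithmetic (a rough illustration of the stated 8x estimate, not a measured figure):

```shell
# Estimated RAM for -B64M with a couple of CPUs: about 8x the block size.
blocksize=$((64 * 1024 * 1024))      # -B uses a 1024 multiplier
ram_estimate=$((blocksize * 8))
echo "$ram_estimate"                 # prints 536870912, about 512 MB
```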
-F filesources read pathnames to be saved from file sources. More
than one file source can be used. This is often used with very large
file systems to avoid having to scan for modified files by having the
processes that modify files update a file list. Each file source is
either:
- the pathname of a file containing pathnames to backup, one per line. Blank lines and lines beginning with # (comments) are ignored. Pathnames in the file can be either individual files or whole directories to save. If a pathname doesn't exist in the filesystem but is present in the backup, it is marked deleted in the backup.
- the pathname of a directory, where each file in the directory contains a list of pathnames to backup as above.
The backup command line can still have pathnames to backup in addition
to -F, but these must come before the -F options.
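A file source might look like this (the pathnames are hypothetical, and the final hb command is shown for context only):

```shell
# Build a hypothetical file source; blank lines and # comments are ignored.
cat > /tmp/filelist <<'EOF'
# list maintained by the process that modifies these files
/home/jim/projects

/home/jim/notes.txt
EOF
# Count the pathnames a backup would read from it:
grep -cv -e '^#' -e '^$' /tmp/filelist    # prints 2
# Then: hb backup -c backupdir -F /tmp/filelist
```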
--full the backup command usually backs up only modified files and
determines this by looking at mtime, ctime, and other file attributes
for each file. This option overrides that behavior and causes every
file to be saved, even if it has not changed. Taking a full backup
adds redundancy to your backup data. Another way to accomplish this
is to create a new backup directory on both local and remote storage,
but this is harder to manage and retain will not work across multiple
backup directories. Full backups also limit the number of incremental
archives required to do large restores, though in testing, restoring a
complete OSX system from 3 years of daily incrementals took only 15%
longer than restoring from a full backup.
-m maxfilesize skip files larger than
maxfilesize. The size can
be specified as 100m or 100mb for 100 megabytes, 3g or 3gb for 3
gigabytes, etc. This limit does not apply to fifo or block device
backups because they always have a size of zero.
--maxtime time specifies the maximum time to spend actually saving
files. When this time is exceeded, the backup stops and waits for
uploads to finish using
--maxwait, which is adjusted based on how
long the backup took.
--maxwait time when used by itself, specifies the maximum time to
wait for files to be transmitted to remote destinations. When used
with --maxtime, the wait time is reduced by the time taken to create
the backup.
These two options are useful for huge initial backups, which take much
longer than incremental backups. They allow huge backups to span many
days without going outside the time reserved for backups. They also
prevent incrementals from running into production time when a large
amount of data changes for some reason. If a backup does not complete
within --maxtime, the next backup using the same command line
will restart where the previous backup left off, without rescanning
the old data. This allows backups of huge filesystems with tens of
millions of files over a period of time. Be aware that these time
limits are not very accurate for various technical reasons. Large arc
files, large files, slow upload rates, and many worker threads cause
more variability in the timing. It’s a good idea to set the time
limits an hour less than you really want to get a feel for how much
variability occurs in your backup environment.
--maxtime 1h means backup for up to 1 hour then wait the remainder
of the hour to upload the data, ie, total time is limited to 1 hour
--maxwait 1h means backup everything requested, however long it
takes, but only wait 1 hour for uploads to finish
--maxtime 1h --maxwait 1h means backup for up to 1 hour, then wait 1
hour + the remaining backup time for uploads to finish, ie, the total
time is limited to 2 hours
--maxtime 1h --maxwait 1y means backup for 1 hour, then wait a year
for uploads to finish, ie, only the backup time is limited
--no-ino on Unix filesystems, every file has a unique inode number.
If two paths have the same inode number, they are hard linked and
both paths refer to the same file data. Some filesystems do not have
real inode numbers: FUSE (sshfs, UnRaid) and Samba/SMB/CIFS for
example. Instead, they invent inode numbers and cache them for a
while, sometimes for many days. HashBackup normally verifies a path’s
inode number during backups, and if it changes, it triggers a backup
of the path. Inode numbers are also used to detect hard links. If a
filesystem does not have real inode numbers, it causes unexpected and
unpredictable full backups of files that haven’t changed and after
enough of these mistakes, HashBackup will stop with an error. The
--no-ino option prevents these unnecessary full backups.
A negative side-effect of
--no-ino is that hard-link detection is
disabled since it relies on stable inode numbers. With --no-ino,
hard-linked files are backed up like regular files and if restored,
are not hard-linked.
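The inode behavior behind hard-link detection can be seen directly with stat (GNU coreutils assumed; the filenames are just examples):

```shell
# Hard links share a single inode number; --no-ino gives up this detection.
tmp=$(mktemp -d)
touch "$tmp/orig"
ln "$tmp/orig" "$tmp/link"      # create a hard link to the same file data
stat -c %i "$tmp/orig"
stat -c %i "$tmp/link"          # same inode number as orig
```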
--no-mtime there are two timestamps on every Unix file: mtime is the
last time a file’s data was modified, and ctime is the last time a
file’s attributes were modified (permissions, etc). When a file’s
data is changed, both mtime and ctime are updated. If only the
attributes are changed, only ctime is updated. Because mtime can be
set by user programs, some administrators may not trust it to indicate
whether a file’s data has changed. The
--no-mtime option tells the
backup command to verify whether file contents have changed by
computing a strong checksum and checking that with the previous backup
rather than trusting mtime. This is not recommended for normal use
since the backup program has to read every unmodified file’s data to
compute the checksum, then possibly read it again to backup the file.
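The idea behind --no-mtime can be illustrated with a standalone checksum comparison (sha256sum is used here for illustration; HashBackup's internal checksum may differ):

```shell
# touch updates mtime but not data, so a content checksum is unchanged.
f=$(mktemp)
printf 'hello\n' > "$f"
sum1=$(sha256sum "$f" | cut -d' ' -f1)
touch "$f"                           # mtime changes, contents do not
sum2=$(sha256sum "$f" | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "data unchanged"
```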
-p procs specifies how many additional processes to use for the
backup. Normally HashBackup uses 1 process on single-CPU systems, and
several processes on multi-core systems with more than 1 CPU.
Multiple processes speed up the backup but also put a higher load on
your system. To reduce the performance impact of a backup on your
system, you may want to use
-p0 to force using only 1 CPU. Or, with
a very fast hard drive and many CPU cores, you may want to use more
than just a few processes. On a dual-core system, HashBackup will use
both cores; on a 4-core or more system it will use 4. To use more
than 4 cores,
-p must be used. Backup also displays %CPU
utilization at the end of the backup. If you are using
-p4 and %CPU
is 170%, it means that on average, HashBackup could only make use of
1.7 CPUs because of disk I/O, network, or other factors, so -p4 may
be overkill. Or, it may make sense if you have one huge file that can
take advantage of the extra CPU horsepower, even though most files
cannot. Experimenting is the best guide here.
--sample p specifies a percentage from 1 to 100 of files to sample
for inclusion in the backup. This is a percentage of the files seen,
not a percentage of the amount of data in the files. For example,
--sample 10 means to only backup 10% of the files seen. This can be
useful with simulated backups, especially with huge filesystems, to
try different backup options. Samples are not truly random, but are
"fixed" in the sense that if 5% of files are backed up first, then
10%, the new sample includes the same files as the first 5% plus an
additional 5%. This allows simulating incremental backups. All
pathnames are sampled, whether they come from the command line or
from -F file sources.
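One way such "fixed" sampling can work is hashing each pathname to a stable bucket, so a 5% sample is always a subset of a 10% sample. This is a sketch of the idea, not HashBackup's actual method:

```shell
# A path lands in the 5% sample only if it also lands in the 10% sample,
# because the bucket comes from a stable hash of the pathname.
in_sample() {  # usage: in_sample <pathname> <percent>
    bucket=$(( $(printf '%s' "$1" | cksum | cut -d' ' -f1) % 100 ))
    [ "$bucket" -lt "$2" ]
}
in_sample /etc/hosts 100 && echo "/etc/hosts sampled at 100%"
```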
-V backup splits files into variable-sized blocks unless the
block-size or block-size-ext config options override this. The default
variable block size is automatically chosen when config option
block-size is set to
auto, the default.
-V overrides the
variable block size for a single backup. Values can be 16K 32K 64K
128K 256K 512K or 1M. During backup of a file the block size ranges
from half to 4x this block size, with the average block size being
about 1.5 times larger than specified. Larger block sizes dedup less
but have less block tracking overhead in the backup database, leading
to faster restore planning, faster
retain operation, and a
smaller backup database. If most of the files being saved are on the
large side, a larger block size is recommended. If the block size is
changed on an existing backup, changed files will be saved with the
new block size and will not dedup against previously saved versions
because they have a different block size. If a file continues to
change, it will dedup against the versions saved with the same block
size.
-X backup saves all paths listed on the command line, even if they
are on different filesystems. But while saving / (root) for example,
backup will not descend into other mounted filesystems unless -X is
used. Be careful: with -X it's easy to backup an NFS server or
external drive unintentionally.
-Z level sets the compression level. The compression level is 0 to
6, where 0 means no compression, 6 means highest compression, and 7-9
are reserved. The default is 6, the highest compression, so use a
lower level for slightly faster but less compression. Compression can be disabled
by file extension with the
no-compress-ext config variable.
Disabling compression with
-Z0 usually increases backup space, disk
writes, and backup transmission time, and is almost always slower
than letting HashBackup compress the data.
Raw Block Device, Partition, and Logical Volume Backups
If a block device name is used on the backup command line, all data stored on the block device will be saved. The mtime and ctime fields cannot be used to detect changes in a block device, so it is saved on every backup. The block device should either be unmounted or mounted read-only before taking the backup, or, with logical volumes, a snapshot can be taken and then the snapshot backed up. To maximize dedup, a block size of 4K is good for most block device backups. For very large devices, a block size of 64K or 1M will run faster, but may yield a bigger backup. You have to experiment with your data to decide. When saving multiple block devices to the same backup directory, they will only dedup well against each other if the backup block size matches the filesystem allocation unit size - usually 4K. This is similar to the situation with VM image backups; read below for more information.
On Linux, logical volumes are given symbolic links that point to the raw device. When a symbolic link is listed on the backup command line, its target (the raw device) will also be saved.
Named Pipes / Fifo Backups
If a named pipe or fifo is used on the backup command line, the fifo is opened and all data read from it is backed up. For example, this can be used for database dumps: instead of dumping to a huge text file, then backing up the text file, you can direct the dump output to a fifo and let HB back up the dump directly, without needing the intermediate text file. Here are the commands to do a fifo backup:
$ mkfifo myfifo
$ cat somefile >myfifo &
$ hb backup -c backupdir myfifo
Be very careful not to start more than one process writing to the same fifo! If this happens, the data from the two processes is intermixed at random and HB cannot tell this is happening. The resulting backup is unusable.
On Mac OSX and BSD, fifo backups are about half the speed of backups from a regular file because the fifo buffers are small. With Linux, fifo backups can be faster than regular file backups.
It is often difficult to decide the best way to back up your specific
set of data. You may have millions of small files, very large files,
VM disk images, photos, large mail directories, large databases, or
many other types of data. Different block sizes (
-B option), arc
file sizes (
arc-size-limit config keyword), dedup table sizes (the dedup-mem
config keyword), and arc file packing (the
pack-… config keywords) will all
affect the size of your backup. Your retention policy will also be a
factor. Determining all of these variables can be difficult.
To make configuring your backup easier, HB has simulated backups: the
backup works as usual, but no backup data is written. The metadata
about the backup - file names, blocks, arc files, etc. - is still all
tracked. You can do daily incremental backups, remove files from the
backup, and perform file retention, just like a normal backup, while
using very little disk space. The
hb stats command is then used to
see how much space a real backup would use. You may want to try one
configuration for a week or two, put that aside, try a different
configuration for a while, and then compare the
hb stats output for
the two backups to see which works better for your data.
To create a simulated backup, use:
$ hb config -c backupdir simulated-backup True
before your first backup. Then use normal HB commands as if this were
a real backup and use the
stats command to see the results. Once
this option has been set and a backup created with it, it cannot be
removed. It also cannot be set on an existing backup.
Some commands like
mount will fail with a simulated backup
since there is no real backup data. A useful option for very large
simulated backups is
--sample, to select only a portion of a
filesystem for the simulation.
A sparse file is a somewhat unusual file where disk space is not fully
allocated. These are often used for virtual disk drive images and
sometimes called "thin provisioning". An
ls -ls command shows a
number on the left, the space actually allocated to the file in
K-bytes, and the size of the file in bytes, which may be much larger.
When supported by the OS, HB will backup a sparse file efficiently by
skipping the unallocated file space, also called "holes", rather than
backing up a long string of zero bytes. If the OS or filesystem does
not support sparse file hole skipping, the file will be saved
normally. In either case, a sparse file is always restored as a
sparse file with holes. Any
-B block size can be used (or none, and
HB will pick a block size).
Sparse files are supported on Linux and BSD; Mac OS does not support sparse files. Here is an example of creating a 10GB sparse file, backing it up, and restoring it:
$ echo abc|dd of=sparsefile bs=1M seek=10000
0+1 records in
0+1 records out
4 bytes (4 B) copied, 0.000209 seconds, 19.1 kB/s

$ hb backup -c hb -D1g -B4k sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Copied HB program to /home/jim/hb/hb#1463
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/home/jim/sparsefile
Time: 1.0s
Checked: 5 paths, 10485760004 bytes, 10 GB
Saved: 5 paths, 4 bytes, 4 B
Excluded: 0
Sparse: 10485760000, 10 GB
Dupbytes: 0
Space: 64 B, 139 KB total
No errors

$ mv sparsefile sparsefile.bak

$ time hb get -c hb `pwd`/sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Most recent backup version: 0
Restoring most recent version
Restoring sparsefile to /home/jim
/home/jim/sparsefile
Restored /home/jim/sparsefile to /home/jim/sparsefile
No errors

real 0m0.805s
user 0m0.150s
sys 0m0.100s

$ ls -ls sparsefile*
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile.bak
Virtual Machine Backups
There are 2 ways to backup a VM: run HashBackup inside the VM as an ordinary backup, or run HashBackup from the VM host machine. Running HB inside the VM is just like any other backup and doesn’t have special considerations. VM images can be backed up while running HashBackup on the VM host, ie, outside the VM. What you are saving is a large file or collection of large files, often sparse, that together are the data for the virtual machine. This has 2 special considerations: consistent backups, and choosing a block size.
Consistent VM Image Backups
If a VM image is backed up while the VM is running, the disk image files being saved may change during the backup. For non-critical VMs that don't have a lot of activity, this isn't a big deal and is probably fine. If you restore a VM image saved this way, it is strongly recommended to run a forced fsck to ensure the filesystem is okay.
To get a consistent VM backup, you can: a) suspend the VM during the backup; b) take a snapshot with the VM software, or c) take a snapshot on the VM host filesystem, for example, an LVM snapshot, and backup the snapshot. With method b), you would need to revert to the snapshot after restoring the VM. Using method b), VM snapshots, has the additional advantage that you can run with this snapshot active for a week or two, and backups will be much faster because only the snapshot has to be scanned; the main VM image file will not change. Every week or so the snapshot(s) can be deleted, reincorporating them back into the main VM image, and then the entire VM image will be scanned on the next backup.
Block Size for VM Image Backups
HashBackup normally uses a variable block size that depends on the data being saved. For files that are updated in fixed-size blocks, like VM images, a fixed block size usually gives a smaller backup and faster backup performance.
Smaller block sizes yield higher dedup, though they also cause more
overhead because HB has to track more blocks. If HashBackup
recognizes the file suffix as a VM image,
it will switch to a 4K block size. For higher performance,
especially with very large VM images, a larger block size
could be used, with a "middle of the road" block size between these extremes.
Important note: if you expect to dedup across several VM images
saved into the same backup directory, the most effective way to do
that is with
-B4K. The reason is that different VMs will likely
have different filesystem layouts, even if they have the same
contents. Since most filesystems allocate space in 4K chunks, the
best way to dedup across VMs is to use the same 4K block size.
When choosing a block size, here are some things to keep in mind:
How big is the VM? For small VMs, the block size won’t make much difference, so don’t worry about it.
How much dedup do you expect / want? A smaller block size will dedup better, a larger block size will save and restore faster.
Do you expect dedup across VM images? If so, you probably have to use -B4K.
If you only want to dedup a VM image against itself, from one incremental backup to the next, you can use any block size. Larger block sizes will create a larger backup because of less dedup but will run faster and restore faster. Smaller block sizes will probably create a smaller backup but take longer to backup and restore.
If you want to dedup a VM image backup against itself within one backup rather than between incrementals, you will need to use
-B4K. Example: a VM image contains two identical 1 GB files. HashBackup will only reliably dedup these files with
-B4Kbecause the individual blocks of the two files could be scattered throughout the VM image, depending on how the filesystem inside the VM decides to allocate space.
How much data changes between backups? For a very active VM, a smaller block size will create smaller backups but a larger block size will run faster. For an inactive VM, not much data will be saved anyway, so don’t worry about the block size.
Do you need fast restores? A smaller block size will take longer because there will be more seeks during the restore; a larger block size will restore faster. It’s hard to quantify, so do test backups and restores to compare restore times.
These trade-offs are very data dependent. To see what works best for your VM images, it’s recommended to start with simulated backups (see simulated-backup config option). You can run simulated backups for a few days with one block size, then create a new simulated backup for a few days with a different block size, and compare how long the backups run vs how much backup space is used. Use
hb stats to check backup space used by a simulated backup.
Example Backup #1
Backup of a CentOS
/usr directory; CentOS is running in a Mac OSX VM.
First create the backup directory:
[root@cent5564 /]# hb init -c /hbbackup
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Permissions set for owner access only
Created key file /hbbackup/key.conf
Key file set to read-only
Setting include/exclude defaults: /hbbackup/inex.conf

VERY IMPORTANT: your backup is encrypted and can only be accessed
with the encryption key, stored in the file:
    /hbbackup/key.conf
You MUST make copies of this file and store them in a secure
location, separate from your computer and backup data. If your hard
drive fails, you will need this key to restore your files. If you
setup any remote destinations in dest.conf, that file should be
copied too.

Backup directory initialized
Drop the filesystem cache and do the initial (full) backup. This backup is slower than normal because this VM is only configured for 1 core.
[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches
[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
Copied HB program to /hbbackup/hb#1605
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr
Time: 190.1s, 3m 10s
CPU: 125.5s, 2m 5s, 66%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 72311 paths, 1227204836 bytes, 1.2 GB
Excluded: 0
Dupbytes: 175353351, 175 MB, 14%
Compression: 67%, 3.1:1
Efficiency: 6.26 MB reduced/cpusec
Space: 402 MB, 402 MB total
No errors

real 3m10.617s
user 1m13.299s
sys 0m53.212s
Drop the filesystem cache again to time an incremental backup:
[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches
[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
This is backup version: 1
Dedup enabled, 21% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr
Time: 22.1s
CPU: 19.1s, 86%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 2 paths, 0 bytes, 0
Excluded: 0
No errors

real 0m22.603s
user 0m5.949s
sys 0m13.608s
Example 2: Backup & Restore Mac Mini Server
Backup the Intel i7 Mac Mini Server root drive (643,584 files in 89GB) to the other drive. Both drives are slower 5600 rpm hard drives. During this backup, HashBackup used 267MB of RAM.
sh-3.2# purge; /usr/bin/time -l hb backup -c /hbtest -v1 -D1g /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /
Mount point contents skipped: /dev
Mount point contents skipped: /home
Mount point contents skipped: /net
Backing up: /hbtest/inex.conf
Time: 2323.1s, 38m 43s
CPU: 2004.4s, 33m 24s, 86%
Checked: 643631 paths, 89316800581 bytes, 89 GB
Saved: 643584 paths, 89238502003 bytes, 89 GB
Excluded: 47
Dupbytes: 21638127507, 21 GB, 24%
Compression: 36%, 1.6:1
Efficiency: 15.32 MB reduced/cpusec
Space: 57 GB, 57 GB total
No errors
     2325.97 real      1746.39 user       258.65 sys
 267079680  maximum resident set size
   3804171  page reclaims
      1019  page faults
         0  swaps
     88820  block input operations
     37261  block output operations
    823626  voluntary context switches
   1314791  involuntary context switches
Restore the backup - 28 minutes.
sh-3.2# purge; /usr/bin/time -l hb get -c /hbtest -v1 /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
Most recent backup version: 0
Restoring most recent version
Begin restore
Restoring / to /Users/jim/test
Restore? yes
Restored / to /Users/jim/test
0 errors
     1672.70 real      1333.01 user       190.00 sys
 276615168  maximum resident set size
   1388049  page reclaims
       964  page faults
         0  swaps
      2732  block input operations
     53260  block output operations
     35939  voluntary context switches
   3865449  involuntary context switches
Example 3: Incremental backup of Mac Mini with 3600 daily backups over 11 years
This is the daily backup of the HashBackup development server running on a 2010 Mac Mini (Intel Core 2 Duo 2.66GHz) with 2 SSD drives. There are 12 virtual machines hosted on this server so it averages about 80% idle. The backup of the main drive is written to the other SSD, copied to an external USB 2.0 drive, and copied to Backblaze B2 over a very slow 1Mbit/s (128KB/s) Internet connection.
HashBackup #2527 Copyright 2009-2021 HashBackup, LLC
Backup directory: /hbbackup
Backup start: 2021-11-26 02:00:02
Using destinations in dest.conf
This is backup version: 3609
Dedup enabled, 60% of current size, 12% of max size
/Library/Caches/com.apple.DiagnosticReporting.Networks.plist
/Library/Caches
...
/private/var
/private
/
Copied arc.3609.0 to usbcopy (9.5 MB 1s 8.9 MB/s)
Waiting for destinations: b2
Copied arc.3609.0 to b2 (9.5 MB 1m 7s 141 KB/s)
Checking database before upload
Writing hb.db.6321
Copied hb.db.6321 to usbcopy (178 MB 5s 32 MB/s)
Waiting for destinations: b2
Copied hb.db.6321 to b2 (178 MB 19m 21s 153 KB/s)
Copied dest.db to usbcopy (9.6 MB 0s 10 MB/s)
Waiting for destinations: b2
Copied dest.db to b2 (9.6 MB 1m 5s 147 KB/s)
Removed hb.db.6309 from b2
Removed hb.db.6310 from b2
Removed hb.db.6311 from b2
Removed hb.db.6309 from usbcopy
Removed hb.db.6310 from usbcopy
Removed hb.db.6311 from usbcopy
Time: 826.2s, 13m 46s
CPU: 963.9s, 16m 3s, 116%
Wait: 1561.9s, 26m 1s
Mem: 394 MB
Checked: 564451 paths, 79881638243 bytes, 79 GB
Saved: 155 paths, 23765297154 bytes, 23 GB
Excluded: 39
Dupbytes: 23663773091, 23 GB, 99%
Compression: 99%, 2480.9:1
Efficiency: 23.50 MB reduced/cpusec
Space: +9.5 MB, 57 GB total
No errors
Exit 0: Success
There are many VM disk images stored on this server, and a tiny change in these disk images means the whole disk image has to be backed up. In this backup, out of 79GB total, files totalling 23GB were changed. But because the amount of actual data changed was small, HashBackup’s dedup feature compressed a 23GB backup down to 9.5MB of new backup data. The backup was also copied to a USB drive for redundancy. A full-drive restore from this backup is shown on the Get page.
Example 4: Full and incremental backup of 2GB Ubuntu VM image
First backup a 2GB Ubuntu test VM image with a 64K block size. This is
on a 2010 MacBook Pro, Core 2 Duo 2.5GHz, SSD system. The peak HB
memory usage during this backup was 77MB. The purge command clears
all cached disk data from memory.
$ purge; hb backup -c hb -B64K -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile
Time: 36.9s
CPU: 61.6s, 1m 1s, 166%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 95289344, 95 MB, 4%
Compression: 56%, 2.3:1
Efficiency: 17.63 MB reduced/cpusec
Space: 873 MB, 873 MB total
No errors
Now touch the VM image to force an incremental backup. No data actually changed, but HashBackup still has to scan the file for changes. Peak memory usage during this backup was 77MB.
$ touch ../testfile; purge; hb backup -c hb -D1g -B64K ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile
Time: 15.1s
CPU: 18.4s, 121%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 49119.0:1
Efficiency: 104.30 MB reduced/cpusec
Space: 40 KB, 873 MB total
No errors
Try the same backup and incremental, this time with a variable block size. This gives 2x more dedup and a smaller backup, but 5% higher runtime. Peak memory usage again was around 78MB.
$ purge; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile
Time: 40.9s
CPU: 66.7s, 1m 6s, 163%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 162587404, 162 MB, 8%
Compression: 57%, 2.4:1
Efficiency: 16.56 MB reduced/cpusec
Space: 853 MB, 853 MB total
No errors
Now an incremental backup with variable block size. This shows that for VM images, variable-block incremental backup is slower than fixed-block backup because data has to be scanned byte-by-byte.
$ purge; touch ../testfile; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 16% of current, 0% of max
/Users/jim/testfile
Time: 21.4s
CPU: 24.8s, 115%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 40932.5:1
Efficiency: 77.36 MB reduced/cpusec
Space: 49 KB, 853 MB total
No errors
Let’s compare how fast
dd can read the test VM image vs HashBackup,
both with a block size of 64K. The results below show
dd reads at
233 MB/sec on this system, HashBackup reads and compares to the
previous backup at 133 MB/sec (if no data changes). The difference
between the full backup time where all data is saved, and the
incremental time where no data is saved, gives an idea of the time
HashBackup needs to backup changed data. Based on this, HashBackup
can backup 1% of changed data for this file in about 0.218 seconds, or
2.18 seconds if 10% of data changed. This is added to the incremental
time with no changes to estimate the backup time with changed data.
You can use similar formulas to estimate backup times of large VM
images.
IMPORTANT: these tests are with a 2010 Intel 2.5GHz Core2Duo CPU. More recent CPUs will have faster backup times.
$ purge; dd if=../testfile of=/dev/null bs=64k
30699+1 records in
30699+1 records out
2011914240 bytes transferred in 8.638740 secs (232894413 bytes/sec)

# compute how fast HashBackup did an incremental backup and
# estimate how long it takes to backup 1% changed data
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
2011914240/15.1
133239353.642    <== HB incremental backup rate was 133.23 MB/sec for unchanged VM image
(36.9-15.1)/100
.218             <== approx seconds to backup 1% changed data in the test VM image
Do an incremental backup with 10% of the 2GB VM image data changed at random, to verify the expected runtime of about 17.3 seconds (the 15.1s unchanged incremental time plus 2.18s for the 10% changed data):
$ purge; time hb backup -c hb -D1g -B64K ../testfile
Backup directory: /Users/jim/hbrel/hb
Group test & size: 262144 1048576
This is backup version: 2
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile
Time: 17.2s                        <== runtime is a bit lower than expected - yay!
CPU: 23.1s, 134%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 1816027136, 1.8 GB, 90%  <== shows 10% changed data
Compression: 95%, 22.2:1
Efficiency: 79.42 MB reduced/cpusec
Space: 90 MB, 964 MB total
No errors
NetApp Snapshot Backups
NetApp Filers have a built-in snapshot capability. Snapshots can be
taken automatically with generated pathnames, or manually with a
specified pathname. An example of an automatic name would be
/nfs/snapshot.1, etc., with the higher number
being the most recent snapshot. Saving the highest snapshot is a
problem for HashBackup because the pathname changes on every backup,
causing unreasonable metadata growth in the
hb.db database file.
To make efficient backups, use a bind mount on Linux or nullfs mount on FreeBSD to make the snapshot directory appear at a fixed name:
$ mount --bind /nfs/snapshot.N /nfstree (Linux)
$ mount_nullfs /nfs/snapshot.N /nfstree (FreeBSD)
Then backup /nfstree with HashBackup.
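The nightly sequence might be sketched like this. This is only a sketch under assumptions: /nfs, /nfstree, /hbbackup, and the snapshot.N naming are hypothetical, site-specific paths.

```shell
#!/bin/sh
# Sketch: backup the newest NetApp snapshot through a fixed mount point
# so pathnames in hb.db never change. All paths here are assumptions.
SNAPDIR=${SNAPDIR:-/nfs}
FIXED=${FIXED:-/nfstree}

# pick the snapshot with the highest numeric suffix (the most recent one)
latest_snapshot() {
    ls -d "$1"/snapshot.* | sort -t. -k2 -n | tail -1
}

if [ -d "$SNAPDIR" ] && [ -d "$FIXED" ]; then
    latest=$(latest_snapshot "$SNAPDIR")
    mount --bind "$latest" "$FIXED"   # Linux; FreeBSD: mount_nullfs "$latest" "$FIXED"
    hb backup -c /hbbackup "$FIXED"
    umount "$FIXED"
fi
```

Because the backup always sees the fixed mount point, each run is a normal incremental backup rather than a full backup of a brand-new pathname.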
Dirvish Backups
Dirvish is an open-source backup tool that uses rsync to create
hard-linked backup trees. Each tree is a complete snapshot of a
backup source, but unchanged files are actually hard links. This
saves disk space since file data is not replicated in each tree. The
problem is that these trees often include a timestamp in the pathname.
If the trees are backed up directly, every pathname is unique. This
causes unreasonable metadata growth in the
hb.db file, which leads to
excessive RAM usage.
For more efficient backups, mount the tree directory (the directory containing user files) to a fixed pathname with a Linux bind mount or FreeBSD nullfs mount. See the NetApp section above for details.
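For example, assuming the Dirvish vault keeps each image under a timestamped directory with the user files in a tree subdirectory (this layout and every path below are assumptions, not part of HashBackup), a sketch:

```shell
#!/bin/sh
# Sketch: bind-mount the newest Dirvish image tree at a fixed pathname
# so HashBackup always sees the same paths. /backup/vault, the
# <timestamp>/tree layout, and /dirvishtree are assumptions.
VAULT=${VAULT:-/backup/vault}
FIXED=${FIXED:-/dirvishtree}

# timestamped image directories sort lexically, so the last is newest
newest_tree() {
    ls -d "$1"/*/tree | sort | tail -1
}

if [ -d "$VAULT" ] && [ -d "$FIXED" ]; then
    mount --bind "$(newest_tree "$VAULT")" "$FIXED"
    hb backup -c /hbbackup "$FIXED"
    umount "$FIXED"
fi
```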
A second issue with hard-link backups is that HashBackup maintains an
in-memory data structure for hard link inodes so that it can properly
relate linked files. This data structure is not very memory efficient
and uses ~165 bytes for each inode that is a hard link. This can be
disabled with the
--no-ino option, though that disables all hard-link processing.
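At ~165 bytes per hard-linked inode, the memory cost is easy to estimate; the 2-million-inode figure below is just an illustration:

```shell
# Rough memory estimate for HashBackup's hard-link tracking,
# at ~165 bytes per hard-linked inode (example count is hypothetical).
hardlink_mem_mb() {
    echo $(( $1 * 165 / 1024 / 1024 ))
}
echo "$(hardlink_mem_mb 2000000) MB"   # → 314 MB for 2 million hard links
```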
Rsnapshot Backups
Rsnapshot is an rsync-based backup tool that creates hard-linked
backup trees. A big difference is that with rsnapshot, the most
recent backup is contained in the
.0 directory, i.e., the most recent backup is at a fixed mount
point. HashBackup can be used to backup the .0 directory every
night. You should avoid backing up the other
rsnapshot directories. If you need to save the other directories:
- rename the rsnapshot directory you want to backup to a fixed pathname
- backup that fixed pathname
- rename the directory back to its original name
- repeat these steps for every rsnapshot directory you want to backup
- when finished, rename and backup the .0 directory the same way
- going forward, only save the .0 directory
- instead of renaming you can use the bind mount trick (see the NetApp section)
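The rename steps above could be scripted roughly like this. It is only a sketch: /snapshots, the daily.N interval names, and the fixed name "current" are assumptions to substitute with your own layout.

```shell
#!/bin/sh
# Sketch of the rename approach for saving older rsnapshot trees.
# /snapshots, the daily.N names, and the fixed name "current" are
# assumptions - substitute your own rsnapshot layout.
SNAPROOT=${SNAPROOT:-/snapshots}
HB=${HB:-hb}                     # path to the hb program

save_tree() {   # give a tree the fixed pathname, back it up, rename it back
    mv "$SNAPROOT/$1" "$SNAPROOT/current"
    $HB backup -c /hbbackup --no-ino "$SNAPROOT/current"
    mv "$SNAPROOT/current" "$SNAPROOT/$1"
}

if [ -d "$SNAPROOT/daily.1" ]; then
    for dir in daily.3 daily.2 daily.1; do   # oldest first; names assumed
        save_tree "$dir"
    done
fi
```

Because every tree is backed up under the same fixed pathname, HashBackup treats each run as an incremental backup instead of creating new metadata for every timestamped tree.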
Since rsnapshot backups are hard-link trees, nearly every file is a
hard link and you may want to consider using
--no-ino to lower
memory usage. This disables all hard-link processing in HashBackup.
Using HashBackup for Archiving
HashBackup is designed as a backup solution, where a collection of files is saved on a rather frequent basis: daily, monthly, etc. An archive is a special kind of backup that is more of a one-time event, for example, saving a hard drive or server that is going to be decommissioned. It could be 5 years (or never!) before the archive is accessed again.
HashBackup uses a database to store information about a backup. The format of this database is periodically changed as new features are added, and any structural changes are automatically applied as necessary when a new release of HB accesses a backup. HB can automatically update a backup's database as long as the backup has been accessed with a current version of HB within the last year or so. If it has been more than a year, the latest version of HB may or may not be able to automatically update the backup's database; an older version may need to be used first to partially upgrade the database, then a later version used to fully upgrade it.
One easy way to keep an archive’s database up to date is to create a
cron job to backup a very small file every month. This file doesn’t
have to be related in any way to the original archive - a dummy file
in /tmp is fine. Also include an
hb upgrade in the cron job.
Together these will ensure that your copy of the HB program stays
up-to-date and that your archive's database is automatically upgraded.
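A hedged sketch of such a monthly job, meant to be run from cron; /hbarchive, the dummy file path, and the hb location are assumptions, not from the HB documentation:

```shell
#!/bin/sh
# Monthly archive "keep-alive" sketch: backup a tiny dummy file so the
# archive's database stays upgraded, and run hb upgrade so the program
# stays current. /hbarchive and /tmp/hb-dummy are assumed paths; run
# from a monthly crontab entry, e.g. "0 3 1 * * /usr/local/bin/keepalive.sh".
HB=${HB:-/usr/local/bin/hb}
DUMMY=${DUMMY:-/tmp/hb-dummy}

keepalive() {
    "$HB" upgrade                       # keep the HB program up to date
    date > "$DUMMY"                     # guarantee the dummy file changed
    "$HB" backup -c /hbarchive "$DUMMY"
}

if [ -x "$HB" ]; then
    keepalive
fi
```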
Another way to handle archives is to always run the version of HB contained within the backup directory.
Syncing a New Destination
When a new destination is added to
dest.conf, HashBackup will
automatically sync existing backup data to the new destination. This
can be somewhat difficult when it is going to take days or weeks to
get the new destination in sync, either because bandwidth is limited
or the backup is already very large, and it interferes with daily
backups. A script like below can be used until the new destination
gets in sync. To get started, add the new destination to dest.conf
and include the
off keyword in the new destination. Then run this
script daily from crontab, modified for your own site.
IMPORTANT: this method only works if you have a complete copy of the
backup in the local directory. If you have
cache-size-limit set to
anything but -1, this method does not work because remote-to-remote
syncs happen before the backup starts running. For time-limited
remote-to-remote syncs, use this script but you’ll also need to use
either the Unix
timeout command or add a crontab entry with
killall -TERM hb to abort the sync at a certain time.
#!/bin/sh

# disable new destination for regular backup operation
sed -i.bak "s/^#off/off/" /hbbackup/dest.conf

nice /usr/local/bin/hb log backup -c /hbbackup /
nice /usr/local/bin/hb log retain -c /hbbackup -s30d12m -x3m
nice /usr/local/bin/hb log selftest -c /hbbackup -v4 --inc 1d/90d,900MB

# enable new destination to run sync for 20 hours
sed -i.bak "s/^off/#off/" /hbbackup/dest.conf
nice /usr/local/bin/hb log backup -c /hbbackup /hbbackup/inex.conf --maxwait 20h