Performs a backup. Advanced options are described later, but commonly-used options are:
$ hb backup [-c backupdir] [-D mem] [-v level] [-X] [-t tagname] path1 path2 ...
-c the backup is stored in the local backup directory specified with
the -c option, or in a default backup directory (described below) if
-c isn't used. If you don't want a complete local copy of the
backup, set cache-size-limit with the
config command and create a
dest.conf file in the backup directory. See
Destinations for details and examples.
-D enable dedup, saving only one copy of identical data blocks.
-D is followed by the amount of memory you want to dedicate to the dedup operation, for example:
-D1g would use 1 gigabyte of RAM,
-D100m would use 100 MB. The more memory you have available, the more duplicate data HashBackup can detect, though it is still effective even with limited memory. Dedup can also be enabled with the
dedup-mem config option. See Dedup for detailed information and recommendations.
-D0 disables dedup for this backup. Unlike most options,
-D uses a factor of 1024, so
-D100m means 100 * 1024 * 1024 bytes. Dedup is disabled by default since the best memory limit is system-dependent.
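Because -D uses a 1024 multiplier, the byte value behind a given option can be checked with ordinary shell arithmetic (a standalone illustration, not part of HashBackup):

```shell
# -D100m with HashBackup's 1024 multiplier: 100 * 1024 * 1024 bytes
dedup_mem=$((100 * 1024 * 1024))
echo "$dedup_mem"    # prints 104857600
```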
-v controls how much output is displayed; higher numbers show more:
-v0 = print version, backup directory, copy messages, error count
-v1 = print names of skipped filesystems
-v2 = print names of files backed up (this is the default)
-v3 = print names of excluded files
-X backup will not cross mount points (separate filesystems) unless
-X is used, so normally you have to explicitly list on the command line each
filesystem you want to backup. Use the
df command to see a list of
your filesystems. To see what filesystem a directory is on, use df
followed by a pathname, or use
df . to see the current directory's filesystem.
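The standard df utility (not HashBackup-specific) can report the filesystem containing a given path; the mount point is the last column of the second output line:

```shell
# Show the filesystem holding the current directory.
df -P .
# Extract just the mount point from df's output:
mount_point=$(df -P . | awk 'NR==2 {print $NF}')
echo "$mount_point"
```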
-t tagname tag the backup with information-only text that is displayed when listing backup versions.
Each backup creates a new version in the backup directory containing all files modified since the previous backup. An easy way to think of this is a series of incremental backups, stacked one on top of the other. HashBackup presents the illusion that every version is a full backup, while performing the backups at incremental speeds and using incremental backup storage space and bandwidth. While full backups are an option, there is no requirement to do them. HashBackup is designed to not need full backups and you will experience huge savings of time, disk space, and bandwidth compared to traditional backup methods. The
retain command explains how to maintain weekly or monthly snapshots - or both!
Specifying the Backup Directory
All commands accept a
-c option to specify the backup directory.
HashBackup stores your backup data and the encryption key in
this directory. If
-c is not specified on the command line, the environment variable
HASHBACKUP_DIR is checked. If this environment
variable doesn't exist, the directory
~/hashbackup is used if it exists. If
~/hashbackup doesn't exist,
/var/hashbackup is used.
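The lookup order above can be sketched as a small shell function (an illustration of the documented rules, not HashBackup's actual code):

```shell
# Mirror of the documented lookup order for the backup directory.
# $1 is the -c argument, if any.
resolve_backup_dir() {
    if [ -n "$1" ]; then                     # 1. -c on the command line
        echo "$1"
    elif [ -n "$HASHBACKUP_DIR" ]; then      # 2. HASHBACKUP_DIR env var
        echo "$HASHBACKUP_DIR"
    elif [ -d "$HOME/hashbackup" ]; then     # 3. ~/hashbackup, if it exists
        echo "$HOME/hashbackup"
    else                                     # 4. fallback
        echo /var/hashbackup
    fi
}
resolve_backup_dir /mnt/backup    # prints /mnt/backup
```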
Backup directories must first be initialized with the init
command. If you do not want a complete local copy of your backup, set the
cache-size-limit config option with the
config command to
limit the size of the local backup directory, then set up a dest.conf
file to send backup data to a destination.
HashBackup saves all of your files' properties in a database, so backups can be stored on any type of "dumb" storage: FAT, VFAT, CIFS/SMB/Samba, NFS, USB thumb drives, SSHFS, WebDAV, FTP servers, etc. All file attributes, including hard links, ACLs, and extended attributes will be accurately saved and restored, even if not supported by the storage system. An encrypted incremental backup of the backup database is automatically included with every backup.
It’s possible to backup directly to a mounted remote filesystem or USB
drive with the
-c option. However, the backup performance may be
slower than with a local filesystem, and these may not provide robust
sync and lock facilities. Sync facilities are used to ensure that the
backup database doesn’t get corrupted if your computer halts
unexpectedly during the backup process, for example, if there is a
power outage. Lock facilities ensure that two backups don’t run at
the same time, possibly overwriting each others' backup files. It’s
best to specify a local backup directory with
-c and let HashBackup
copy the backup data to mounted remote storage or USB drive by using a
Dir destination in the
dest.conf file in your backup directory.
Remote destinations are optional. They are set up in the dest.conf
file, located in the backup directory. See the
Destinations page for more detailed
info on how to configure destinations for SSH, RSYNC, and FTP servers,
Amazon S3, Google Storage, Backblaze, Gmail, IMAP/email servers, and
simple directory (Dir) destinations. Dir destinations are used to
copy your backup to USB thumb drives, NFS, WebDAV, Samba, and any
other storage that is mounted as a filesystem.
The backup command creates backup archive files in the backup
directory. After arc files fill up, they can be transmitted to one or
more remote destinations, configured with a dest.conf file.
Transmissions run in parallel with the backup process. After the
backup has completed, the backup command waits for all files to
finish transmitting.
By default, archives are kept in the local backup directory after
sending to remote destinations, making restores very fast because data
does not have to be downloaded from a remote site. Keeping a local
copy also adds redundancy and makes it easier to pack remote archives
(remove deleted or expired backup data) because archives don’t have to
be downloaded first. It is possible to delete archives from the local
backup directory as they are sent to destinations by using the
cache-size-limit config option.
If you are storing your backup directly to a USB or other removable
drive with the
-c option, and that’s the only backup copy you want,
you do not need to use the
dest.conf file at all.
The dest.conf file is only used when you want to copy your backup, usually to a remote site.
Important Security Note: your encryption key is stored in the file
key.conf in the backup directory. Usually this is a local
directory, but if you are writing your backup directly to mounted
remote storage with the
-c option, for example, Google Drive,
DropBox, etc., or to a removable USB drive, be sure to set a
passphrase when initializing your backup directory. This is done
with the -p ask option to
hb init or
hb rekey. Both the
passphrase and key are required to access your backup. If you are
writing directly to mounted remote storage that you control, such as
an NFS server, a passphrase may not be necessary. If you are using a
Dir destination to copy your backup to remote storage
or a removable drive, but your backup directory and key are on a
local drive, a passphrase is not necessary because the key is never
copied to destinations.
A handful of system files and directories, such as /sys and
hibernation files, are automatically excluded.
/var/hashbackup and the
-c backup directory itself are
also automatically excluded. An
inex.conf file is created by the
init command in the backup directory showing which files are
excluded by default.
To exclude other files, edit the
inex.conf file in the backup
directory. The format of this file is:
# comment lines
E(xclude) <pathname>

# exclude all .wav files:
Exclude *.wav

# exclude the /Cache directory and its contents:
e /Cache

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyz still works.
e /Cache/

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyzdir will save the directory itself
# since it was explicitly requested, but will not save xyzdir's
# contents.
e /Cache/*
Any abbreviation of the exclude keyword at the beginning of the line
is recognized. Tilde characters
~ are not expanded into user home directories.
There are several other ways to exclude files from the backup:
no-backup-tag can be set to a list of filenames, separated by commas. If a directory contains any of these files, only the directory itself and the tag files are saved. HB does not read or check the contents of tag files.
files with the nodump flag set are not backed up (but on Linux, this also requires setting a config option)
no-backup-ext can be set to a list of comma-separated extensions, with or without dots. Any file ending in one of the listed extensions is not backed up. This test is done without regard to uppercase or lowercase, so
avi will exclude both file.avi and file.AVI
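The case-insensitive test can be sketched in shell (an illustration only; HashBackup's real matching is internal, and the filenames are hypothetical):

```shell
# Sketch of a case-insensitive extension test like no-backup-ext avi:
# lowercase the name, then match the extension.
excluded_by_ext() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *.avi) return 0 ;;   # would be excluded from the backup
        *)     return 1 ;;   # would be backed up
    esac
}
excluded_by_ext movie.AVI && echo "movie.AVI excluded"
excluded_by_ext notes.txt || echo "notes.txt backed up"
```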
Web browser caches are difficult to backup for a number of reasons (see Browser Caches note), and should always be excluded from backups, especially on large servers that store browser caches for many users.
The backup command has a few more options that may be useful:
-B blocksize the backup command splits files into blocks of data.
It chooses a "good" block size based on the type of file being saved
and by default uses variable-sized blocks. Variable block sizes work
well for files that change in unpredictable ways, like a text file.
For files that change in fixed-size units - raw database files (not
text dumps!), VM images, etc. - you may want to force a specific,
fixed block size. Fixed-size blocks can be anywhere from 128 bytes to
2GB-1 (2,147,483,647) bytes.
-B forces a fixed block size, which usually dedups less than a
variable block size except for cases where a file is updated in
fixed chunks; then it is usually more efficient than variable blocks.
Larger block sizes are more efficient in general but dedup less; small
block sizes cause more overhead but dedup better. Sometimes the extra
dedup savings of a small block size is less than the extra metadata
needed to track more blocks. Since this is very data dependent,
experiment with your actual data to determine the best block size to use.
-B1m usually speeds up backups on a single CPU system because
it disables variable block sizes.
-B4M or higher is useful when
files do not have much common data, for example media files. If there
are identical files, backup will only save 1 copy. A huge block size
like 64M may be useful for large scientific data files that will have
little duplicate data. Using a large block size does not make small
files use more backup space. Using simulated backups (described
below) with different block sizes is very helpful for determining the
best block size. Unlike most options,
-B uses a multiplier of 1024, so
-B4K means 4096 bytes rather than 4000 bytes.
IMPORTANT: very large block sizes will use a lot of RAM. A good
estimate is 8x block size with a couple of CPUs, or more if there are
more CPUs. Using
-p0 disables multi-threading and will use less RAM
and CPU but will also be slower.
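The rule of thumb above can be made concrete with shell arithmetic (a rough illustration of the stated 8x estimate, not a measured figure):

```shell
# Estimated RAM for -B64M with a couple of CPUs: about 8x the block size.
blocksize=$((64 * 1024 * 1024))      # -B uses a 1024 multiplier
ram_estimate=$((blocksize * 8))
echo "$ram_estimate"                 # prints 536870912, about 512 MB
```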
-F filesources read pathnames to be saved from file sources. More
than one file source can be used. This is often used with very large
file systems to avoid having to scan for modified files by having the
processes that modify files update a file list. Each file source is
either:
- the pathname of a file containing pathnames to backup, one per line. Blank lines and lines beginning with # (comments) are ignored. Pathnames in the file can be either individual files or whole directories to save. If a pathname doesn't exist in the filesystem but is present in the backup, it is marked deleted in the backup.
- the pathname of a directory, where each file in the directory contains a list of pathnames to backup as above.
The backup command line can still have pathnames to backup in addition
to -F, but these must come before the -F options.
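A file source might look like this (the pathnames are hypothetical, and the final hb command is shown for context only):

```shell
# Build a hypothetical file source; blank lines and # comments are ignored.
cat > /tmp/filelist <<'EOF'
# list maintained by the process that modifies these files
/home/jim/projects

/home/jim/notes.txt
EOF
# Count the pathnames a backup would read from it:
grep -cv -e '^#' -e '^$' /tmp/filelist    # prints 2
# Then: hb backup -c backupdir -F /tmp/filelist
```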
--full the backup command usually backs up only modified files and
determines this by looking at mtime, ctime, and other file attributes
for each file. This option overrides that behavior and causes every
file to be saved, even if it has not changed. Taking a full backup
adds redundancy to your backup data. Another way to accomplish this
is to create a new backup directory on both local and remote storage,
but this is harder to manage and retain will not work across multiple
backup directories. Full backups also limit the number of incremental
archives required to do large restores, though in testing, restoring a
complete OSX system from 3 years of daily incrementals took only 15%
longer than restoring from a full backup.
-m maxfilesize skip files larger than
maxfilesize. The size can
be specified as 100m or 100mb for 100 megabytes, 3g or 3gb for 3
gigabytes, etc. This limit does not apply to fifo or block device
backups because they always have a size of zero.
--maxtime time specifies the maximum time to spend actually saving
files. When this time is exceeded, the backup stops and waits for
uploads to finish using
--maxwait, which is adjusted based on how
long the backup took.
--maxwait time when used by itself, specifies the maximum time to
wait for files to be transmitted to remote destinations. When used
with --maxtime, the wait time is reduced by the time taken to create
the backup.
These two options are useful for huge initial backups, which take much
longer than incremental backups. They allow huge backups to span many
days without going outside the time reserved for backups. They also
prevent incrementals from running into production time when a large
amount of data changes for some reason. If a backup does not complete
within --maxtime, the next backup using the same command line
will restart where the previous backup left off, without rescanning
the old data. This allows backups of huge filesystems with tens of
millions of files over a period of time. Be aware that these time
limits are not very accurate for various technical reasons. Large arc
files, large files, slow upload rates, and many worker threads cause
more variability in the timing. It’s a good idea to set the time
limits an hour less than you really want to get a feel for how much
variability occurs in your backup environment.
--maxtime 1h means backup for up to 1 hour then wait the remainder
of the hour to upload the data, ie, total time is limited to 1 hour
--maxwait 1h means backup everything requested, however long it
takes, but only wait 1 hour for uploads to finish
--maxtime 1h --maxwait 1h means backup for up to 1 hour, then wait 1
hour + the remaining backup time for uploads to finish, ie, the total
time is limited to 2 hours
--maxtime 1h --maxwait 1y means backup for 1 hour, then wait a year
for uploads to finish, ie, only the backup time is limited
--no-ino on Unix filesystems, every file has a unique inode number.
If two paths have the same inode number, they are hard linked and
both paths refer to the same file data. Some filesystems do not have
real inode numbers: FUSE (sshfs, UnRaid) and Samba/SMB/CIFS for
example. Instead, they invent inode numbers and cache them for a
while, sometimes for many days. HashBackup normally verifies a path’s
inode number during backups, and if it changes, it triggers a backup
of the path. Inode numbers are also used to detect hard links. If a
filesystem does not have real inode numbers, it causes unexpected and
unpredictable full backups of files that haven’t changed and after
enough of these mistakes, HashBackup will stop with an error. The
--no-ino option prevents these unnecessary full backups.
A negative side-effect of
--no-ino is that hard-link detection is
disabled since it relies on stable inode numbers. With --no-ino,
hard-linked files are backed up like regular files and if restored,
are not hard-linked.
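The inode behavior behind hard-link detection can be seen directly with stat (GNU coreutils assumed; the filenames are just examples):

```shell
# Hard links share a single inode number; --no-ino gives up this detection.
tmp=$(mktemp -d)
touch "$tmp/orig"
ln "$tmp/orig" "$tmp/link"      # create a hard link to the same file data
stat -c %i "$tmp/orig"
stat -c %i "$tmp/link"          # same inode number as orig
```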
--no-mtime there are two timestamps on every Unix file: mtime is the
last time a file’s data was modified, and ctime is the last time a
file’s attributes were modified (permissions, etc). When a file’s
data is changed, both mtime and ctime are updated. If only the
attributes are changed, only ctime is updated. Because mtime can be
set by user programs, some administrators may not trust it to indicate
whether a file’s data has changed. The
--no-mtime option tells the
backup command to verify whether file contents have changed by
computing a strong checksum and checking that with the previous backup
rather than trusting mtime. This is not recommended for normal use
since the backup program has to read every unmodified file’s data to
compute the checksum, then possibly read it again to backup the file.
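The idea behind --no-mtime can be illustrated with a standalone checksum comparison (sha256sum is used here for illustration; HashBackup's internal checksum may differ):

```shell
# touch updates mtime but not data, so a content checksum is unchanged.
f=$(mktemp)
printf 'hello\n' > "$f"
sum1=$(sha256sum "$f" | cut -d' ' -f1)
touch "$f"                           # mtime changes, contents do not
sum2=$(sha256sum "$f" | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "data unchanged"
```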
-p procs specifies how many additional processes to use for the
backup. Normally HashBackup uses 1 process on single-CPU systems, and
several processes on multi-core systems with more than 1 CPU.
Multiple processes speed up the backup but also put a higher load on
your system. To reduce the performance impact of a backup on your
system, you may want to use
-p0 to force using only 1 CPU. Or, with
a very fast hard drive and many CPU cores, you may want to use more
than just a few processes. On a dual-core system, HashBackup will use
both cores; on a 4-core or more system it will use 4. To use more
than 4 cores,
-p must be used. Backup also displays %CPU
utilization at the end of the backup. If you are using
-p4 and %CPU
is 170%, it means that on average, HashBackup could only make use of
1.7 CPUs because of disk I/O, network, or other factors, so -p4 may
be overkill. Or, it may make sense if you have one huge file that can
take advantage of the extra CPU horsepower, even though most files
cannot. Experimenting is the best guide here.
--sample p specifies a percentage from 1 to 100 of files to sample
for inclusion in the backup. This is a percentage of the files seen,
not a percentage of the amount of data in the files. For example,
--sample 10 means to only backup 10% of the files seen. This can be
useful with simulated backups, especially with huge filesystems, to
try different backup options. Samples are not truly random, but are
"fixed" in the sense that if 5% of files are backed up first, then
10%, the new sample includes the same files as the first 5% plus an
additional 5%. This allows simulating incremental backups. All
pathnames are sampled, whether they come from the command line or
from -F file sources.
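One way such "fixed" sampling can work is hashing each pathname to a stable bucket, so a 5% sample is always a subset of a 10% sample. This is a sketch of the idea, not HashBackup's actual method:

```shell
# A path lands in the 5% sample only if it also lands in the 10% sample,
# because the bucket comes from a stable hash of the pathname.
in_sample() {  # usage: in_sample <pathname> <percent>
    bucket=$(( $(printf '%s' "$1" | cksum | cut -d' ' -f1) % 100 ))
    [ "$bucket" -lt "$2" ]
}
in_sample /etc/hosts 100 && echo "/etc/hosts sampled at 100%"
```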
-V backup splits files into variable-sized blocks unless the
block-size or block-size-ext config options override this. The default
variable block size is automatically chosen when config option
block-size is set to
auto, the default.
-V overrides the
variable block size for a single backup. Values can be 16K 32K 64K
128K 256K 512K or 1M. During backup of a file the block size ranges
from half to 4x this block size, with the average block size being
about 1.5 times larger than specified. Larger block sizes dedup less
but have less block tracking overhead in the backup database, leading
to faster restore planning, faster
retain operation, and a
smaller backup database. If most of the files being saved are on the
large side, a larger block size is recommended. If the block size is
changed on an existing backup, changed files will be saved with the
new block size and will not dedup against previously saved versions
because they have a different block size. If a file continues to
change, it will dedup against the versions saved with the same block
size.
-X backup saves all paths listed on the command line, even if they
are on different filesystems. But while saving / (root) for example,
backup will not descend into other mounted filesystems unless -X is
used. Be careful: with -X it's easy to backup an NFS server or
external drive unintentionally.
-Z level sets the compression level. The compression level is 0 to
6, where 0 means no compression, 6 means highest compression, and 7-9
are reserved. The default is 6, the highest compression, so use a
lower level for slightly faster but less compression. Compression can be disabled
by file extension with the
no-compress-ext config variable.
Disabling compression with
-Z0 usually increases backup space, disk
writes, and backup transmission time, and is almost always slower
than letting HashBackup compress the data.
Raw Block Device, Partition, and Logical Volume Backups
If a block device name is used on the backup command line, all data stored on the block device will be saved. The mtime and ctime fields cannot be used to detect changes in a block device, so it is saved on every backup. The block device should either be unmounted or mounted read-only before taking the backup, or, with logical volumes, a snapshot can be taken and then the snapshot backed up. To maximize dedup, a block size of 4K is good for most block device backups. For very large devices, a block size of 64K or 1M will run faster, but may yield a bigger backup. You have to experiment with your data to decide. When saving multiple block devices to the same backup directory, they will only dedup well against each other if the backup block size matches the filesystem allocation unit size - usually 4K. This is similar to the situation with VM image backups; read below for more information.
On Linux, logical volumes are given symbolic links that point to the raw device. When a symbolic link is listed on the backup command line, its target (the raw device) will also be saved.
Named Pipes / Fifo Backups
If a named pipe or fifo is used on the backup command line, the fifo is opened and all data read from it is backed up. For example, this can be used for database dumps: instead of dumping to a huge text file, then backing up the text file, you can direct the dump output to a fifo and let HB back up the dump directly, without needing the intermediate text file. Here are the commands to do a fifo backup:
$ mkfifo myfifo
$ cat somefile >myfifo &
$ hb backup -c backupdir myfifo
Be very careful not to start more than one process writing to the same fifo! If this happens, the data from the two processes is intermixed at random and HB cannot tell this is happening. The resulting backup is unusable.
On Mac OSX and BSD, fifo backups are about half the speed of backups from a regular file because the fifo buffers are small. With Linux, fifo backups can be faster than regular file backups.
It is often difficult to decide the best way to back up your specific
set of data. You may have millions of small files, very large files,
VM disk images, photos, large mail directories, large databases, or
many other types of data. Different block sizes (
-B option), arc
file sizes (
arc-size-limit config keyword), dedup table sizes (the dedup-mem
config keyword), and arc file packing (the
pack-… config keywords) will all
affect the size of your backup. Your retention policy will also be a
factor. Determining all of these variables can be difficult.
To make configuring your backup easier, HB has simulated backups: the
backup works as usual, but no backup data is written. The metadata
about the backup - file names, blocks, arc files, etc. - is still all
tracked. You can do daily incremental backups, remove files from the
backup, and perform file retention, just like a normal backup, while
using very little disk space. The
hb stats command is then used to
see how much space a real backup would use. You may want to try one
configuration for a week or two, put that aside, try a different
configuration for a while, and then compare the
hb stats output for
the two backups to see which works better for your data.
To create a simulated backup, use:
$ hb config -c backupdir simulated-backup True
before your first backup. Then use normal HB commands as if this were
a real backup and use the
stats command to see the results. Once
this option has been set and a backup created with it, it cannot be
removed. It also cannot be set on an existing backup.
Some commands like
mount will fail with a simulated backup
since there is no real backup data. A useful option for very large
simulated backups is
--sample, to select only a portion of a
filesystem for the simulation.
A sparse file is a somewhat unusual file where disk space is not fully
allocated. These are often used for virtual disk drive images and
sometimes called "thin provisioning". An
ls -ls command shows a
number on the left, the space actually allocated to the file in
K-bytes, and the size of the file in bytes, which may be much larger.
When supported by the OS, HB will backup a sparse file efficiently by
skipping the unallocated file space, also called "holes", rather than
backing up a long string of zero bytes. If the OS or filesystem does
not support sparse file hole skipping, the file will be saved
normally. In either case, a sparse file is always restored as a
sparse file with holes. Any
-B block size can be used (or none, and
HB will pick a block size).
Sparse files are supported on Linux and BSD; Mac OS does not support sparse files. Here is an example of creating a 10GB sparse file, backing it up, and restoring it:
$ echo abc|dd of=sparsefile bs=1M seek=10000
0+1 records in
0+1 records out
4 bytes (4 B) copied, 0.000209 seconds, 19.1 kB/s

$ hb backup -c hb -D1g -B4k sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Copied HB program to /home/jim/hb/hb#1463
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/home/jim/sparsefile
Time: 1.0s
Checked: 5 paths, 10485760004 bytes, 10 GB
Saved: 5 paths, 4 bytes, 4 B
Excluded: 0
Sparse: 10485760000, 10 GB
Dupbytes: 0
Space: 64 B, 139 KB total
No errors

$ mv sparsefile sparsefile.bak

$ time hb get -c hb `pwd`/sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Most recent backup version: 0
Restoring most recent version
Restoring sparsefile to /home/jim
/home/jim/sparsefile
Restored /home/jim/sparsefile to /home/jim/sparsefile
No errors

real 0m0.805s
user 0m0.150s
sys 0m0.100s

$ ls -ls sparsefile*
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile.bak
Virtual Machine Backups
There are 2 ways to backup a VM: run HashBackup inside the VM as an ordinary backup, or run HashBackup from the VM host machine. Running HB inside the VM is just like any other backup and doesn’t have special considerations. VM images can be backed up while running HashBackup on the VM host, ie, outside the VM. What you are saving is a large file or collection of large files, often sparse, that together are the data for the virtual machine. This has 2 special considerations: consistent backups, and choosing a block size.
Consistent VM Image Backups
If a VM image is backed up while the VM is running, the disk image files being saved may change during the backup. For non-critical VMs that don't have a lot of activity, this isn't a big deal and is probably fine. If you restore a VM image saved this way, it is strongly recommended to run a forced fsck to ensure the filesystem is okay.
To get a consistent VM backup, you can: a) suspend the VM during the backup; b) take a snapshot with the VM software, or c) take a snapshot on the VM host filesystem, for example, an LVM snapshot, and backup the snapshot. With method b), you would need to revert to the snapshot after restoring the VM. Using method b), VM snapshots, has the additional advantage that you can run with this snapshot active for a week or two, and backups will be much faster because only the snapshot has to be scanned; the main VM image file will not change. Every week or so the snapshot(s) can be deleted, reincorporating them back into the main VM image, and then the entire VM image will be scanned on the next backup.
Block Size for VM Image Backups
HashBackup normally uses a variable block size that depends on the data being saved. For files that are updated in fixed-size blocks, like VM images, a fixed block size usually gives a smaller backup and faster backup performance.
Smaller block sizes yield higher dedup, though they also cause more
overhead because HB has to track more blocks. If HashBackup
recognizes the file suffix as a VM image,
it will switch to a 4K block size. For higher performance,
especially with very large VM images, a larger block size
could be used, with a "middle of the road" block size between these extremes.
Important note: if you expect to dedup across several VM images
saved into the same backup directory, the most effective way to do
that is with
-B4K. The reason is that different VMs will likely
have different filesystem layouts, even if they have the same
contents. Since most filesystems allocate space in 4K chunks, the
best way to dedup across VMs is to use the same 4K block size.
When choosing a block size, here are some things to keep in mind:
How big is the VM? For small VMs, the block size won’t make much difference, so don’t worry about it.
How much dedup do you expect / want? A smaller block size will dedup better, a larger block size will save and restore faster.
Do you expect dedup across VM images? If so, you probably have to use -B4K.
If you only want to dedup a VM image against itself, from one incremental backup to the next, you can use any block size. Larger block sizes will create a larger backup because of less dedup but will run faster and restore faster. Smaller block sizes will probably create a smaller backup but take longer to backup and restore.
If you want to dedup a VM image backup against itself within one backup rather than between incrementals, you will need to use
-B4K. Example: a VM image contains two identical 1 GB files. HashBackup will only reliably dedup these files with
-B4Kbecause the individual blocks of the two files could be scattered throughout the VM image, depending on how the filesystem inside the VM decides to allocate space.
How much data changes between backups? For a very active VM, a smaller block size will create smaller backups but a larger block size will run faster. For an inactive VM, not much data will be saved anyway, so don’t worry about the block size.
Do you need fast restores? A smaller block size will take longer because there will be more seeks during the restore; a larger block size will restore faster. It’s hard to quantify, so do test backups and restores to compare restore times.
These trade-offs are very data dependent. To see what works best for your VM images, it’s recommended to start with simulated backups (see simulated-backup config option). You can run simulated backups for a few days with one block size, then create a new simulated backup for a few days with a different block size, and compare how long the backups run vs how much backup space is used. Use
hb stats to check backup space used by a simulated backup.
Example Backup #1
Backup of a CentOS
/usr directory; CentOS is running in a Mac OSX VM.
First create the backup directory:
[root@cent5564 /]# hb init -c /hbbackup
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Permissions set for owner access only
Created key file /hbbackup/key.conf
Key file set to read-only
Setting include/exclude defaults: /hbbackup/inex.conf

VERY IMPORTANT: your backup is encrypted and can only be accessed
with the encryption key, stored in the file:
    /hbbackup/key.conf
You MUST make copies of this file and store them in a secure
location, separate from your computer and backup data. If your hard
drive fails, you will need this key to restore your files. If you
setup any remote destinations in dest.conf, that file should be
copied too.

Backup directory initialized
Drop the filesystem cache and do the initial (full) backup. This backup is slower than normal because this VM is only configured for 1 core.
[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches
[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
Copied HB program to /hbbackup/hb#1605
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr
Time: 190.1s, 3m 10s
CPU: 125.5s, 2m 5s, 66%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 72311 paths, 1227204836 bytes, 1.2 GB
Excluded: 0
Dupbytes: 175353351, 175 MB, 14%
Compression: 67%, 3.1:1
Efficiency: 6.26 MB reduced/cpusec
Space: 402 MB, 402 MB total
No errors

real 3m10.617s
user 1m13.299s
sys 0m53.212s
Drop the filesystem cache again to time an incremental backup:
[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches
[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
This is backup version: 1
Dedup enabled, 21% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr
Time: 22.1s
CPU: 19.1s, 86%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 2 paths, 0 bytes, 0
Excluded: 0
No errors

real 0m22.603s
user 0m5.949s
sys 0m13.608s
Example 2: Backup & Restore Mac Mini Server
Backup the Intel i7 Mac Mini Server root drive (643,584 files in 89GB) to the other drive. Both drives are slower 5600 rpm hard drives. During this backup, HashBackup used 267MB of RAM.
sh-3.2# purge; /usr/bin/time -l hb backup -c /hbtest -v1 -D1g /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /
Mount point contents skipped: /dev
Mount point contents skipped: /home
Mount point contents skipped: /net
Backing up: /hbtest/inex.conf
Time: 2323.1s, 38m 43s
CPU: 2004.4s, 33m 24s, 86%
Checked: 643631 paths, 89316800581 bytes, 89 GB
Saved: 643584 paths, 89238502003 bytes, 89 GB
Excluded: 47
Dupbytes: 21638127507, 21 GB, 24%
Compression: 36%, 1.6:1
Efficiency: 15.32 MB reduced/cpusec
Space: 57 GB, 57 GB total
No errors
     2325.97 real      1746.39 user       258.65 sys
 267079680  maximum resident set size
   3804171  page reclaims
      1019  page faults
         0  swaps
     88820  block input operations
     37261  block output operations
    823626  voluntary context switches
   1314791  involuntary context switches
Restore the backup - 28 minutes.
sh-3.2# purge; /usr/bin/time -l hb get -c /hbtest -v1 /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
Most recent backup version: 0
Restoring most recent version
Begin restore
Restoring / to /Users/jim/test
Restore? yes
Restored / to /Users/jim/test
0 errors
     1672.70 real      1333.01 user       190.00 sys
 276615168  maximum resident set size
   1388049  page reclaims
       964  page faults
         0  swaps
      2732  block input operations
     53260  block output operations
     35939  voluntary context switches
   3865449  involuntary context switches
Example 3: Incremental backup of Mac Mini with 3600 daily backups over 11 years
This is the daily backup of the HashBackup development server running on a 2010 Mac Mini (Intel Core 2 Duo 2.66GHz) with 2 SSD drives. There are 12 virtual machines hosted on this server so it averages about 80% idle. The backup of the main drive is written to the other SSD, copied to an external USB 2.0 drive, and copied to Backblaze B2 over a very slow 1Mbit/s (128KB/s) Internet connection.
HashBackup #2527 Copyright 2009-2021 HashBackup, LLC
Backup directory: /hbbackup
Backup start: 2021-11-26 02:00:02
Using destinations in dest.conf
This is backup version: 3609
Dedup enabled, 60% of current size, 12% of max size
/Library/Caches/com.apple.DiagnosticReporting.Networks.plist
/Library/Caches
...
/private/var
/private
/
Copied arc.3609.0 to usbcopy (9.5 MB 1s 8.9 MB/s)
Waiting for destinations: b2
Copied arc.3609.0 to b2 (9.5 MB 1m 7s 141 KB/s)
Checking database before upload
Writing hb.db.6321
Copied hb.db.6321 to usbcopy (178 MB 5s 32 MB/s)
Waiting for destinations: b2
Copied hb.db.6321 to b2 (178 MB 19m 21s 153 KB/s)
Copied dest.db to usbcopy (9.6 MB 0s 10 MB/s)
Waiting for destinations: b2
Copied dest.db to b2 (9.6 MB 1m 5s 147 KB/s)
Removed hb.db.6309 from b2
Removed hb.db.6310 from b2
Removed hb.db.6311 from b2
Removed hb.db.6309 from usbcopy
Removed hb.db.6310 from usbcopy
Removed hb.db.6311 from usbcopy
Time: 826.2s, 13m 46s
CPU: 963.9s, 16m 3s, 116%
Wait: 1561.9s, 26m 1s
Mem: 394 MB
Checked: 564451 paths, 79881638243 bytes, 79 GB
Saved: 155 paths, 23765297154 bytes, 23 GB
Excluded: 39
Dupbytes: 23663773091, 23 GB, 99%
Compression: 99%, 2480.9:1
Efficiency: 23.50 MB reduced/cpusec
Space: +9.5 MB, 57 GB total
No errors
Exit 0: Success
There are many VM disk images stored on this server, and a tiny change in these disk images means the whole disk image has to be backed up. In this backup, out of 79GB total, files totalling 23GB were changed. But because the amount of actual data changed was small, HashBackup’s dedup feature compressed a 23GB backup down to 9.5MB of new backup data. The backup was also copied to a USB drive for redundancy. A full-drive restore from this backup is shown on the Get page.
Example 4: Full and incremental backup of 2GB Ubuntu VM image
First backup a 2GB Ubuntu test VM image with a 64K block size. This is
on a 2010 MacBook Pro, Core 2 Duo 2.5GHz, SSD system. The peak HB
memory usage during this backup was 77MB. The purge command clears
all cached disk data from memory.
$ purge; hb backup -c hb -B64K -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile
Time: 36.9s
CPU: 61.6s, 1m 1s, 166%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 95289344, 95 MB, 4%
Compression: 56%, 2.3:1
Efficiency: 17.63 MB reduced/cpusec
Space: 873 MB, 873 MB total
No errors
Now touch the VM image to force an incremental backup. No data actually changed, but HashBackup still has to scan the file for changes. Peak memory usage during this backup was 77MB.
$ touch ../testfile; purge; hb backup -c hb -D1g -B64K ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile
Time: 15.1s
CPU: 18.4s, 121%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 49119.0:1
Efficiency: 104.30 MB reduced/cpusec
Space: 40 KB, 873 MB total
No errors
Try the same backup and incremental, this time with a variable block size. This gives 2x more dedup and a smaller backup, but 5% higher runtime. Peak memory usage again was around 78MB.
$ purge; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile
Time: 40.9s
CPU: 66.7s, 1m 6s, 163%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 162587404, 162 MB, 8%
Compression: 57%, 2.4:1
Efficiency: 16.56 MB reduced/cpusec
Space: 853 MB, 853 MB total
No errors
Now an incremental backup with variable block size. This shows that for VM images, variable-block incremental backup is slower than fixed-block backup because data has to be scanned byte-by-byte.
$ purge; touch ../testfile; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 16% of current, 0% of max
/Users/jim/testfile
Time: 21.4s
CPU: 24.8s, 115%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 40932.5:1
Efficiency: 77.36 MB reduced/cpusec
Space: 49 KB, 853 MB total
No errors
Let’s compare how fast
dd can read the test VM image vs HashBackup,
both with a block size of 64K. The results below show
dd reads at
233 MB/sec on this system, HashBackup reads and compares to the
previous backup at 133 MB/sec (if no data changes). The difference
between the full backup time where all data is saved, and the
incremental time where no data is saved, gives an idea of the time
HashBackup needs to backup changed data. Based on this, HashBackup
can backup 1% of changed data for this file in about 0.218 seconds, or
2.18 seconds if 10% of data changed. This is added to the incremental
time with no changes to estimate the backup time with changed data.
You can use similar formulas to estimate backup times of large VM
images.
IMPORTANT: these tests are with a 2010 Intel 2.5GHz Core2Duo CPU. More recent CPUs will have faster backup times.
$ purge; dd if=../testfile of=/dev/null bs=64k
30699+1 records in
30699+1 records out
2011914240 bytes transferred in 8.638740 secs (232894413 bytes/sec)

# compute how fast HashBackup did an incremental backup and
# estimate how long it takes to backup 1% changed data
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
2011914240/15.1
133239353.642    <== HB incremental backup rate was 133.23 MB/sec for unchanged VM image
(36.9-15.1)/100
.218             <== approx seconds to backup 1% changed data in the test VM image
Do an incremental backup with 10% of the 2GB VM image data changed at random, to verify the expected runtime of about 17.3 seconds (the 15.1s unchanged incremental time plus 2.18s for the 10% changed data):
$ purge; time hb backup -c hb -D1g -B64K ../testfile
Backup directory: /Users/jim/hbrel/hb
Group test & size: 262144 1048576
This is backup version: 2
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile
Time: 17.2s                        <== runtime is a bit lower than expected - yay!
CPU: 23.1s, 134%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 1816027136, 1.8 GB, 90%  <== shows 10% changed data
Compression: 95%, 22.2:1
Efficiency: 79.42 MB reduced/cpusec
Space: 90 MB, 964 MB total
No errors
NetApp Snapshot Backups
NetApp Filers have a built-in snapshot capability. Snapshots can be
taken automatically with generated pathnames, or manually with a
specified pathname. An example of an automatic name would be
/nfs/snapshot.1, etc., with the higher number
being the most recent snapshot. Saving the highest snapshot is a
problem for HashBackup because the pathname changes on every backup,
causing unreasonable metadata growth in the
hb.db database file.
To make efficient backups, use a bind mount on Linux or nullfs mount on FreeBSD to make the snapshot directory appear at a fixed name:
$ mount --bind /nfs/snapshot.N /nfstree (Linux)
$ mount_nullfs /nfs/snapshot.N /nfstree (FreeBSD)
Then backup /nfstree with HashBackup.
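The nightly sequence might be sketched like this. This is only a sketch under assumptions: /nfs, /nfstree, /hbbackup, and the snapshot.N naming are hypothetical, site-specific paths.

```shell
#!/bin/sh
# Sketch: backup the newest NetApp snapshot through a fixed mount point
# so pathnames in hb.db never change. All paths here are assumptions.
SNAPDIR=${SNAPDIR:-/nfs}
FIXED=${FIXED:-/nfstree}

# pick the snapshot with the highest numeric suffix (the most recent one)
latest_snapshot() {
    ls -d "$1"/snapshot.* | sort -t. -k2 -n | tail -1
}

if [ -d "$SNAPDIR" ] && [ -d "$FIXED" ]; then
    latest=$(latest_snapshot "$SNAPDIR")
    mount --bind "$latest" "$FIXED"   # Linux; FreeBSD: mount_nullfs "$latest" "$FIXED"
    hb backup -c /hbbackup "$FIXED"
    umount "$FIXED"
fi
```

Because the backup always sees the fixed mount point, each run is a normal incremental backup rather than a full backup of a brand-new pathname.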
Dirvish Backups
Dirvish is an open-source backup tool that uses rsync to create
hard-linked backup trees. Each tree is a complete snapshot of a
backup source, but unchanged files are actually hard links. This
saves disk space since file data is not replicated in each tree. The
problem is that these trees often include a timestamp in the pathname.
If the trees are backed up directly, every pathname is unique. This
causes unreasonable metadata growth in the
hb.db file, which leads to
excessive RAM usage.
For more efficient backups, mount the tree directory (the directory containing user files) to a fixed pathname with a Linux bind mount or FreeBSD nullfs mount. See the NetApp section above for details.
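For example, assuming the Dirvish vault keeps each image under a timestamped directory with the user files in a tree subdirectory (this layout and every path below are assumptions, not part of HashBackup), a sketch:

```shell
#!/bin/sh
# Sketch: bind-mount the newest Dirvish image tree at a fixed pathname
# so HashBackup always sees the same paths. /backup/vault, the
# <timestamp>/tree layout, and /dirvishtree are assumptions.
VAULT=${VAULT:-/backup/vault}
FIXED=${FIXED:-/dirvishtree}

# timestamped image directories sort lexically, so the last is newest
newest_tree() {
    ls -d "$1"/*/tree | sort | tail -1
}

if [ -d "$VAULT" ] && [ -d "$FIXED" ]; then
    mount --bind "$(newest_tree "$VAULT")" "$FIXED"
    hb backup -c /hbbackup "$FIXED"
    umount "$FIXED"
fi
```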
A second issue with hard-link backups is that HashBackup maintains an
in-memory data structure for hard link inodes so that it can properly
relate linked files. This data structure is not very memory efficient
and uses ~165 bytes for each inode that is a hard link. This can be
disabled with the
--no-ino option, though that disables all hard-link processing.
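At ~165 bytes per hard-linked inode, the memory cost is easy to estimate; the 2-million-inode figure below is just an illustration:

```shell
# Rough memory estimate for HashBackup's hard-link tracking,
# at ~165 bytes per hard-linked inode (example count is hypothetical).
hardlink_mem_mb() {
    echo $(( $1 * 165 / 1024 / 1024 ))
}
echo "$(hardlink_mem_mb 2000000) MB"   # → 314 MB for 2 million hard links
```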
Rsnapshot Backups
Rsnapshot is an rsync-based backup tool that creates hard-linked
backup trees. A big difference is that with rsnapshot, the most
recent backup is contained in the
.0 directory, i.e., the most recent backup is at a fixed mount
point. HashBackup can be used to backup the .0 directory every
night. You should avoid backing up the other
rsnapshot directories. If you need to save the other directories:
- rename the rsnapshot directory you want to backup to a fixed pathname
- backup that fixed pathname
- rename the directory back to its original name
- repeat these steps for every rsnapshot directory you want to backup
- when finished, rename and backup the .0 directory the same way
- going forward, only save the .0 directory
- instead of renaming you can use the bind mount trick (see the NetApp section)
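The rename steps above could be scripted roughly like this. It is only a sketch: /snapshots, the daily.N interval names, and the fixed name "current" are assumptions to substitute with your own layout.

```shell
#!/bin/sh
# Sketch of the rename approach for saving older rsnapshot trees.
# /snapshots, the daily.N names, and the fixed name "current" are
# assumptions - substitute your own rsnapshot layout.
SNAPROOT=${SNAPROOT:-/snapshots}
HB=${HB:-hb}                     # path to the hb program

save_tree() {   # give a tree the fixed pathname, back it up, rename it back
    mv "$SNAPROOT/$1" "$SNAPROOT/current"
    $HB backup -c /hbbackup --no-ino "$SNAPROOT/current"
    mv "$SNAPROOT/current" "$SNAPROOT/$1"
}

if [ -d "$SNAPROOT/daily.1" ]; then
    for dir in daily.3 daily.2 daily.1; do   # oldest first; names assumed
        save_tree "$dir"
    done
fi
```

Because every tree is backed up under the same fixed pathname, HashBackup treats each run as an incremental backup instead of creating new metadata for every timestamped tree.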
Since rsnapshot backups are hard-link trees, nearly every file is a
hard link and you may want to consider using
--no-ino to lower
memory usage. This disables all hard-link processing in HashBackup.
Using HashBackup for Archiving
HashBackup is designed as a backup solution, where a collection of files is saved on a rather frequent basis: daily, monthly, etc. An archive is a special kind of backup that is more of a one-time event, for example, saving a hard drive or server that is going to be decommissioned. It could be 5 years (or never!) before the archive is accessed again.
HashBackup uses a database to store information about a backup. The format of this database is periodically changed as new features are added, and any structural changes are automatically applied as necessary when a new release of HB accesses a backup. HB can automatically update a backup's database as long as the backup has been accessed with a current version of HB within the last year or so. If it has been more than a year, the latest version of HB may or may not be able to automatically update the backup's database; an older version may need to be used first to partially upgrade the database, then a later version used to fully upgrade it.
One easy way to keep an archive’s database up to date is to create a
cron job to backup a very small file every month. This file doesn’t
have to be related in any way to the original archive - a dummy file
in /tmp is fine. Also include an
hb upgrade in the cron job.
Together these will ensure that your copy of the HB program stays
up-to-date and that your archive's database is automatically upgraded.
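A hedged sketch of such a monthly job, meant to be run from cron; /hbarchive, the dummy file path, and the hb location are assumptions, not from the HB documentation:

```shell
#!/bin/sh
# Monthly archive "keep-alive" sketch: backup a tiny dummy file so the
# archive's database stays upgraded, and run hb upgrade so the program
# stays current. /hbarchive and /tmp/hb-dummy are assumed paths; run
# from a monthly crontab entry, e.g. "0 3 1 * * /usr/local/bin/keepalive.sh".
HB=${HB:-/usr/local/bin/hb}
DUMMY=${DUMMY:-/tmp/hb-dummy}

keepalive() {
    "$HB" upgrade                       # keep the HB program up to date
    date > "$DUMMY"                     # guarantee the dummy file changed
    "$HB" backup -c /hbarchive "$DUMMY"
}

if [ -x "$HB" ]; then
    keepalive
fi
```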
Another way to handle archives is to always run the version of HB contained within the backup directory.
Syncing a New Destination
When a new destination is added to
dest.conf, HashBackup will
automatically sync existing backup data to the new destination. This
can be somewhat difficult when it is going to take days or weeks to
get the new destination in sync, either because bandwidth is limited
or the backup is already very large, and it interferes with daily
backups. A script like below can be used until the new destination
gets in sync. To get started, add the new destination to dest.conf
and include the
off keyword in the new destination. Then run this
script daily from crontab, modified for your own site.
IMPORTANT: this method only works if you have a complete copy of the
backup in the local directory. If you have
cache-size-limit set to
anything but -1, this method does not work because remote-to-remote
syncs happen before the backup starts running. For time-limited
remote-to-remote syncs, use this script but you’ll also need to use
either the Unix
timeout command or add a crontab entry with
killall -TERM hb to abort the sync at a certain time.
#!/bin/sh

# disable new destination for regular backup operation
sed -i.bak "s/^#off/off/" /hbbackup/dest.conf

nice /usr/local/bin/hb log backup -c /hbbackup /
nice /usr/local/bin/hb log retain -c /hbbackup -s30d12m -x3m
nice /usr/local/bin/hb log selftest -c /hbbackup -v4 --inc 1d/90d,900MB

# enable new destination to run sync for 20 hours
sed -i.bak "s/^off/#off/" /hbbackup/dest.conf
nice /usr/local/bin/hb log backup -c /hbbackup /hbbackup/inex.conf --maxwait 20h