Performs a backup. Advanced options are described later, but commonly-used options are:
$ hb backup [-c backupdir] [-D mem] [-v level] [-X] [-t tagname] path1 path2 ...
-c the backup is stored in the local backup directory specified with
the -c option, or in a default backup directory (described below) if
-c isn't used. If you don't want a complete local copy of the
backup, set cache-size-limit with the
config command and create a
dest.conf file in the backup directory. See
Destinations for details and examples.
-D enable dedup, saving only one copy of identical data blocks.
-D is followed by the amount of memory you want to dedicate to the dedup operation, for example:
-D1g would use 1 gigabyte of RAM,
-D100m would use 100 MB. The more memory you have available, the more duplicate data HashBackup can detect, though it is still effective even with limited memory. Dedup can also be enabled with the
dedup-mem config option. See Dedup for detailed information and recommendations.
-D0 disables dedup for this backup. Unlike most options,
-D uses a factor of 1024, so
-D100m means 100 * 1024 * 1024 bytes. Dedup is disabled by default since the best memory limit is system-dependent.
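Because -D uses a 1024 multiplier, the byte value behind a given option can be checked with ordinary shell arithmetic (a standalone illustration, not part of HashBackup):

```shell
# -D100m with HashBackup's 1024 multiplier: 100 * 1024 * 1024 bytes
dedup_mem=$((100 * 1024 * 1024))
echo "$dedup_mem"    # prints 104857600
```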
-v controls how much output is displayed; higher numbers show more:
-v0 = print version, backup directory, copy messages, error count
-v1 = print names of skipped filesystems
-v2 = print names of files backed up (this is the default)
-v3 = print names of excluded files
-X backup will not cross mount points (separate filesystems) unless
-X is used, so normally you have to explicitly list on the command line each
filesystem you want to backup. Use the
df command to see a list of
your filesystems. To see what filesystem a directory is on, use df
followed by a pathname, or use
df . to see the current directory's filesystem.
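The standard df utility (not HashBackup-specific) can report the filesystem containing a given path; the mount point is the last column of the second output line:

```shell
# Show the filesystem holding the current directory.
df -P .
# Extract just the mount point from df's output:
mount_point=$(df -P . | awk 'NR==2 {print $NF}')
echo "$mount_point"
```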
-t tagname tag the backup with information-only text that is displayed when listing backup versions.
Each backup creates a new version in the backup directory containing all files modified since the previous backup. An easy way to think of this is a series of incremental backups, stacked one on top of the other. HashBackup presents the illusion that every version is a full backup, while performing the backups at incremental speeds and using incremental backup storage space and bandwidth. While full backups are an option, there is no requirement to do them. HashBackup is designed to not need full backups and you will experience huge savings of time, disk space, and bandwidth compared to traditional backup methods. The
retain command explains how to maintain weekly or monthly snapshots - or both!
Specifying the Backup Directory
All commands accept a
-c option to specify the backup directory.
HashBackup stores your backup data and the encryption key in
this directory. If
-c is not specified on the command line, the environment variable
HASHBACKUP_DIR is checked. If this environment
variable doesn't exist, the directory
~/hashbackup is used if it exists. If
~/hashbackup doesn't exist,
/var/hashbackup is used.
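The lookup order above can be sketched as a small shell function (an illustration of the documented rules, not HashBackup's actual code):

```shell
# Mirror of the documented lookup order for the backup directory.
# $1 is the -c argument, if any.
resolve_backup_dir() {
    if [ -n "$1" ]; then                     # 1. -c on the command line
        echo "$1"
    elif [ -n "$HASHBACKUP_DIR" ]; then      # 2. HASHBACKUP_DIR env var
        echo "$HASHBACKUP_DIR"
    elif [ -d "$HOME/hashbackup" ]; then     # 3. ~/hashbackup, if it exists
        echo "$HOME/hashbackup"
    else                                     # 4. fallback
        echo /var/hashbackup
    fi
}
resolve_backup_dir /mnt/backup    # prints /mnt/backup
```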
Backup directories must first be initialized with the init
command. If you do not want a complete local copy of your backup, set the
cache-size-limit config option with the
config command to
limit the size of the local backup directory, then set up a dest.conf
file to send backup data to a destination.
HashBackup saves all of your files' properties in a database, so backups can be stored on any type of "dumb" storage: FAT, VFAT, CIFS/SMB/Samba, NFS, USB thumb drives, SSHFS, WebDAV, FTP servers, etc. All file attributes, including hard links, ACLs, and extended attributes will be accurately saved and restored, even if not supported by the storage system. An encrypted incremental backup of the backup database is automatically included with every backup.
It’s possible to backup directly to a mounted remote filesystem or USB
drive with the
-c option. However, the backup performance may be
slower than with a local filesystem, and these may not provide robust
sync and lock facilities. Sync facilities are used to ensure that the
backup database doesn’t get corrupted if your computer halts
unexpectedly during the backup process, for example, if there is a
power outage. Lock facilities ensure that two backups don’t run at
the same time, possibly overwriting each others' backup files. It’s
best to specify a local backup directory with
-c and let HashBackup
copy the backup data to mounted remote storage or USB drive by using a
Dir destination in the
dest.conf file in your backup directory.
Remote destinations are optional. They are set up in the dest.conf
file, located in the backup directory. See the
Destinations page for more detailed
info on how to configure destinations for SSH, RSYNC, and FTP servers,
Amazon S3, Google Storage, Backblaze, Gmail, IMAP/email servers, and
simple directory (Dir) destinations. Dir destinations are used to
copy your backup to USB thumb drives, NFS, WebDAV, Samba, and any
other storage that is mounted as a filesystem.
The backup command creates backup archive files in the backup
directory. After arc files fill up, they can be transmitted to one or
more remote destinations, configured with a dest.conf file.
Transmissions run in parallel with the backup process. After the
backup has completed, the backup command waits for all files to
finish transmitting.
By default, archives are kept in the local backup directory after
sending to remote destinations, making restores very fast because data
does not have to be downloaded from a remote site. Keeping a local
copy also adds redundancy and makes it easier to pack remote archives
(remove deleted or expired backup data) because archives don’t have to
be downloaded first. It is possible to delete archives from the local
backup directory as they are sent to destinations by using the
cache-size-limit config option.
If you are storing your backup directly to a USB or other removable
drive with the
-c option, and that’s the only backup copy you want,
you do not need to use the
dest.conf file at all.
The dest.conf file is only used when you want to copy your backup, usually to a remote site.
Important Security Note: your encryption key is stored in the file
key.conf in the backup directory. Usually this is a local
directory, but if you are writing your backup directly to mounted
remote storage with the
-c option, for example, Google Drive,
DropBox, etc., or to a removable USB drive, be sure to set a
passphrase when initializing your backup directory. This is done
with the -p ask option to
hb init or
hb rekey. Both the
passphrase and key are required to access your backup. If you are
writing directly to mounted remote storage that you control, such as
an NFS server, a passphrase may not be necessary. If you are using a
Dir destination to copy your backup to remote storage
or a removable drive, but your backup directory and key are on a
local drive, a passphrase is not necessary because the key is never
copied to destinations.
A handful of system files and directories, such as /sys and
hibernation files, are automatically excluded.
/var/hashbackup and the
-c backup directory itself are
also automatically excluded. An
inex.conf file is created by the
init command in the backup directory showing which files are
excluded by default.
To exclude other files, edit the
inex.conf file in the backup
directory. The format of this file is:
# comment lines
E(xclude) <pathname>

# exclude all .wav files:
Exclude *.wav

# exclude the /Cache directory and its contents:
e /Cache

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyz still works.
e /Cache/

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyzdir will save the directory itself
# since it was explicitly requested, but will not save xyzdir's
# contents.
e /Cache/*
Any abbreviation of the exclude keyword at the beginning of the line
is recognized. Tilde characters
~ are not expanded into user home directories.
There are several other ways to exclude files from the backup:
no-backup-tag can be set to a list of filenames, separated by commas. If a directory contains any of these files, only the directory itself and the tag files are saved. HB does not read or check the contents of tag files.
files with the nodump flag set are not backed up (but on Linux, this also requires setting a config option)
no-backup-ext can be set to a list of comma-separated extensions, with or without dots. Any file ending in one of the listed extensions is not backed up. This test is done without regard to uppercase or lowercase, so
avi will exclude both file.avi and file.AVI
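The case-insensitive test can be sketched in shell (an illustration only; HashBackup's real matching is internal, and the filenames are hypothetical):

```shell
# Sketch of a case-insensitive extension test like no-backup-ext avi:
# lowercase the name, then match the extension.
excluded_by_ext() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        *.avi) return 0 ;;   # would be excluded from the backup
        *)     return 1 ;;   # would be backed up
    esac
}
excluded_by_ext movie.AVI && echo "movie.AVI excluded"
excluded_by_ext notes.txt || echo "notes.txt backed up"
```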
Web browser caches are difficult to backup for a number of reasons (see Browser Caches note), and should always be excluded from backups, especially on large servers that store browser caches for many users.
The backup command has a few more options that may be useful:
-B blocksize the backup command splits files into blocks of data.
It chooses a "good" block size based on the type of file being saved
and by default uses variable-sized blocks. Variable block sizes work
well for files that change in unpredictable ways, like a text file.
For files that change in fixed-size units - raw database files (not
text dumps!), VM images, etc. - you may want to force a specific,
fixed block size. Fixed-size blocks can be anywhere from 128 bytes to
2GB-1 (2,147,483,647) bytes.
-B forces a fixed block size, which usually dedups less than a
variable block size except for cases where a file is updated in
fixed chunks; then it is usually more efficient than variable blocks.
Larger block sizes are more efficient in general but dedup less; small
block sizes cause more overhead but dedup better. Sometimes the extra
dedup savings of a small block size is less than the extra metadata
needed to track more blocks. Since this is very data dependent,
experiment with your actual data to determine the best block size to use.
-B1m usually speeds up backups on a single CPU system because
it disables variable block sizes.
-B4M or higher is useful when
files do not have much common data, for example media files. If there
are identical files, backup will only save 1 copy. A huge block size
like 64M may be useful for large scientific data files that will have
little duplicate data. Using a large block size does not make small
files use more backup space. Using simulated backups (described
below) with different block sizes is very helpful for determining the
best block size. Unlike most options,
-B uses a multiplier of 1024, so
-B4K means 4096 bytes rather than 4000 bytes.
IMPORTANT: very large block sizes will use a lot of RAM. A good
estimate is 8x block size with a couple of CPUs, or more if there are
more CPUs. Using
-p0 disables multi-threading and will use less RAM
and CPU but will also be slower.
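The rule of thumb above can be made concrete with shell arithmetic (a rough illustration of the stated 8x estimate, not a measured figure):

```shell
# Estimated RAM for -B64M with a couple of CPUs: about 8x the block size.
blocksize=$((64 * 1024 * 1024))      # -B uses a 1024 multiplier
ram_estimate=$((blocksize * 8))
echo "$ram_estimate"                 # prints 536870912, about 512 MB
```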
-F filesources read pathnames to be saved from file sources. More
than one file source can be used. This is often used with very large
file systems to avoid having to scan for modified files by having the
processes that modify files update a file list. Each file source is
either:
- the pathname of a file containing pathnames to backup, one per line. Blank lines and lines beginning with # (comments) are ignored. Pathnames in the file can be either individual files or whole directories to save. If a pathname doesn't exist in the filesystem but is present in the backup, it is marked deleted in the backup.
- the pathname of a directory, where each file in the directory contains a list of pathnames to backup as above.
The backup command line can still have pathnames to backup in addition
to -F, but these must come before the -F options.
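A file source might look like this (the pathnames are hypothetical, and the final hb command is shown for context only):

```shell
# Build a hypothetical file source; blank lines and # comments are ignored.
cat > /tmp/filelist <<'EOF'
# list maintained by the process that modifies these files
/home/jim/projects

/home/jim/notes.txt
EOF
# Count the pathnames a backup would read from it:
grep -cv -e '^#' -e '^$' /tmp/filelist    # prints 2
# Then: hb backup -c backupdir -F /tmp/filelist
```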
--full the backup command usually backs up only modified files and
determines this by looking at mtime, ctime, and other file attributes
for each file. This option overrides that behavior and causes every
file to be saved, even if it has not changed. Taking a full backup
adds redundancy to your backup data. Another way to accomplish this
is to create a new backup directory on both local and remote storage,
but this is harder to manage and retain will not work across multiple
backup directories. Full backups also limit the number of incremental
archives required to do large restores, though in testing, restoring a
complete OSX system from 3 years of daily incrementals took only 15%
longer than restoring from a full backup.
-m maxfilesize skip files larger than
maxfilesize. The size can
be specified as 100m or 100mb for 100 megabytes, 3g or 3gb for 3
gigabytes, etc. This limit does not apply to fifo or block device
backups because they always have a size of zero.
--maxtime time specifies the maximum time to spend actually saving
files. When this time is exceeded, the backup stops and waits for
uploads to finish using
--maxwait, which is adjusted based on how
long the backup took.
--maxwait time when used by itself, specifies the maximum time to
wait for files to be transmitted to remote destinations. When used
with --maxtime, the wait time is reduced by the time taken to create
the backup.
These two options are useful for huge initial backups, which take much
longer than incremental backups. They allow huge backups to span many
days without going outside the time reserved for backups. They also
prevent incrementals from running into production time when a large
amount of data changes for some reason. If a backup does not complete
within --maxtime, the next backup using the same command line
will restart where the previous backup left off, without rescanning
the old data. This allows backups of huge filesystems with tens of
millions of files over a period of time. Be aware that these time
limits are not very accurate for various technical reasons. Large arc
files, large files, slow upload rates, and many worker threads cause
more variability in the timing. It’s a good idea to set the time
limits an hour less than you really want to get a feel for how much
variability occurs in your backup environment.
--maxtime 1h means backup for up to 1 hour then wait the remainder
of the hour to upload the data, ie, total time is limited to 1 hour
--maxwait 1h means backup everything requested, however long it
takes, but only wait 1 hour for uploads to finish
--maxtime 1h --maxwait 1h means backup for up to 1 hour, then wait 1
hour + the remaining backup time for uploads to finish, ie, the total
time is limited to 2 hours
--maxtime 1h --maxwait 1y means backup for 1 hour, then wait a year
for uploads to finish, ie, only the backup time is limited
--no-ino on Unix filesystems, every file has a unique inode number.
If two paths have the same inode number, they are hard linked and
both paths refer to the same file data. Some filesystems do not have
real inode numbers: FUSE (sshfs, UnRaid) and Samba/SMB/CIFS for
example. Instead, they invent inode numbers and cache them for a
while, sometimes for many days. HashBackup normally verifies a path’s
inode number during backups, and if it changes, it triggers a backup
of the path. Inode numbers are also used to detect hard links. If a
filesystem does not have real inode numbers, it causes unexpected and
unpredictable full backups of files that haven’t changed and after
enough of these mistakes, HashBackup will stop with an error. The
--no-ino option prevents these unnecessary full backups.
A negative side-effect of
--no-ino is that hard-link detection is
disabled since it relies on stable inode numbers. With --no-ino,
hard-linked files are backed up like regular files and if restored,
are not hard-linked.
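The inode behavior behind hard-link detection can be seen directly with stat (GNU coreutils assumed; the filenames are just examples):

```shell
# Hard links share a single inode number; --no-ino gives up this detection.
tmp=$(mktemp -d)
touch "$tmp/orig"
ln "$tmp/orig" "$tmp/link"      # create a hard link to the same file data
stat -c %i "$tmp/orig"
stat -c %i "$tmp/link"          # same inode number as orig
```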
--no-mtime there are two timestamps on every Unix file: mtime is the
last time a file’s data was modified, and ctime is the last time a
file’s attributes were modified (permissions, etc). When a file’s
data is changed, both mtime and ctime are updated. If only the
attributes are changed, only ctime is updated. Because mtime can be
set by user programs, some administrators may not trust it to indicate
whether a file’s data has changed. The
--no-mtime option tells the
backup command to verify whether file contents have changed by
computing a strong checksum and checking that with the previous backup
rather than trusting mtime. This is not recommended for normal use
since the backup program has to read every unmodified file’s data to
compute the checksum, then possibly read it again to backup the file.
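The idea behind --no-mtime can be illustrated with a standalone checksum comparison (sha256sum is used here for illustration; HashBackup's internal checksum may differ):

```shell
# touch updates mtime but not data, so a content checksum is unchanged.
f=$(mktemp)
printf 'hello\n' > "$f"
sum1=$(sha256sum "$f" | cut -d' ' -f1)
touch "$f"                           # mtime changes, contents do not
sum2=$(sha256sum "$f" | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "data unchanged"
```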
-p procs specifies how many additional processes to use for the
backup. Normally HashBackup uses 1 process on single-CPU systems, and
several processes on multi-core systems with more than 1 CPU.
Multiple processes speed up the backup but also put a higher load on
your system. To reduce the performance impact of a backup on your
system, you may want to use
-p0 to force using only 1 CPU. Or, with
a very fast hard drive and many CPU cores, you may want to use more
than just a few processes. On a dual-core system, HashBackup will use
both cores; on a 4-core or more system it will use 4. To use more
than 4 cores,
-p must be used. Backup also displays %CPU
utilization at the end of the backup. If you are using
-p4 and %CPU
is 170%, it means that on average, HashBackup could only make use of
1.7 CPUs because of disk I/O, network, or other factors, so -p4 may
be overkill. Or, it may make sense if you have one huge file that can
take advantage of the extra CPU horsepower, even though most files
cannot. Experimenting is the best guide here.
--sample p specifies a percentage from 1 to 100 of files to sample
for inclusion in the backup. This is a percentage of the files seen,
not a percentage of the amount of data in the files. For example,
--sample 10 means to only backup 10% of the files seen. This can be
useful with simulated backups, especially with huge filesystems, to
try different backup options. Samples are not truly random, but are
"fixed" in the sense that if 5% of files are backed up first, then
10%, the new sample includes the same files as the first 5% plus an
additional 5%. This allows simulating incremental backups. All
pathnames are sampled, whether they come from the command line or
from -F file sources.
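One way such "fixed" sampling can work is hashing each pathname to a stable bucket, so a 5% sample is always a subset of a 10% sample. This is a sketch of the idea, not HashBackup's actual method:

```shell
# A path lands in the 5% sample only if it also lands in the 10% sample,
# because the bucket comes from a stable hash of the pathname.
in_sample() {  # usage: in_sample <pathname> <percent>
    bucket=$(( $(printf '%s' "$1" | cksum | cut -d' ' -f1) % 100 ))
    [ "$bucket" -lt "$2" ]
}
in_sample /etc/hosts 100 && echo "/etc/hosts sampled at 100%"
```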
-V backup splits files into variable-sized blocks unless the
block-size or block-size-ext config options override this. The default
variable block size is automatically chosen when config option
block-size is set to
auto, the default.
-V overrides the
variable block size for a single backup. Values can be 16K 32K 64K
128K 256K 512K or 1M. During backup of a file the block size ranges
from half to 4x this block size, with the average block size being
about 1.5 times larger than specified. Larger block sizes dedup less
but have less block tracking overhead in the backup database, leading
to faster restore planning, faster
retain operation, and a
smaller backup database. If most of the files being saved are on the
large side, a larger block size is recommended. If the block size is
changed on an existing backup, changed files will be saved with the
new block size and will not dedup against previously saved versions
because they have a different block size. If a file continues to
change, it will dedup against the versions saved with the same block
size.
-X backup saves all paths listed on the command line, even if they
are on different filesystems. But while saving / (root) for example,
backup will not descend into other mounted filesystems unless -X is
used. Be careful: with -X it's easy to backup an NFS server or
external drive unintentionally.
-Z level sets the compression level. The compression level is 0 to
6, where 0 means no compression, 6 means highest compression, and 7-9
are reserved. The default is 6, the highest compression, so use a
lower level for slightly faster but less compression. Compression can be disabled
by file extension with the
no-compress-ext config variable.
Disabling compression with
-Z0 usually increases backup space, disk
writes, and backup transmission time, and is almost always slower
than letting HashBackup compress the data.
Raw Block Device, Partition, and Logical Volume Backups
If a block device name is used on the backup command line, all data stored on the block device will be saved. The mtime and ctime fields cannot be used to detect changes in a block device, so it is saved on every backup. The block device should either be unmounted or mounted read-only before taking the backup, or, with logical volumes, a snapshot can be taken and then the snapshot backed up. To maximize dedup, a block size of 4K is good for most block device backups. For very large devices, a block size of 64K or 1M will run faster, but may yield a bigger backup. You have to experiment with your data to decide. When saving multiple block devices to the same backup directory, they will only dedup well against each other if the backup block size matches the filesystem allocation unit size - usually 4K. This is similar to the situation with VM image backups; read below for more information.
On Linux, logical volumes are given symbolic links that point to the raw device. When a symbolic link is listed on the backup command line, its target (the raw device) will also be saved.
Named Pipes / Fifo Backups
If a named pipe or fifo is used on the backup command line, the fifo is opened and all data read from it is backed up. For example, this can be used for database dumps: instead of dumping to a huge text file, then backing up the text file, you can direct the dump output to a fifo and let HB back up the dump directly, without needing the intermediate text file. Here are the commands to do a fifo backup:
$ mkfifo myfifo
$ cat somefile >myfifo &
$ hb backup -c backupdir myfifo
Be very careful not to start more than one process writing to the same fifo! If this happens, the data from the two processes is intermixed at random and HB cannot tell this is happening. The resulting backup is unusable.
On Mac OSX and BSD, fifo backups are about half the speed of backups from a regular file because the fifo buffers are small. With Linux, fifo backups can be faster than regular file backups.
It is often difficult to decide the best way to back up your specific
set of data. You may have millions of small files, very large files,
VM disk images, photos, large mail directories, large databases, or
many other types of data. Different block sizes (
-B option), arc
file sizes (
arc-size-limit config keyword), dedup table sizes (the dedup-mem
config keyword), and arc file packing (the
pack-… config keywords) will all
affect the size of your backup. Your retention policy will also be a
factor. Determining all of these variables can be difficult.
To make configuring your backup easier, HB has simulated backups: the
backup works as usual, but no backup data is written. The metadata
about the backup - file names, blocks, arc files, etc. - is still all
tracked. You can do daily incremental backups, remove files from the
backup, and perform file retention, just like a normal backup, while
using very little disk space. The
hb stats command is then used to
see how much space a real backup would use. You may want to try one
configuration for a week or two, put that aside, try a different
configuration for a while, and then compare the
hb stats output for
the two backups to see which works better for your data.
To create a simulated backup, use:
$ hb config -c backupdir simulated-backup True
before your first backup. Then use normal HB commands as if this were
a real backup and use the
stats command to see the results. Once
this option has been set and a backup created with it, it cannot be
removed. It also cannot be set on an existing backup.
Some commands like
mount will fail with a simulated backup
since there is no real backup data. A useful option for very large
simulated backups is
--sample, to select only a portion of a
filesystem for the simulation.
A sparse file is a somewhat unusual file where disk space is not fully
allocated. These are often used for virtual disk drive images and
sometimes called "thin provisioning". An
ls -ls command shows a
number on the left, the space actually allocated to the file in
K-bytes, and the size of the file in bytes, which may be much larger.
When supported by the OS, HB will backup a sparse file efficiently by
skipping the unallocated file space, also called "holes", rather than
backing up a long string of zero bytes. If the OS or filesystem does
not support sparse file hole skipping, the file will be saved
normally. In either case, a sparse file is always restored as a
sparse file with holes. Any
-B block size can be used (or none, and
HB will pick a block size).
Sparse files are supported on Linux and BSD; Mac OS does not support sparse files. Here is an example of creating a 10GB sparse file, backing it up, and restoring it:
$ echo abc|dd of=sparsefile bs=1M seek=10000
0+1 records in
0+1 records out
4 bytes (4 B) copied, 0.000209 seconds, 19.1 kB/s

$ hb backup -c hb -D1g -B4k sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Copied HB program to /home/jim/hb/hb#1463
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/home/jim/sparsefile
Time: 1.0s
Checked: 5 paths, 10485760004 bytes, 10 GB
Saved: 5 paths, 4 bytes, 4 B
Excluded: 0
Sparse: 10485760000, 10 GB
Dupbytes: 0
Space: 64 B, 139 KB total
No errors

$ mv sparsefile sparsefile.bak

$ time hb get -c hb `pwd`/sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Most recent backup version: 0
Restoring most recent version
Restoring sparsefile to /home/jim
/home/jim/sparsefile
Restored /home/jim/sparsefile to /home/jim/sparsefile
No errors

real 0m0.805s
user 0m0.150s
sys 0m0.100s

$ ls -ls sparsefile*
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile.bak
Virtual Machine Backups
There are 2 ways to backup a VM: run HashBackup inside the VM as an ordinary backup, or run HashBackup from the VM host machine. Running HB inside the VM is just like any other backup and doesn’t have special considerations. VM images can be backed up while running HashBackup on the VM host, ie, outside the VM. What you are saving is a large file or collection of large files, often sparse, that together are the data for the virtual machine. This has 2 special considerations: consistent backups, and choosing a block size.
Consistent VM Image Backups
If a VM image is backed up while the VM is running, the disk image files being saved may change during the backup. For non-critical VMs that don't have a lot of activity, this isn't a big deal and is probably fine. If you restore a VM image saved this way, it is strongly recommended to run a forced fsck to ensure the filesystem is okay.
To get a consistent VM backup, you can: a) suspend the VM during the backup; b) take a snapshot with the VM software, or c) take a snapshot on the VM host filesystem, for example, an LVM snapshot, and backup the snapshot. With method b), you would need to revert to the snapshot after restoring the VM. Using method b), VM snapshots, has the additional advantage that you can run with this snapshot active for a week or two, and backups will be much faster because only the snapshot has to be scanned; the main VM image file will not change. Every week or so the snapshot(s) can be deleted, reincorporating them back into the main VM image, and then the entire VM image will be scanned on the next backup.
Block Size for VM Image Backups
HashBackup normally uses a variable block size that depends on the data being saved. For files that are updated in fixed-size blocks, like VM images, a fixed block size usually gives a smaller backup and faster backup performance.
Smaller block sizes yield higher dedup, though they also cause more
overhead because HB has to track more blocks. If HashBackup
recognizes the file suffix as a VM image,
it will switch to a 4K block size. For higher performance,
especially with very large VM images, a larger block size
could be used, with a "middle of the road" block size between these extremes.
Important note: if you expect to dedup across several VM images
saved into the same backup directory, the most effective way to do
that is with
-B4K. The reason is that different VMs will likely
have different filesystem layouts, even if they have the same
contents. Since most filesystems allocate space in 4K chunks, the
best way to dedup across VMs is to use the same 4K block size.
When choosing a block size, here are some things to keep in mind:
How big is the VM? For small VMs, the block size won’t make much difference, so don’t worry about it.
How much dedup do you expect / want? A smaller block size will dedup better, a larger block size will save and restore faster.
Do you expect dedup across VM images? If so, you probably have to use -B4K.
If you only want to dedup a VM image against itself, from one incremental backup to the next, you can use any block size. Larger block sizes will create a larger backup because of less dedup but will run faster and restore faster. Smaller block sizes will probably create a smaller backup but take longer to backup and restore.
If you want to dedup a VM image backup against itself within one backup rather than between incrementals, you will need to use
-B4K. Example: a VM image contains two identical 1 GB files. HashBackup will only reliably dedup these files with
-B4Kbecause the individual blocks of the two files could be scattered throughout the VM image, depending on how the filesystem inside the VM decides to allocate space.
How much data changes between backups? For a very active VM, a smaller block size will create smaller backups but a larger block size will run faster. For an inactive VM, not much data will be saved anyway, so don’t worry about the block size.
Do you need fast restores? A smaller block size will take longer because there will be more seeks during the restore; a larger block size will restore faster. It’s hard to quantify, so do test backups and restores to compare restore times.
These trade-offs are very data dependent. To see what works best for your VM images, it’s recommended to start with simulated backups (see simulated-backup config option). You can run simulated backups for a few days with one block size, then create a new simulated backup for a few days with a different block size, and compare how long the backups run vs how much backup space is used. Use
hb stats to check backup space used by a simulated backup.
Example Backup #1
Backup of a CentOS
/usr directory; CentOS is running in a Mac OSX VM.
First create the backup directory:
[root@cent5564 /]# hb init -c /hbbackup
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Permissions set for owner access only
Created key file /hbbackup/key.conf
Key file set to read-only
Setting include/exclude defaults: /hbbackup/inex.conf

VERY IMPORTANT: your backup is encrypted and can only be accessed
with the encryption key, stored in the file:
    /hbbackup/key.conf
You MUST make copies of this file and store them in a secure
location, separate from your computer and backup data. If your hard
drive fails, you will need this key to restore your files. If you
setup any remote destinations in dest.conf, that file should be
copied too.

Backup directory initialized
Drop the filesystem cache and do the initial (full) backup. This backup is slower than normal because this VM is only configured for 1 core.
[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches
[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
Copied HB program to /hbbackup/hb#1605
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr
Time: 190.1s, 3m 10s
CPU: 125.5s, 2m 5s, 66%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 72311 paths, 1227204836 bytes, 1.2 GB
Excluded: 0
Dupbytes: 175353351, 175 MB, 14%
Compression: 67%, 3.1:1
Efficiency: 6.26 MB reduced/cpusec
Space: 402 MB, 402 MB total
No errors

real 3m10.617s
user 1m13.299s
sys 0m53.212s
Drop the filesystem cache again to time an incremental backup:
[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches
[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
This is backup version: 1
Dedup enabled, 21% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr
Time: 22.1s
CPU: 19.1s, 86%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 2 paths, 0 bytes, 0
Excluded: 0
No errors

real 0m22.603s
user 0m5.949s
sys 0m13.608s
Example 2: Backup & Restore Mac Mini Server
Backup the Intel i7 Mac Mini Server root drive (643,584 files in 89GB) to the other drive. Both drives are slower 5600 rpm hard drives. During this backup, HashBackup used 267MB of RAM.
sh-3.2# purge; /usr/bin/time -l hb backup -c /hbtest -v1 -D1g /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /
Mount point contents skipped: /dev
Mount point contents skipped: /home
Mount point contents skipped: /net
Backing up: /hbtest/inex.conf
Time: 2323.1s, 38m 43s
CPU: 2004.4s, 33m 24s, 86%
Checked: 643631 paths, 89316800581 bytes, 89 GB
Saved: 643584 paths, 89238502003 bytes, 89 GB
Excluded: 47
Dupbytes: 21638127507, 21 GB, 24%
Compression: 36%, 1.6:1
Efficiency: 15.32 MB reduced/cpusec
Space: 57 GB, 57 GB total
No errors
     2325.97 real      1746.39 user       258.65 sys
 267079680  maximum resident set size
   3804171  page reclaims
      1019  page faults
         0  swaps
     88820  block input operations
     37261  block output operations
    823626  voluntary context switches
   1314791  involuntary context switches
Restore the backup - 28 minutes.
sh-3.2# purge; /usr/bin/time -l hb get -c /hbtest -v1 /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
Most recent backup version: 0
Restoring most recent version
Begin restore
Restoring / to /Users/jim/test
Restore? yes
Restored / to /Users/jim/test
0 errors
     1672.70 real      1333.01 user       190.00 sys
 276615168  maximum resident set size
   1388049  page reclaims
       964  page faults
         0  swaps
      2732  block input operations
     53260  block output operations
     35939  voluntary context switches
   3865449  involuntary context switches
Example 3: Incremental backup of Mac Mini with 3600 daily backups over 11 years
This is the daily backup of the HashBackup development server running on a 2010 Mac Mini (Intel Core 2 Duo 2.66GHz) with 2 SSD drives. There are 12 virtual machines hosted on this server so it averages about 80% idle. The backup of the main drive is written to the other SSD, copied to an external USB 2.0 drive, and copied to Backblaze B2 over a very slow 1Mbit/s (128KB/s) Internet connection.
HashBackup #2527 Copyright 2009-2021 HashBackup, LLC
Backup directory: /hbbackup
Backup start: 2021-11-26 02:00:02
Using destinations in dest.conf
This is backup version: 3609
Dedup enabled, 60% of current size, 12% of max size
/Library/Caches/com.apple.DiagnosticReporting.Networks.plist
/Library/Caches
...
/private/var
/private
/
Copied arc.3609.0 to usbcopy (9.5 MB 1s 8.9 MB/s)
Waiting for destinations: b2
Copied arc.3609.0 to b2 (9.5 MB 1m 7s 141 KB/s)
Checking database before upload
Writing hb.db.6321
Copied hb.db.6321 to usbcopy (178 MB 5s 32 MB/s)
Waiting for destinations: b2
Copied hb.db.6321 to b2 (178 MB 19m 21s 153 KB/s)
Copied dest.db to usbcopy (9.6 MB 0s 10 MB/s)
Waiting for destinations: b2
Copied dest.db to b2 (9.6 MB 1m 5s 147 KB/s)
Removed hb.db.6309 from b2
Removed hb.db.6310 from b2
Removed hb.db.6311 from b2
Removed hb.db.6309 from usbcopy
Removed hb.db.6310 from usbcopy
Removed hb.db.6311 from usbcopy
Time: 826.2s, 13m 46s
CPU: 963.9s, 16m 3s, 116%
Wait: 1561.9s, 26m 1s
Mem: 394 MB
Checked: 564451 paths, 79881638243 bytes, 79 GB
Saved: 155 paths, 23765297154 bytes, 23 GB
Excluded: 39
Dupbytes: 23663773091, 23 GB, 99%
Compression: 99%, 2480.9:1
Efficiency: 23.50 MB reduced/cpusec
Space: +9.5 MB, 57 GB total
No errors
Exit 0: Success
There are many VM disk images stored on this server, and a tiny change in these disk images means the whole disk image has to be backed up. In this backup, out of 79GB total, files totalling 23GB were changed. But because the amount of actual data changed was small, HashBackup’s dedup feature compressed a 23GB backup down to 9.5MB of new backup data. The backup was also copied to a USB drive for redundancy. A full-drive restore from this backup is shown on the Get page.
Example 4: Full and incremental backup of 2GB Ubuntu VM image
First backup a 2GB Ubuntu test VM image with a 64K block size. This is
on a 2010 MacBook Pro, Core 2 Duo 2.5GHz, SSD system. The peak HB
memory usage during this backup was 77MB. The purge command clears
all cached disk data from memory.
$ purge; hb backup -c hb -B64K -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile
Time: 36.9s
CPU: 61.6s, 1m 1s, 166%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 95289344, 95 MB, 4%
Compression: 56%, 2.3:1
Efficiency: 17.63 MB reduced/cpusec
Space: 873 MB, 873 MB total
No errors
Now touch the VM image to force an incremental backup. No data actually changed, but HashBackup still has to scan the file for changes. Peak memory usage during this backup was 77MB.
$ touch ../testfile; purge; hb backup -c hb -D1g -B64K ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile
Time: 15.1s
CPU: 18.4s, 121%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 49119.0:1
Efficiency: 104.30 MB reduced/cpusec
Space: 40 KB, 873 MB total
No errors
Try the same backup and incremental, this time with a variable block size. This gives 2x more dedup and a smaller backup, but 5% higher runtime. Peak memory usage again was around 78MB.
$ purge; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile
Time: 40.9s
CPU: 66.7s, 1m 6s, 163%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 162587404, 162 MB, 8%
Compression: 57%, 2.4:1
Efficiency: 16.56 MB reduced/cpusec
Space: 853 MB, 853 MB total
No errors
Now an incremental backup with variable block size. This shows that for VM images, variable-block incremental backup is slower than fixed-block backup because data has to be scanned byte-by-byte.
$ purge; touch ../testfile; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 16% of current, 0% of max
/Users/jim/testfile
Time: 21.4s
CPU: 24.8s, 115%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 40932.5:1
Efficiency: 77.36 MB reduced/cpusec
Space: 49 KB, 853 MB total
No errors
Let’s compare how fast
dd can read the test VM image vs HashBackup,
both with a block size of 64K. The results below show
dd reads at
233 MB/sec on this system, HashBackup reads and compares to the
previous backup at 133 MB/sec (if no data changes). The difference
between the full backup time where all data is saved, and the
incremental time where no data is saved, gives an idea of the time
HashBackup needs to backup changed data. Based on this, HashBackup
can backup 1% of changed data for this file in about 0.218 seconds, or
2.18 seconds if 10% of data changed. This is added to the incremental
time with no changes to estimate the backup time with changed data.
You can use similar formulas to estimate backup times of large VM
images.
IMPORTANT: these tests are with a 2010 Intel 2.5GHz Core2Duo CPU. More recent CPUs will have faster backup times.
$ purge; dd if=../testfile of=/dev/null bs=64k
30699+1 records in
30699+1 records out
2011914240 bytes transferred in 8.638740 secs (232894413 bytes/sec)

# compute how fast HashBackup did an incremental backup and
# estimate how long it takes to backup 1% changed data
$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
2011914240/15.1
133239353.642    <== HB incremental backup rate was 133.23 MB/sec for unchanged VM image
(36.9-15.1)/100
.218             <== approx seconds to backup 1% changed data in the test VM image
Do an incremental backup with 10% of the 2GB VM image data changed at random, to verify the expected runtime of about 17.3 seconds (the 15.1s unchanged incremental time plus 2.18s for the 10% changed data):
$ purge; time hb backup -c hb -D1g -B64K ../testfile
Backup directory: /Users/jim/hbrel/hb
Group test & size: 262144 1048576
This is backup version: 2
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile
Time: 17.2s                        <== runtime is a bit lower than expected - yay!
CPU: 23.1s, 134%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 1816027136, 1.8 GB, 90%  <== shows 10% changed data
Compression: 95%, 22.2:1
Efficiency: 79.42 MB reduced/cpusec
Space: 90 MB, 964 MB total
No errors
NetApp Snapshot Backups
NetApp Filers have a built-in snapshot capability. Snapshots can be
taken automatically with generated pathnames, or manually with a
specified pathname. An example of an automatic name would be
/nfs/snapshot.1, etc., with the higher number
being the most recent snapshot. Saving the highest snapshot is a
problem for HashBackup because the pathname changes on every backup,
causing unreasonable metadata growth in the
hb.db database file.
To make efficient backups, use a bind mount on Linux or nullfs mount on FreeBSD to make the snapshot directory appear at a fixed name:
$ mount --bind /nfs/snapshot.N /nfstree (Linux)
$ mount_nullfs /nfs/snapshot.N /nfstree (FreeBSD)
Then backup /nfstree with HashBackup.
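The nightly sequence might be sketched like this. This is only a sketch under assumptions: /nfs, /nfstree, /hbbackup, and the snapshot.N naming are hypothetical, site-specific paths.

```shell
#!/bin/sh
# Sketch: backup the newest NetApp snapshot through a fixed mount point
# so pathnames in hb.db never change. All paths here are assumptions.
SNAPDIR=${SNAPDIR:-/nfs}
FIXED=${FIXED:-/nfstree}

# pick the snapshot with the highest numeric suffix (the most recent one)
latest_snapshot() {
    ls -d "$1"/snapshot.* | sort -t. -k2 -n | tail -1
}

if [ -d "$SNAPDIR" ] && [ -d "$FIXED" ]; then
    latest=$(latest_snapshot "$SNAPDIR")
    mount --bind "$latest" "$FIXED"   # Linux; FreeBSD: mount_nullfs "$latest" "$FIXED"
    hb backup -c /hbbackup "$FIXED"
    umount "$FIXED"
fi
```

Because the backup always sees the fixed mount point, each run is a normal incremental backup rather than a full backup of a brand-new pathname.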
Dirvish Backups
Dirvish is an open-source backup tool that uses rsync to create
hard-linked backup trees. Each tree is a complete snapshot of a
backup source, but unchanged files are actually hard links. This
saves disk space since file data is not replicated in each tree. The
problem is that these trees often include a timestamp in the pathname.
If the trees are backed up directly, every pathname is unique. This
causes unreasonable metadata growth in the
hb.db file, which leads to
excessive RAM usage.
For more efficient backups, mount the tree directory (the directory containing user files) to a fixed pathname with a Linux bind mount or FreeBSD nullfs mount. See the NetApp section above for details.
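For example, assuming the Dirvish vault keeps each image under a timestamped directory with the user files in a tree subdirectory (this layout and every path below are assumptions, not part of HashBackup), a sketch:

```shell
#!/bin/sh
# Sketch: bind-mount the newest Dirvish image tree at a fixed pathname
# so HashBackup always sees the same paths. /backup/vault, the
# <timestamp>/tree layout, and /dirvishtree are assumptions.
VAULT=${VAULT:-/backup/vault}
FIXED=${FIXED:-/dirvishtree}

# timestamped image directories sort lexically, so the last is newest
newest_tree() {
    ls -d "$1"/*/tree | sort | tail -1
}

if [ -d "$VAULT" ] && [ -d "$FIXED" ]; then
    mount --bind "$(newest_tree "$VAULT")" "$FIXED"
    hb backup -c /hbbackup "$FIXED"
    umount "$FIXED"
fi
```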
A second issue with hard-link backups is that HashBackup maintains an
in-memory data structure for hard link inodes so that it can properly
relate linked files. This data structure is not very memory efficient
and uses ~165 bytes for each inode that is a hard link. This can be
disabled with the
--no-ino option, though that disables all hard-link processing.
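At ~165 bytes per hard-linked inode, the memory cost is easy to estimate; the 2-million-inode figure below is just an illustration:

```shell
# Rough memory estimate for HashBackup's hard-link tracking,
# at ~165 bytes per hard-linked inode (example count is hypothetical).
hardlink_mem_mb() {
    echo $(( $1 * 165 / 1024 / 1024 ))
}
echo "$(hardlink_mem_mb 2000000) MB"   # → 314 MB for 2 million hard links
```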
Rsnapshot Backups
Rsnapshot is an rsync-based backup tool that creates hard-linked
backup trees. A big difference is that with rsnapshot, the most
recent backup is contained in the
.0 directory, i.e., the most recent backup is at a fixed mount
point. HashBackup can be used to backup the .0 directory every
night. You should avoid backing up the other
rsnapshot directories. If you need to save the other directories:
- rename the rsnapshot directory you want to backup to a fixed pathname
- backup that fixed pathname
- rename the directory back to its original name
- repeat these steps for every rsnapshot directory you want to backup
- when finished, rename and backup the .0 directory the same way
- going forward, only save the .0 directory
- instead of renaming you can use the bind mount trick (see the NetApp section)
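The rename steps above could be scripted roughly like this. It is only a sketch: /snapshots, the daily.N interval names, and the fixed name "current" are assumptions to substitute with your own layout.

```shell
#!/bin/sh
# Sketch of the rename approach for saving older rsnapshot trees.
# /snapshots, the daily.N names, and the fixed name "current" are
# assumptions - substitute your own rsnapshot layout.
SNAPROOT=${SNAPROOT:-/snapshots}
HB=${HB:-hb}                     # path to the hb program

save_tree() {   # give a tree the fixed pathname, back it up, rename it back
    mv "$SNAPROOT/$1" "$SNAPROOT/current"
    $HB backup -c /hbbackup --no-ino "$SNAPROOT/current"
    mv "$SNAPROOT/current" "$SNAPROOT/$1"
}

if [ -d "$SNAPROOT/daily.1" ]; then
    for dir in daily.3 daily.2 daily.1; do   # oldest first; names assumed
        save_tree "$dir"
    done
fi
```

Because every tree is backed up under the same fixed pathname, HashBackup treats each run as an incremental backup instead of creating new metadata for every timestamped tree.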
Since rsnapshot backups are hard-link trees, nearly every file is a
hard link and you may want to consider using
--no-ino to lower
memory usage. This disables all hard-link processing in HashBackup.
Using HashBackup for Archiving
HashBackup is designed as a backup solution, where a collection of files is saved on a rather frequent basis: daily, monthly, etc. An archive is a special kind of backup that is more of a one-time event, for example, saving a hard drive or server that is going to be decommissioned. It could be 5 years (or never!) before the archive is accessed again.
HashBackup uses a database to store information about a backup. The format of this database is periodically changed as new features are added, and any structural changes are automatically applied as necessary when a new release of HB accesses a backup. HB can automatically update a backup's database as long as the backup has been accessed with a current version of HB within the last year or so. If it has been more than a year, the latest version of HB may or may not be able to automatically update the backup's database; an older version may need to be used first to partially upgrade the database, then a later version used to fully upgrade it.
One easy way to keep an archive’s database up to date is to create a
cron job to backup a very small file every month. This file doesn’t
have to be related in any way to the original archive - a dummy file
in /tmp is fine. Also include an
hb upgrade in the cron job.
Together these will ensure that your copy of the HB program stays
up-to-date and that your archive's database is automatically upgraded.
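A hedged sketch of such a monthly job, meant to be run from cron; /hbarchive, the dummy file path, and the hb location are assumptions, not from the HB documentation:

```shell
#!/bin/sh
# Monthly archive "keep-alive" sketch: backup a tiny dummy file so the
# archive's database stays upgraded, and run hb upgrade so the program
# stays current. /hbarchive and /tmp/hb-dummy are assumed paths; run
# from a monthly crontab entry, e.g. "0 3 1 * * /usr/local/bin/keepalive.sh".
HB=${HB:-/usr/local/bin/hb}
DUMMY=${DUMMY:-/tmp/hb-dummy}

keepalive() {
    "$HB" upgrade                       # keep the HB program up to date
    date > "$DUMMY"                     # guarantee the dummy file changed
    "$HB" backup -c /hbarchive "$DUMMY"
}

if [ -x "$HB" ]; then
    keepalive
fi
```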
Another way to handle archives is to always run the version of HB contained within the backup directory.
Syncing a New Destination
When a new destination is added to
dest.conf, HashBackup will
automatically sync existing backup data to the new destination. This
can be somewhat difficult when it is going to take days or weeks to
get the new destination in sync, either because bandwidth is limited
or the backup is already very large, and it interferes with daily
backups. A script like below can be used until the new destination
gets in sync. To get started, add the new destination to dest.conf
and include the
off keyword in the new destination. Then run this
script daily from crontab, modified for your own site.
IMPORTANT: this method only works if you have a complete copy of the
backup in the local directory. If you have
cache-size-limit set to
anything but -1, this method does not work because remote-to-remote
syncs happen before the backup starts running. For time-limited
remote-to-remote syncs, use this script but you’ll also need to use
either the Unix
timeout command or add a crontab entry with
killall -TERM hb to abort the sync at a certain time.
#!/bin/sh

# disable new destination for regular backup operation
sed -i.bak "s/^#off/off/" /hbbackup/dest.conf

nice /usr/local/bin/hb log backup -c /hbbackup /
nice /usr/local/bin/hb log retain -c /hbbackup -s30d12m -x3m
nice /usr/local/bin/hb log selftest -c /hbbackup -v4 --inc 1d/90d,900MB

# enable new destination to run sync for 20 hours
sed -i.bak "s/^off/#off/" /hbbackup/dest.conf
nice /usr/local/bin/hb log backup -c /hbbackup /hbbackup/inex.conf --maxwait 20h