
Backup

Performs a backup.  Advanced options are described later, but commonly-used options are:

  $ hb backup [-c backupdir] [-D mem] [-v level] [-X] [-t tagname] path1 path2 ...


-c  the backup is stored in the local backup directory specified with the -c option or a default backup directory according to built-in rules if -c isn't used.  If you don't want a local copy of the backup, set cache-size-limit with the config command and create a dest.conf file in the backup directory.  See Destinations for details and examples.

-D  enable dedup, saving only one copy of identical data blocks.  -D is followed by the amount of memory you want to dedicate to the dedup operation, for example: -D1g would use 1 gigabyte of RAM, -D100m would use 100 MB.  The more memory you have available, the more duplicate data HashBackup can detect, though it is still effective even with limited memory.  Dedup can also be enabled with the dedup-mem config option.  See Dedup Info for detailed information and recommendations.  -D0 disables dedup for this backup.

-v  control how much output is displayed, higher numbers for more:
  • -v0 = print version, backup directory, copy messages, error count
  • -v1 = print names of skipped filesystems
  • -v2 = print names of files backed up (this is the default)
  • -v3 = print names of excluded files
-X  backup will not cross mount points (separate file systems) unless -X is used, so you have to explicitly list on the command line each filesystem you want to backup.  Use the df command to see a list of your filesystems.  To see what filesystem a directory is on, use df followed by a pathname, or  use df . to see the current directory's filesystem.
 
-t  tag the backup with information-only text shown with hb versions
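
Putting the common options together, a typical nightly backup might look like this (the backup directory, tag, and paths are illustrative):

$ hb backup -c /hbbackup -D1g -v1 -t nightly /home /etc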


Each backup creates a new version in the backup directory containing all files modified since the previous backup.  An easy way to think of this is a series of incremental backups, stacked one on top of the other.  HashBackup presents the illusion that every version is a full backup, while performing the backups at incremental speeds and using incremental backup storage space and bandwidth.  While full backups are an option, there is no requirement to do them.  HashBackup is designed to not need full backups and you will experience huge savings of time, disk space, and bandwidth compared to traditional backup methods.  The retain command explains how to maintain weekly or monthly snapshots - or both!
 
Specifying the Backup Directory

All commands accept a -c option to specify the backup directory.  HashBackup will store your backup data and the encryption key in this directory.  If -c is not specified on the command line, the environment variable HASHBACKUP_DIR is checked.  If this environment variable doesn't exist, the directory ~/hashbackup is used if it exists.  If ~/hashbackup doesn't exist, /var/hashbackup is used.  Backup directories must first be initialized with the hb init command.

If you do not want a complete local copy of your backup, set the cache-size-limit config option with the config command to limit the size of the local backup directory.
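
For example, a backup directory can be initialized once and then used through the environment variable, so later commands don't need -c (the paths are illustrative):

$ hb init -c /var/hashbackup
$ export HASHBACKUP_DIR=/var/hashbackup
$ hb backup /home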

HashBackup saves all of your files' properties in a database, so backups can be stored on any type of "dumb" storage: FAT, VFAT, CIFS/SMB/Samba, NFS, USB thumb drives, SSHFS, WebDAV, FTP servers, etc.  All file attributes, including hard links, ACLs, and extended attributes will be accurately saved and restored, even if not supported by the storage system.  An encrypted incremental backup of the backup database is automatically included with every backup.

It's possible to backup directly to a mounted remote filesystem or USB drive with the -c option.  However, the backup performance may be slower than with a local filesystem, and they may not provide robust sync and lock facilities.  Sync facilities are used to ensure that the backup database doesn't get corrupted if your computer halts unexpectedly during the backup process, for example, if there is a power outage.  Lock facilities ensure that two backups don't run at the same time, possibly overwriting each others' backup files.  It's best to specify a local backup directory with -c and let HashBackup copy the backup data to mounted remote storage or USB drive by using a Dir destination in the dest.conf file in your backup directory.

Destinations

Remote destinations are optional.  They are set up in the dest.conf file, located in the backup directory.  See the Destinations page for more detailed info on how to configure destinations for SSH, RSYNC, and FTP servers, Amazon S3, Google Storage, Backblaze, Gmail, IMAP/email servers, and simple directory (dir) destinations.  Dir destinations are used to copy your backup to USB thumb drives, NFS, WebDAV, Samba, and any other storage that can be mounted as a filesystem.
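
As an illustration, a minimal Dir destination that copies the backup to a mounted USB drive might look like the sketch below; the destination name and mount point are made up, and the Destinations page has the authoritative keyword list:

destname usbcopy
type dir
dir /mnt/usbdrive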

The backup command creates backup archive files in the backup directory.  As archives fill up, they can be transmitted to one or more remote destinations, configured with a dest.conf file.  Transmissions run in parallel with the backup process.  After the backup has completed, it waits for all files to finish transmitting.  By default, archives are kept in the local backup directory after sending to remote destinations, making restores very fast because data does not have to be downloaded from a remote site.  Keeping a local copy also adds redundancy and makes it easier to pack remote archives (remove deleted or expired backup data) because archives don't have to be downloaded first.  It is possible to delete archives from the local backup directory as they are sent to destinations by using the cache-size-limit config option.

If you are storing your backup directly to a USB or other removable drive with the -c option, and that's the only backup copy you want, you do not need to use the dest.conf file at all.  dest.conf is only used when you want to copy your backup, usually to a remote site.

Important Security Note: your encryption key is stored in the file key.conf in the backup directory.  Usually this is a local directory, but if you are writing your backup directly to mounted remote storage with the -c option, for example, Google Drive, DropBox, etc., or to a removable USB drive, be sure to set a passphrase when initializing your backup directory.  This is done with the -p ask option to hb init or hb rekey.  Both the passphrase and key are required to access your backup.  If you are writing directly to mounted remote storage that you control, such as an NFS server, a passphrase may not be necessary.  If you are using a dest.conf Dir destination to copy your backup to remote storage or a removable drive, but your backup directory and key are on a local drive, a passphrase is not necessary because the key is never copied to destinations.
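
For example, a passphrase can be set when the backup directory is created (the path is illustrative):

$ hb init -c /Volumes/usbdrive/hbbackup -p ask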

Excludes


A handful of system files and directories are automatically excluded, like swapfiles, /tmp, core files, /proc, /sys, and hibernation files.  /var/hashbackup and the -c backup directory itself are also automatically excluded.  An inex.conf file is created by the init command in the backup directory showing which files are excluded.
 

To exclude other files, edit the inex.conf file in the backup directory.  The format of this file is:

# comment lines
E(xclude) <pathname>

Example inex.conf:

# exclude all .wav files:
Exclude *.wav

# exclude the /Cache directory and its contents:
e /Cache

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyz still works.
e /Cache/

# save the /Cache directory itself, but don't save the contents.
# Requesting a backup of /Cache/xyzdir will save the directory itself
# since it was explicitly requested, but will not save xyzdir's
# contents
e /Cache/*

Any abbreviation of the exclude keyword at the beginning of the line is recognized.  Tilde characters ~ are not expanded into user directories.  

There are several other ways to exclude files from the backup:
  • config option no-backup-tag can be set to a list of filenames, separated by commas.  If a directory contains any of these files, only the directory itself and the tag files will be saved.  HB does not read or check the contents of tag files.  Typical values here are .nobackup or CACHEDIR.TAG
  • files with the nodump flag set are not backed up
  • config option no-backup-ext can be set to a list of comma-separated extensions, with or without dots.  Any files ending in one of the extensions listed are not backed up.  This test is case-insensitive, so avi will exclude both .avi and .AVI files (see the example after this list)
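
For example, both options can be set with the config command like this (the values are illustrative):

$ hb config -c backupdir no-backup-tag .nobackup,CACHEDIR.TAG
$ hb config -c backupdir no-backup-ext avi,mkv,iso
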
Web browser caches are difficult to backup for a number of reasons (see Browser Cache note), and should always be excluded from backups, especially on large servers that store browser caches for many users.

Advanced Options

The backup command has a few more options that may be useful:

-m maxfilesize  any file larger than maxfilesize will be skipped.  The size can be specified with values like 100m or 100mb for 100 megabytes, 3g or 3gb for 3 gigabytes, etc.

-p procs  specifies how many additional processes to use for the backup.  Normally HashBackup uses 1 process on single-CPU systems and several processes on multi-core systems.  Multiple processes speed up the backup but also put a higher load on your system.  To reduce the performance impact of a backup on your system, you may want to use -p0 to force using only 1 CPU.  Or, with a very fast hard drive and many CPU cores, you may want to use more than just a few processes.  On a dual-core system, HashBackup will use both cores; on a system with 4 or more cores it will use 4.  To use more than 4 cores, -p must be used.  Backup also displays %CPU utilization at the end of the backup.  If you are using -p4 and %CPU is 170%, it means that on average, HashBackup could only make use of 1.7 CPUs because of disk I/O, network, or other factors, so -p4 may be overkill.  Or, it may make sense if you have one huge file that can take advantage of the extra CPU horsepower, even though most files cannot.  Experimenting is the best guide here.

-B blocksize  the backup command splits files into blocks of data.  It chooses a "good" block size based on the type of file being saved and by default uses variable-sized blocks averaging 48K.  For some backups you may want to force a specific, fixed block size, for example, VM images or raw database files (but not text dumps!) that are updated in fixed block sizes.  The block sizes allowed are: 1K 2K 4K 8K 16K 32K 64K 128K 256K 512K 1M 2M 4M.  In general: -B forces a fixed block size, which usually dedups less than a variable block size; larger block sizes are more efficient but dedup less; small block sizes cause more overhead but dedup better.  Sometimes the extra dedup savings of a small block size is less than the extra metadata needed to track more blocks.  Since this is very data dependent, experiment with your own data to determine the best block size to use.  Using -B1m usually speeds up backups on a single CPU system because it disables variable block sizes.  -B4M is useful when files do not have much common data, for example media files, and if there are identical files, backup will only save 1 copy.  Using the simulated-backup config option with different block sizes is very helpful for determining the best block size.
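
For example, a fixed block size can be forced on the command line like this (the paths are illustrative):

$ hb backup -c backupdir -D1g -B4k /vm/disk.img  (small blocks, best dedup)
$ hb backup -c backupdir -B4M /media/videos      (large blocks for media files)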

--full  the backup command usually backs up only modified files, and determines this by looking at mtime, ctime, and other file attributes for each file.  This option overrides that behavior and causes every file to be saved, even if it has not changed.  Taking a full backup adds redundancy to your backup data.  Another way to accomplish this is to create a new backup directory on both local and remote storage, but this is harder to manage and retain will not work across multiple backup directories.  Full backups also limit the number of incremental archives required to do large restores, though in testing, restoring a complete OSX system from 3 years of daily incrementals took only 15% longer than restoring from a full backup.

--maxtime time  specifies the maximum time to spend actually saving files.  When this time is exceeded, the backup stops and waits for uploads to finish using --maxwait, which is adjusted based on how long the backup took. 

--maxwait time    when used by itself, specifies the maximum time to wait for files to be transmitted to remote destinations.  When used with --maxtime, the wait time is reduced by the time taken to create the backup.

These two options are useful for huge initial backups, which take much longer than incremental backups.  They allow huge backups to span many days without going outside the time reserved for backups.  They also prevent incrementals from running into production time when a large amount of data changes for some reason.  If a backup does not complete because of --maxtime, the next backup using the same command line will restart where the previous backup left off, without rescanning the old data.  This allows backups of huge filesystems with tens of millions of files over a period of time.  Be aware that these time limits are not very accurate for various technical reasons.  Large arc files, large files, slow upload rates, and many worker threads cause more variability in the timing.  It's a good idea to set the time limits an hour less than you really want to get a feel for how much variability occurs in your backup environment.

Examples:


    --maxtime 1h means backup for up to 1 hour then wait the remainder of the hour to upload the data, ie, total time is limited to 1 hour

    --maxwait 1h means backup everything requested, however long it takes, but only wait 1 hour for uploads to finish

    --maxtime 1h --maxwait 1h means backup for up to 1 hour, then wait 1 hour + the remaining backup time for uploads to finish, ie, the total time is limited to 2 hours

    --maxtime 1h --maxwait 1y means backup for 1 hour, then wait a year for uploads to finish, ie, only the backup time is limited

--no-ino  on Unix filesystems, every file has a unique inode number.  If two paths have the same inode number, they are hard linked and both paths refer to the same file data.  Some filesystems do not have real inode numbers: FUSE (sshfs, UnRaid) and Samba/SMB/CIFS for example.  Instead, they invent inode numbers and cache them for a while, sometimes for many days.  HashBackup normally verifies a path's inode number during backups, and if it changes, it triggers a backup of the path.  Inode numbers are also used to detect hard links.  On a filesystem without real inode numbers, this causes unexpected and unpredictable full backups of files that haven't changed, and HashBackup will stop with an error.  The --no-ino option prevents these unnecessary full backups.

A negative side-effect is that hard-link detection is disabled since it relies on stable inode numbers.  With --no-ino, hard-linked files are backed up like regular files and if restored, are not hard-linked.
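
For example, a backup of an SMB mount might use --no-ino like this (the mount point is illustrative):

$ hb backup -c backupdir --no-ino /mnt/smbshare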
  

--no-mtime  there are two timestamps on every Unix file: mtime is the last time a file's data was modified, and ctime is the last time a file's attributes were modified (permissions, etc).  When a file's data is changed, both mtime and ctime are updated.  If only the attributes are changed, only ctime is updated.  Because mtime can be set by user programs, some administrators may not trust it to indicate that a file's data has not changed.  The --no-mtime option tells the backup command to verify file contents have not changed by computing a strong checksum rather than trusting mtime.  This is not recommended for normal use since the backup program has to read every unmodified file's data to compute the checksum, then possibly read it again to backup the file.

-X  backup saves all paths listed on the command line, even if they are on different filesystems.  But while saving / (root) for example, backup will not descend into other mounted filesystems.  It will save other filesystems if -X is used.  Be careful: it's easy to backup an NFS server or external drive unintentionally with -X.

-Z level  sets the compression level.  The compression level is 0 to 6, where 0 means no compression and 7-9 are reserved.  The default is 6, the highest compression; use a lower level for slightly faster but less compression.  Compression can be disabled by file extension with the no-compress-ext config variable.  See the config command for details.  Disabling compression with -Z0 usually increases backup space, disk writes, and backup transmission time, and is almost always slower than -Z1.
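
For example, compression could be lowered for speed, or disabled for already-compressed files, like this (a sketch assuming no-compress-ext takes a comma-separated list like no-backup-ext; the extensions and path are illustrative):

$ hb config -c backupdir no-compress-ext jpg,mp4,zip
$ hb backup -c backupdir -Z1 /home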

Raw Block Device, Partition, and Logical Volume Backups

If a block device name is used on the backup command line, all data stored on the block device will be saved.  The block device should either be unmounted or mounted read-only before taking the backup, or, with logical volumes, a snapshot must be taken and then the snapshot backed up.  To maximize dedup, a block size of 4K is good for most block device backups and is the default.  For very large devices, a block size of 64K or 1M will run faster, but may yield a bigger backup.  You have to experiment with your data to decide.
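
For example, a logical volume could be snapshotted and the snapshot backed up like this on Linux (a sketch; the volume group, volume names, and snapshot size are illustrative):

$ lvcreate -s -n rootsnap -L 1G /dev/vg0/root
$ hb backup -c backupdir -B4k /dev/vg0/rootsnap
$ lvremove -f /dev/vg0/rootsnap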

On Linux, logical volumes are given symbolic links that point to the raw device.  When a symbolic link is listed on the backup command line, its target (the raw device) will also be saved.

Named Pipe / Fifo Backups

If a named pipe or fifo is used on the backup command line, the fifo is opened and all data read from it is backed up.  For example, this can be used for database dumps: instead of dumping to a huge text file, then backing up the text file, you can direct the dump output to a fifo and let HB back up the dump directly, without needing the intermediate text file.

$ mkfifo fifo
$ cat somefile >fifo & hb backup -c backupdir fifo
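
As a more realistic sketch, a database dump could be backed up through a fifo without creating the intermediate dump file (the dump command is illustrative):

$ mkfifo dbdump.fifo
$ mysqldump --all-databases >dbdump.fifo & hb backup -c backupdir dbdump.fifo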

Be very careful not to start more than one process writing to the same fifo!  If this happens, the data from the two processes is intermixed at random, and HB cannot tell this is happening.  The resulting backup is unusable.

On Mac OSX and BSD, fifo backups are about half the speed of backups from a regular file because the fifo buffers are small.  With Linux, fifo backups can be faster than regular file backups.

Simulated Backups

It is often difficult to decide the best way to back up your specific set of data.  You may have millions of small files, very large files, VM disk images, photos, large mail directories, large databases, or many other types of data.  Different block sizes (-B option), arc file sizes (arc-size-limit config keyword), dedup table sizes (-D keyword), and arc file packing (pack-remote-archives, pack-percent-free config keywords) will all affect the size of your backup.  Your retention policy will also be a factor.  Determining all of these variables can be difficult.

To make configuring your backup easier, HB has simulated backups: the backup works as usual, but no backup data is written.  The metadata about the backup - file names, blocks, arc files, etc. - is still all tracked.  You can do daily incremental backups, remove files from the backup, and perform file retention, just like a normal backup, while using very little disk space.  The hb stats command is then used to see how much space a real backup would use.  You may want to try one configuration for a week or two, put that aside, try a different configuration for a while, and then compare the hb stats output for the two backups to see which works better for your data.

To create a simulated backup, use: hb config -c backupdir simulated-backup True before your first backup.  Once this option has been set and a backup created with it, it cannot be removed.  It also cannot be set on an existing backup.  Then, use normal HB commands as if this were a real backup, and use the stats command to see the results.
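
A simulated-backup experiment might look like this (a sketch; the directory, options, and paths are illustrative):

$ hb init -c /simtest
$ hb config -c /simtest simulated-backup True
$ hb backup -c /simtest -D1g -B64K /home
$ hb stats -c /simtest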

Some commands like get and mount will fail with a simulated backup since there is no real backup data.

Sparse Files

A sparse file is a somewhat unusual file where disk space is not fully allocated.  These are often used for virtual disk drive images and are sometimes called "thin provisioning".  An ls -ls command will show a number on the left, the space actually allocated to the file in K-bytes, and the size of the file in bytes, which may be much larger.  When supported by the OS, HB will backup a sparse file efficiently by skipping the unallocated file space, also called "holes", rather than backing up a long string of zero bytes.  If the OS or filesystem does not support sparse file hole skipping, the file will be saved normally.  In either case, a sparse file is always restored as a sparse file with holes.  Any -B block size can be used (or none, and HB will pick a block size). 

Sparse files are supported on Linux and BSD; Mac OS does not support sparse files.  Here is an example of creating a 10GB sparse file, backing it up, and restoring it:

$ echo abc|dd of=sparsefile bs=1M seek=10000
0+1 records in
0+1 records out
4 bytes (4 B) copied, 0.000209 seconds, 19.1 kB/s

$ hb backup -c hb -D1g -B4k sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Copied HB program to /home/jim/hb/hb#1463
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/home/jim/sparsefile

Time: 1.0s
Checked: 5 paths, 10485760004 bytes, 10 GB
Saved: 5 paths, 4 bytes, 4 B
Excluded: 0
Sparse: 10485760000, 10 GB
Dupbytes: 0
Space: 64 B, 139 KB total
No errors

$ mv sparsefile sparsefile.bak

$ time hb get -c hb `pwd`/sparsefile
HashBackup build #1463 Copyright 2009-2016 HashBackup, LLC
Backup directory: /home/jim/hb
Most recent backup version: 0
Restoring most recent version

Restoring sparsefile to /home/jim
/home/jim/sparsefile
Restored /home/jim/sparsefile to /home/jim/sparsefile
No errors

real    0m0.805s
user    0m0.150s
sys     0m0.100s

$ ls -ls sparsefile*
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile
20 -rw-rw-r-- 1 jim jim 10485760004 Feb  8 18:03 sparsefile.bak


Virtual Machine Backups

There are 2 ways to backup a VM: run HashBackup inside the VM as an ordinary backup, or run HashBackup from the VM host machine.  Running HB inside the VM is just like any other backup and doesn't have special considerations.  VM images can be backed up while running HashBackup on the VM host, ie, outside the VM.  What you are saving is a large file or collection of large files, often sparse, that together are the VM data for the virtual machine.  This has 2 special considerations: consistent backups, and choosing a block size.

Consistent VM Image Backups

If a VM image is backed up while the VM is running, the disk image files being saved may change during the backup.  For non-critical VMs that don't have a lot of activity, this isn't a big deal and is probably fine.  If you restore a VM image saved this way, it is strongly recommended to run a forced fsck to ensure your filesystem is okay.

To get a consistent VM backup, you can: a) suspend the VM during the backup; b) take a snapshot with the VM software, or c) take a snapshot on the VM host filesystem, for example, an LVM snapshot, and backup the snapshot.  With method b),  you would need to revert to the snapshot after restoring the VM.

Block Size for VM Image Backups

HashBackup normally uses a variable block size that depends on the data being saved.  For files that are updated in fixed-size blocks, like VM images, a fixed block size may give a smaller backup and usually gives faster backup performance.

Smaller block sizes give higher dedup, though also cause more overhead because HB has to track more blocks.  If HashBackup recognizes the file suffix for a VM image, for example, .vmdk, .qcow, etc., it will switch to a 4K block size.  For higher performance, especially with very large VM images, a larger block size like -B1M could be used.  For a "middle of the road" strategy, use -B64K.  

Important note: if you expect to dedup across several VM images, the most effective way to do that is with -B4K.  The reason is that different VMs will likely have different filesystem layouts, even if they have the same contents.  Since most filesystems allocate space in 4K chunks, the best way to dedup across VMs is to use the same 4K block size.    

When choosing a block size, here are some things to keep in mind:
  • How big is the VM?  For small VMs, the block size won't make much difference, so don't worry about it.
  • How much dedup do you expect / want?  A smaller block size will dedup better, a larger block size will save faster.
  • Do you expect dedup across VM images?  If so, you probably have to use -B4K.
  • If you only want to dedup a VM image against itself, from one incremental backup to the next, you can use any block size.  Larger block sizes will create a larger backup but will run faster and restore faster.  Smaller block sizes will probably create a smaller backup, take longer to run, and will take longer to restore.
  • If you want to dedup a VM image backup against itself within one backup rather than between incrementals, you will need to use -B4K.  Example: a VM image contains two identical 1 GB files.  HashBackup will only reliably dedup these files with -B4K because the individual blocks of the two files could be scattered throughout the VM image, depending on how the filesystem inside the VM decides to allocate space.
  • How much data changes between backups?  For a very active VM, a smaller block size will create smaller backups but a larger block size will run faster.  For an inactive VM, not much data will be saved anyway, so don't worry about the block size.
  • Do you need fast restores?  A smaller block size will take longer because there will be more seeks during the restore; a larger block size will restore faster.  It's hard to quantify, so do test backups and restores to compare restore times.
  • These trade-offs are very data dependent.  To see what works best for your VM images, it's recommended to start with simulated backups (see simulated-backup config option).  You can run simulated backups for a few days with one block size, then create a new simulated backup for a few days with a different block size, and compare how long the backups run vs how much backup space is used.  Use hb stats to check backup space used by a simulated backup.

Example Backup #1

Backup of CentOS /usr directory running in a Mac OSX VM.
First create the backup directory:

[root@cent5564 /]# hb init -c /hbbackup
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Permissions set for owner access only
Created key file /hbbackup/key.conf
Key file set to read-only
Setting include/exclude defaults: /hbbackup/inex.conf

VERY IMPORTANT: your backup is encrypted and can only be accessed with
the encryption key, stored in the file:
    /hbbackup/key.conf
You MUST make copies of this file and store them in a secure location,
separate from your computer and backup data.  If your hard drive fails,
you will need this key to restore your files.  If you setup any
remote destinations in dest.conf, that file should be copied too.
       
Backup directory initialized

Drop filesystem cache and do the initial (full) backup.
This backup is slower than normal because the VM is only configured for 1 core.

[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches

[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
Copied HB program to /hbbackup/hb#1605
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr

Time: 190.1s, 3m 10s
CPU:  125.5s, 2m 5s, 66%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 72311 paths, 1227204836 bytes, 1.2 GB
Excluded: 0
Dupbytes: 175353351, 175 MB, 14%
Compression: 67%, 3.1:1
Efficiency: 6.26 MB reduced/cpusec
Space: 402 MB, 402 MB total
No errors

real    3m10.617s
user    1m13.299s
sys    0m53.212s

Drop filesystem cache again to time an incremental backup:

[root@cent5564 /]# echo 3 >/proc/sys/vm/drop_caches

[root@cent5564 /]# time hb backup -c /hbbackup -D1g -v1 /usr
HashBackup build #1605 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbbackup
Cores: have 1, init 1, max 1
This is backup version: 1
Dedup enabled, 21% of current, 0% of max
Backing up: /hbbackup/inex.conf
Backing up: /usr

Time: 22.1s
CPU:  19.1s, 86%
Checked: 72311 paths, 1422698489 bytes, 1.4 GB
Saved: 2 paths, 0 bytes, 0
Excluded: 0
No errors

real    0m22.603s
user    0m5.949s
sys    0m13.608s



Example #2 Backup & Restore Mac Mini Server

Backup Intel i7 Mac Mini Server root drive (643,584 files in 89GB) to the other drive.
Both drives are slower 5600 rpm hard drives.
During this backup, HashBackup used 267MB of RAM.

sh-3.2# purge; /usr/bin/time -l hb backup -c /hbtest -v1 -D1g /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /
Mount point contents skipped: /dev
Mount point contents skipped: /home
Mount point contents skipped: /net
Backing up: /hbtest/inex.conf

Time: 2323.1s, 38m 43s
CPU:  2004.4s, 33m 24s, 86%
Checked: 643631 paths, 89316800581 bytes, 89 GB
Saved: 643584 paths, 89238502003 bytes, 89 GB
Excluded: 47
Dupbytes: 21638127507, 21 GB, 24%
Compression: 36%, 1.6:1
Efficiency: 15.32 MB reduced/cpusec
Space: 57 GB, 57 GB total
No errors

     2325.97 real      1746.39 user       258.65 sys
 267079680  maximum resident set size
   3804171  page reclaims
      1019  page faults
         0  swaps
     88820  block input operations
     37261  block output operations
    823626  voluntary context switches
   1314791  involuntary context switches


Restore the backup - 28 minutes.

sh-3.2# purge; /usr/bin/time -l hb get -c /hbtest -v1 /
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /hbtest
Most recent backup version: 0
Restoring most recent version
Begin restore

Restoring / to /Users/jim/test
Restore? yes
Restored / to /Users/jim/test
0 errors
     1672.70 real      1333.01 user       190.00 sys
 276615168  maximum resident set size
   1388049  page reclaims
       964  page faults
         0  swaps
      2732  block input operations
     53260  block output operations
     35939  voluntary context switches
   3865449  involuntary context switches



Example 3: Incremental backup of Mac Mini with 1746 daily backups over 6 years


This is the daily backup of a 2010 Mac Mini development server (Intel Core 2 Duo 2.66GHz) with 2 hard drives.  There are 12 virtual machines hosted on this server so it averages about 80% idle.  (This is an older version of HashBackup; current versions are faster.)

sh-3.2# hb backup -c /backup /
HashBackup build #1496 Copyright 2009-2016 HashBackup, LLC
Backup directory: /backup
Using destinations in dest.conf
This is backup version: 1746
Dedup enabled, 58% of current, 12% of max
/
... (pathnames backed up are listed here)

Time: 1158.5s, 19m 18s
Wait: 87.9s, 1m 27s
Checked: 579013 paths, 82665632485 bytes, 82 GB
Saved: 207 paths, 26949234198 bytes, 26 GB
Excluded: 33
Dupbytes: 26726881567, 26 GB, 99%
Compression: 99%, 727.7:1
Space: 37 MB, 96 GB total
No errors
Exit 0: Success

There are many VM disk images stored on this server, and a tiny change in these disk images means the whole disk image has to be backed up.  In this backup, out of 82GB total, files totalling 26GB were changed.  But because the amount of actual data changed was small, HashBackup's dedup feature compressed a 26GB backup down to 37MB of new backup data.  The backup was also copied to a USB drive for redundancy.  A full-drive restore from this backup is shown on the Get page.


Example 4: Full and incremental backup of 2GB Ubuntu VM image

First backup a 2GB Ubuntu test VM image with 64K block size.  This is on a 2010 Macbook Pro, Core 2 Duo 2.5GHz, SSD system.  The peak HB memory usage during this backup was 77MB.  The purge command clears all cached disk data from memory. 

$ purge; hb backup -c hb -B64K -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile

Time: 36.9s
CPU:  61.6s, 1m 1s, 166%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 95289344, 95 MB, 4%
Compression: 56%, 2.3:1
Efficiency: 17.63 MB reduced/cpusec
Space: 873 MB, 873 MB total
No errors

Touch the VM image to force an incremental backup.  No data actually changed, but HashBackup still has to scan the file for changes.  Peak memory usage during this backup was 77MB. 

$ touch ../testfile; purge; hb backup -c hb -D1g -B64K ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile

Time: 15.1s
CPU:  18.4s, 121%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 49119.0:1
Efficiency: 104.30 MB reduced/cpusec
Space: 40 KB, 873 MB total
No errors

Try the same backup and incremental, this time with a variable block size.  This gives 2x more dedup and a smaller backup, but 5% higher runtime.  Peak memory usage again was around 78MB.

$ purge; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
Copied HB program to /Users/jim/hbrel/hb/hb#1619
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
/Users/jim/hbrel/hb/inex.conf
/Users/jim/testfile

Time: 40.9s
CPU:  66.7s, 1m 6s, 163%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 7 paths, 2011914751 bytes, 2.0 GB
Excluded: 0
Dupbytes: 162587404, 162 MB, 8%
Compression: 57%, 2.4:1
Efficiency: 16.56 MB reduced/cpusec
Space: 853 MB, 853 MB total
No errors

Now an incremental backup with variable block size.  This shows that for VM images, variable-block incremental backup is slower than fixed-block backup because data has to be scanned byte-by-byte.

$ purge; touch ../testfile; hb backup -c hb -D1g ../testfile
HashBackup build #1619 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Users/jim/hbrel/hb
This is backup version: 1
Dedup enabled, 16% of current, 0% of max
/Users/jim/testfile

Time: 21.4s
CPU:  24.8s, 115%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 2011914240, 2.0 GB, 100%
Compression: 99%, 40932.5:1
Efficiency: 77.36 MB reduced/cpusec
Space: 49 KB, 853 MB total
No errors

Let's compare how fast dd can read the test VM image vs HashBackup, both with a block size of 64K.  The results below show dd reads at 233 MB/sec on this system, while HashBackup reads and compares to the previous backup at 133 MB/sec (if no data changes).   The difference between the full backup time where all data is saved, and the incremental time where no data is saved, gives an idea of the time HashBackup needs to backup changed data.  Based on this, HashBackup can backup 1% of changed data for this file in about .218 seconds, or 2.18 seconds if 10% of data changed.  This is added to the incremental time with no changes to estimate the backup time with changed data.  You can use similar formulas to estimate backup times of large VM files.

IMPORTANT: these tests are with a 2010 Intel 2.5GHz Core2Duo CPU.  More recent CPUs will have faster backup times.

$ purge; dd if=../testfile of=/dev/null bs=64k
30699+1 records in
30699+1 records out
2011914240 bytes transferred in 8.638740 secs (232894413 bytes/sec)

# compute how fast HashBackup did an incremental backup and
# estimate how long it takes to backup 1% changed data

$ bc
bc 1.06
Copyright 1991-1994, 1997, 1998, 2000 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.

2011914240/15.1
133239353.642   <==  HB incremental backup rate was 133.23 MB/sec for unchanged VM image

(36.9-15.1)/100
.218            <==  approx seconds to backup 1% changed data in the test VM image

Do an incremental backup with 10% of the 2GB VM image data changed at random, to verify the expected runtime of about 17.3 seconds (15.1s + 2.18s) for 10% changed data:


$ purge; time hb backup -c hb -D1g -B64K ../testfile

Backup directory: /Users/jim/hbrel/hb
Group test & size:  262144 1048576
This is backup version: 2
Dedup enabled, 13% of current, 0% of max
/Users/jim/testfile

Time: 17.2s                              <== runtime is a bit lower than expected
CPU:  23.1s, 134%
Checked: 7 paths, 2011914751 bytes, 2.0 GB
Saved: 6 paths, 2011914240 bytes, 2.0 GB
Excluded: 0
Dupbytes: 1816027136, 1.8 GB, 90%        <== shows 10% changed data
Compression: 95%, 22.2:1
Efficiency: 79.42 MB reduced/cpusec
Space: 90 MB, 964 MB total
No errors


NetApp Snapshot Backups

NetApp Filers have a built-in snapshot capability.  Snapshots can be taken automatically with generated pathnames, or manually with a specified pathname.  An example of an automatic name would be /nfs/snapshot.0, /nfs/snapshot.1, etc.  with the higher number being the most recent snapshot.  Saving the highest snapshot is a problem for HashBackup because the pathname changes on every backup, causing unreasonable metadata growth in the hb.db file.

To make efficient backups, use a bind mount on Linux or nullfs mount on FreeBSD to make the snapshot directory appear at a fixed name:

$ mount --bind /nfs/snapshot.N /nfstree  (Linux)

$ mount_nullfs /nfs/snapshot.N /nfstree  (FreeBSD)

Then backup /nfstree with HashBackup.


Dirvish Backups

Dirvish is an open-source backup tool that uses rsync to create hard-linked backup trees.  Each tree is a complete snapshot of a backup source, but unchanged files are actually hard links.  This saves disk space since file data is not replicated in each tree.  The problem is that these trees often include a timestamp in the pathname.  If the trees are backed up directly, every pathname is unique.  This causes unreasonable metadata growth in the hb.db file, which leads to a large memory cache and excess RAM usage.

For more efficient backups, mount the tree directory (the directory containing user files) to a fixed pathname with a Linux bind mount or FreeBSD nullfs mount.  See the NetApp section for details.

A second issue with hard-link backups is that HashBackup maintains an in-memory data structure for hard link inodes so that it can properly relate linked files.  This data structure is not very memory efficient and uses ~165 bytes for each inode that is a hard link.  This can be disabled with the --no-ino option, though that disables all hard-link processing.


Rsnapshot Backups

Rsnapshot is an rsync-based backup tool that creates hard-linked backup trees.  A big difference is that with rsnapshot, the most recent backup is contained in the .0 directory, ie, the most recent backup is at a fixed pathname.  HashBackup can be used to backup this .0 directory every night.  You should avoid backing up the other rsnapshot directories.  If you need to save the other directories:

- rename backup.0 to backup.0.save
- rename the rsnapshot directory you want to backup as backup.0
- save backup.0
- rename the directory back to its original name
- repeat these steps for every rsnapshot directory you want to backup
- when finished, rename backup.0.save to backup.0
- going forward, only save backup.0 with HashBackup
- instead of renaming you can use the bind mount trick (see the NetApp section and the sketch below)
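
For example, saving one extra rsnapshot directory with the renaming approach might look like this (the directory names are illustrative):

$ mv backup.0 backup.0.save
$ mv weekly.2 backup.0
$ hb backup -c backupdir backup.0
$ mv backup.0 weekly.2
$ mv backup.0.save backup.0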

Since rsnapshot backups are hard-link trees, nearly every file is a hard link and you may want to consider using --no-ino to lower memory usage.  This disables all hard-link processing in HashBackup.