Get

Restores files and directories from a backup.

$ hb get [-c backupdir] [-r version]  [-v level]
         [--no-ino] [--orig] [--delete] [-i] [--no-sd]
         [--todev blockdev] [--cache cachedir[,lock]]
         [--plan] [--no-local] [--no-mtime] [--splice]
         pathnames

Relative paths and simple filenames are changed to full paths using the current directory.

By default, the get command restores the last component of the pathname to the current directory. For example, if /etc/rc.conf is restored, a new rc.conf file is created in the current directory. If /etc is restored, a new etc directory will be created in the current directory. For extra safety, consider cd /tmp before a restore.

--orig restores files to their original locations instead of the current directory.
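The destination rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not HashBackup's code; the function name `default_restore_target` and its parameters are hypothetical.

```python
import posixpath

def default_restore_target(pathname, cwd, orig=False):
    """Return where a restored pathname would land (illustrative only)."""
    # Relative paths and simple filenames become full paths using cwd.
    full = posixpath.normpath(posixpath.join(cwd, pathname))
    if orig:
        # --orig: restore to the original location.
        return full
    # Default: restore the last path component into the current directory.
    last = posixpath.basename(full)
    return posixpath.join(cwd, last) if last else cwd

print(default_restore_target("/etc/rc.conf", "/tmp"))
print(default_restore_target("/etc", "/tmp"))
print(default_restore_target("/etc/rc.conf", "/tmp", orig=True))
```

With the current directory set to /tmp, the first two calls give /tmp/rc.conf and /tmp/etc, matching the examples in the text, and the third gives /etc/rc.conf.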

Get always asks before a file or directory is restored over an existing file or directory. For safety, get may sometimes refuse to do a restore over existing data if it believes the restore may not be what you really want. For example, it will not restore a file over an existing non-empty directory with the same name. This can be avoided by restoring to a different location, or removing or renaming the existing file or directory before the restore.

When restoring over existing directories, get normally merges the backup contents with the directory contents. If the existing files match the backup files, they are left alone (unless --no-local is used). Otherwise, they are overwritten with the backup file. The --delete option deletes existing files and directories that are not in the backup. This is similar to the rsync --delete option and can be used to "sync" a directory to the backup contents.
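As a rough model of the merge rules just described, the Python sketch below decides what happens to each name in an existing directory. The function name, the dictionaries, and the "fingerprint" strings standing in for file contents are assumptions made for illustration; the real command works on files, not dictionaries.

```python
def plan_directory_restore(existing, backup, delete=False, no_local=False):
    """Map each filename to an action (illustrative model only).

    existing, backup: dicts of filename -> fingerprint of the file contents.
    """
    actions = {}
    for name, fingerprint in backup.items():
        if name not in existing:
            actions[name] = "restore"
        elif existing[name] == fingerprint and not no_local:
            actions[name] = "keep"        # matches the backup, left alone
        else:
            actions[name] = "overwrite"   # differs, or --no-local was given
    for name in existing:
        if name not in backup:
            # Only --delete removes files that are not in the backup.
            actions[name] = "delete" if delete else "untouched"
    return actions

existing = {"a.txt": "h1", "b.txt": "old", "extra.txt": "x"}
backup = {"a.txt": "h1", "b.txt": "new", "c.txt": "h3"}
print(plan_directory_restore(existing, backup))
print(plan_directory_restore(existing, backup, delete=True))
```

In this model, extra.txt survives a plain merge and is removed only when delete=True, which is the "sync" behavior described above.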

The -r option selects a specific version to restore. This will restore files that are backed up at this version or an earlier version as necessary. With -r1, some of the files may have been backed up in version zero and some backed up in version 1; the get command will automatically select, for each file, the latest version that is not newer than the requested one. If -r is not used, the get command restores files from their most recent backup.
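The per-file selection can be pictured as "take the newest saved version that is not newer than -r". The Python sketch below assumes a hypothetical list of version numbers in which a file was saved; it illustrates the rule and says nothing about how HashBackup stores this information.

```python
def select_version(saved_versions, r=None):
    """Pick the version a file is restored from (illustrative only).

    saved_versions: version numbers in which this file was backed up.
    r: the version requested with -r, or None for the most recent backup.
    """
    candidates = [v for v in saved_versions if r is None or v <= r]
    return max(candidates) if candidates else None

print(select_version([0], r=1))        # file unchanged since version 0
print(select_version([0, 1, 3], r=1))  # file saved again in version 1
print(select_version([0, 1, 3]))       # no -r: most recent backup
```

With -r1, the first file comes from version 0 and the second from version 1, as in the example above.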

Using Local Files

When possible, get will use local files to assist the restore. This is sometimes called incremental restore. Get looks in both the original backup directory and the target restore directory for files with the same name, mtime (modification time) and size to select matches. This makes restarted restores very fast since data already downloaded and restored does not have to be restored a second time.
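The name, mtime and size comparison might look like the following Python sketch. The function name and the whole-second mtime comparison are assumptions made for illustration; the source does not state the time resolution that get uses.

```python
import os

def local_file_matches(path, backup_size, backup_mtime):
    """Return True if a local file looks identical to the backup copy.

    Compares only size and modification time (whole seconds here, which is
    an assumption); file contents are not read.
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return False
    return st.st_size == backup_size and int(st.st_mtime) == int(backup_mtime)
```

A check like this is cheap because it never reads file data, which is why the text offers --no-mtime as a stricter alternative.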

For a more strict verification, the --no-mtime option can be used. This computes a strong cryptographic hash of the local file and ensures it matches the hash of the backup file before using the local file for the restore.

To disable the use of local data, use the --no-local option. This is a useful option when doing test restores immediately following a backup to verify remote backup data.

When local files have been modified, it is still possible to use some local data combined with some remote data to restore files using the --splice option. This requires reading through local files and uses more temp space in the backup directory, but can save significant download bandwidth. During splicing, a strong file hash is computed for the restored file and compared to the original backup file’s strong hash to ensure it was correctly restored.

The -v option controls how much output is displayed:

-v0 = don’t print filenames as they are restored
-v1 = print some basic headings only
-v2 = print filenames as they are restored
-v3 = print local filenames not used in the restore
-v4 = print local files used in the restore

The -i option asks before starting the restore. This is useful for comparing restore plans.

The --no-sd option disables selective download, which is what HashBackup uses to download parts of files. This option might speed up large restores when cache-size-limit is set because the planning stage is faster and uses less RAM. More data is downloaded because whole arc files are downloaded instead of just the pieces needed, but fewer requests are issued to remote servers. This option might be useful on high-bandwidth and/or high-latency networks, or when RAM is limited.

Cache Directory

When the config option cache-size-limit is set, some arc (backup data) files are stored locally in the backup directory and some are on remote destinations and have to be downloaded to the backup directory during restore. This poses challenges:

  1. the backup directory may be on a small disk, like an SSD, and might not have room to download the data needed to do a large restore.

  2. during a restore, the backup directory has to be locked since the cache of arc files might be changing. This lock prevents backup during restore and keeps two different restores from running simultaneously.

The --cache option solves these problems. It specifies a separate directory used for storing downloaded data needed during the restore. If the cache directory already exists and matches the backup directory version, some downloaded data can be reused and the cache directory will remain after the get command. If the cache directory doesn’t exist when get starts, it will be created and then deleted when get finishes. So to use the cache directory for several restores, create it before using get.
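The lifetime rule for the cache directory (kept if it already existed, removed if get created it) can be modeled with a small Python context manager. The name `restore_cache` is hypothetical, and the sketch ignores the version check mentioned above.

```python
import os
import shutil
from contextlib import contextmanager

@contextmanager
def restore_cache(cachedir):
    """Model of the cache directory lifetime (illustrative only)."""
    preexisting = os.path.isdir(cachedir)
    if not preexisting:
        os.makedirs(cachedir)          # created for this restore only
    try:
        yield cachedir
    finally:
        if not preexisting:
            shutil.rmtree(cachedir)    # deleted again when the restore ends
```

This is why a cache meant to be shared across several restores has to be created before the first get command runs.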

The --cache option has an optional modifier ,lock that can be added to request locking the main backup directory. This allows using arc files already in the backup directory instead of downloading them again. If ,lock is not used, backup directory arc files can only be reused if the cache directory is on the same disk as the backup directory, allowing hard linking.

Restore Integrity

During backup, HashBackup stores a SHA1 checksum for each block of data within a file and a separate SHA1 checksum for the entire file. During a restore, all of these checksums are verified to ensure that the data restored is identical to the data that was backed up.
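The two levels of checksum can be illustrated with Python's hashlib. The fixed 4096-byte block size and the function names are assumptions for this sketch; HashBackup's actual block boundaries and storage format are not described here.

```python
import hashlib

def checksum_blocks(data, block_size=4096):
    """Return (per-block SHA1 list, whole-file SHA1) for a byte string."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return [hashlib.sha1(b).hexdigest() for b in blocks], hashlib.sha1(data).hexdigest()

def verify_restore(restored, block_sums, file_sum, block_size=4096):
    """Check restored data against the checksums recorded at backup time."""
    got_blocks, got_file = checksum_blocks(restored, block_size)
    return got_blocks == block_sums and got_file == file_sum

original = b"abc" * 5000
block_sums, file_sum = checksum_blocks(original)
print(verify_restore(original, block_sums, file_sum))
print(verify_restore(original[:-1] + b"X", block_sums, file_sum))
```

The first call reports a match and the second, with one byte changed, does not.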

HashBackup handles deleted files correctly: if a directory contains files A and B in backup versions 0 and 1, and file B is deleted and a backup occurs (version 2), then restoring the directory with -r0 or -r1 will restore both A and B, but restoring with -r2 will only restore file A. If you try to restore file B with -r2, an error message is displayed indicating that the file was deleted in version 2. You can restore the deleted file B by using the -r0 or -r1 options.
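The A and B example can be replayed with a small Python model in which each file has a list of (version, event) records. The record format and the function names are hypothetical and exist only to illustrate the rule.

```python
def visible_at(history, r):
    """True if a file exists in the backup as of version r.

    history: list of (version, event) in version order, where event is
    "saved" or "deleted".
    """
    state = None
    for version, event in history:
        if version > r:
            break
        state = event
    return state == "saved"

def restore_listing(directory, r):
    """Names in a directory that a restore with -r would bring back."""
    return sorted(name for name, hist in directory.items() if visible_at(hist, r))

directory = {
    "A": [(0, "saved")],
    "B": [(0, "saved"), (2, "deleted")],
}
for r in (0, 1, 2):
    print(r, restore_listing(directory, r))
```

Versions 0 and 1 list both A and B, while version 2 lists only A.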

Unstable Inode Numbers and --no-ino

The --no-ino option is used to back up and restore filesystems like CIFS, UnRAID, and sshfs (FUSE in general) that do not have stable inode numbers. In typical Unix filesystems, every unique file has a unique inode number. But with these filesystems, the inode numbers are re-used, so two different files can end up with the same inode number. This is a problem because during restores, HashBackup hard links files with the same inode number. To prevent incorrect hard linking, use the --no-ino option for both backups and restores of these types of filesystems. After the initial backup, HashBackup will stop with an error if it sees that file inode numbers are not stable and --no-ino was not used.
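To see why reused inode numbers are dangerous, consider this simplified Python model of grouping paths by inode number. It is a sketch under assumptions: the function name is hypothetical, and treating --no-ino as "never group by inode number" is a simplification rather than a description of the actual implementation.

```python
from collections import defaultdict

def hardlink_groups(entries, no_ino=False):
    """Group paths that a restore would hard link together (model only).

    entries: list of (path, inode_number) as recorded at backup time.
    """
    if no_ino:
        # Assumed behavior: inode numbers are ignored, so no grouping occurs.
        return [[path] for path, _ in entries]
    groups = defaultdict(list)
    for path, inode in entries:
        groups[inode].append(path)
    return list(groups.values())

# Two unrelated files that happened to receive the same reused inode number.
entries = [("/mnt/a", 7), ("/mnt/b", 7), ("/mnt/c", 9)]
print(hardlink_groups(entries))
print(hardlink_groups(entries, no_ino=True))
```

Without no_ino, the unrelated files /mnt/a and /mnt/b land in the same group and would be linked to one another.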

Restore Plans and Limited Cache

When local backup directory disk space is limited, the cache-size-limit config option specifies how much backup data (arc files) can remain in the backup directory. For restores with a limited cache, HashBackup has to create a restore plan to decide what data to download, the download order, and how long blocks have to stay in the cache during the restore. For a very large restore with a lot of small blocks, this restore planning can take some time. For example, a 1TB VM image will contain over 240M blocks if saved at the default 4K block size. Restore plans are saved in the backup directory so that if the restore is restarted for some reason, the plan can be reused.
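The block count quoted above is simple arithmetic, shown here as a short Python check. The helper name `block_count` is made up for this example, and 4K is taken to mean 4096 bytes.

```python
def block_count(size_bytes, block_size=4096):
    """Number of fixed-size blocks needed to hold size_bytes (rounded up)."""
    return -(-size_bytes // block_size)  # ceiling division

print(block_count(10**12))  # 1 TB counted as 10^12 bytes
print(block_count(2**40))   # 1 TB counted as 2^40 bytes
```

The results are 244,140,625 and 268,435,456 blocks, so the figure is over 240M under either definition of a terabyte.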

To speed up restores in a disaster recovery situation, it is possible to use the --plan option to generate restore plans before they are needed. This can be done for one or several different restores. It is important to remember that each list of filenames to restore in one get command creates a unique restore plan. So if restore plans are created for restoring files f1 and f2 separately, a different restore plan is created for restoring f1 and f2 together in one restore. If the backup database is changed in any way, such as another backup occurs, all restore plans are invalidated. The typical way to use --plan would be as part of the daily backup procedure.

Restore plans are only saved if the --no-local option is used, because checking for modified local files has to be done just before every restore. The --plan option can be used with and without --no-local to see how this affects the amount of data downloaded.

Raw Block Device Restores

When a raw block device is restored, the data can be stored in 3 locations:

  1. --orig = restore the data to its original block device (must be unmounted)

  2. --todev = restore the data to a different, unmounted block device

  3. neither option means restore the data to a file in the current directory, using the device name as filename

Some systems (devmapper on Linux) use symbolic links to point to the actual block device. When you back up the symlink, the actual block device is included in the backup. For example:

# hb backup -c hb /dev/mapper/ub1264-swap_1
HashBackup build 971 Copyright 2009-2013 HashBackup, LLC
Backup directory: /home/jim/hb
Adding symlink target to backup: /dev/mapper/ub1264-swap_1 -> /dev/mapper/../dm-1
This is backup version: 0
Dedup is not enabled
/dev/dm-1
/dev/mapper/ub1264-swap_1

If you restore the symlink, get tries to do something reasonable. If the symlink already exists but points to a different device, get will restore to that device, but asks for confirmation. The safest option is to use --todev and restore directly to the block device itself, /dev/dm-1 in this example, not the symlink. Then you know exactly where your restored data is written.

Restore Performance

Good restore performance is important for any backup system. Below are a few different restores from an actual backup to give an idea of HashBackup’s restore performance. The backup and restores are on a 2.66GHz Intel Core 2 Duo (2010) Mac Mini OSX server with 2x750GB hard drives. The system has been backed up daily since October 2010 and is the HashBackup build server. There are 5 virtual machines running, with disk images from 3-10GB each, 6 other virtual machines, and the usual OSX files. One hard drive is the root drive, the other stores the backups. The root drive has around 81GB of data being backed up daily. Backup for this system uses ~350MB of RAM, 2/3rds of that for the dedup table. Restores typically use 100-150MB of RAM, independent of the file or backup size, or the size of the dedup table (it is not used during restore).

Important note: these are not restores from a traditional "full" backup, or a version 0 backup where everything was just saved. These restores are from 1746 incremental, deduped backups saved over a 6 year period.

Test 1: restore a complete copy of the OSX root drive from 1746 incremental backups for 6 years

sh-3.2# time hb get -c /backup / -v1
HashBackup build #1496 Copyright 2009-2016 HashBackup, LLC
Backup directory: /backup
Most recent backup version: 1746
Restoring most recent version
Using destinations in dest.conf
Begin restore

Restoring / to /test
Restore? yes
Restored / to /test
No errors

real    75m55.703s
user    51m20.498s
sys      7m57.109s

sh-3.2# du -ksc /test
81719612    /test
81719612    total

sh-3.2# find .|wc
578956  743762 57345665

In this test, 578,956 files and directories were restored, a total of 81.7GB in 76 minutes, averaging 17 MB/sec. The system was also running 11 virtual machines, so it was not idle during the test (it averages 80% idle), and most of these files were on the small side, so there is more seek overhead. A similar test would be to do an initial backup of the root file system and then a complete restore. This could be expected to run faster because the backup data is not scattered over 1746 different backup versions (see test 4 below).

Test 2: restore a 2GB VM disk image with daily incremental backups for 5 years

sh-3.2# time hb get -c /backup /Users/.../Documents/Virtual\ Machines.localized/centos.vmwarevm/centos-s002.vmdk
HashBackup build #1496 Copyright 2009-2016 HashBackup, LLC
Backup directory: /backup
Most recent backup version: 1746
Restoring most recent version
Using destinations in dest.conf

Restoring centos-s002.vmdk to /test
/test/centos-s002.vmdk
Restored /Users/.../Documents/Virtual Machines.localized/centos.vmwarevm/centos-s002.vmdk to /test/centos-s002.vmdk
No errors

real    3m26.150s
user    1m46.945s
sys     0m11.811s

sh-3.2# ls -l centos5532-s002.vmdk
-rw-------@ 1 jim  staff  2086993920 Apr  8 02:13 centos5532-s002.vmdk

For this test, a 2GB virtual machine (VM) disk image file was restored. The file was originally saved August 2011 with daily incremental backups since then for the last 5 years. The VM stays running all the time so the disk image is backed up nearly every day. This was verified with the backup logs. This VM image is saved with a small 4K block size to maximize dedup, but that is also the most challenging to restore: the file has 509,520 backup blocks scattered over a 5 year backup period. HashBackup restored the file in 3 minutes, 26 seconds.

Test 3: restore a 498MB zip archive from a backup 3 years ago

# time hb get -c /backup /Users/.../tapes/prime.zip
HashBackup build #1496 Copyright 2009-2016 HashBackup, LLC
Backup directory: /backup
Most recent backup version: 1746
Restoring most recent version
Using destinations in dest.conf

Restoring prime.zip to /test
/test/prime.zip
Restored /Users/.../tapes/prime.zip to /test/prime.zip
No errors

real    0m14.044s
user    0m13.496s
sys     0m1.732s

sh-3.2# ls -l prime.zip
-rw-r--r--  1 jim  staff  498219045 May 19  2013 prime.zip

The 498MB zip archive was restored in 14 seconds, from backup #715 on May 19, 2013 - at an average rate of 35.6 MB/sec. This file restored at a faster rate because it was contained in a single backup version and did not change after that, unlike the VM disk image.

Test 4: back up and restore the same drive as test #1 from an initial backup

sh-3.2# time hb backup -c /data/test -v1 -D1g /
HashBackup build #1496 Copyright 2009-2016 HashBackup, LLC
Backup directory: /Volumes/HD2/test
Copied HB program to /Volumes/HD2/test/hb#1496
This is backup version: 0
Dedup enabled, 0% of current, 0% of max
Backing up: /
Mount point contents skipped: /Network/Servers
Mount point contents skipped: /dev
Mount point contents skipped: /home
Mount point contents skipped: /net

Time: 5345.8s, 1h 29m 5s
Checked: 579014 paths, 86520028440 bytes, 86 GB
Saved: 578956 paths, 86326260319 bytes, 86 GB
Excluded: 32
Dupbytes: 20540224433, 20 GB, 23%
Compression: 49%, 2.0:1
Space: 43 GB, 43 GB total
No errors

real    89m6.195s
user    100m2.367s
sys    12m2.822s

sh-3.2# time hb get -c /data/test -v1 /
HashBackup build #1496 Copyright 2009-2016 HashBackup, LLC
Backup directory: /data/test
Most recent backup version: 0
Restoring most recent version
Begin restore

Restoring / to /test
Restore? yes
Restored / to /test
No errors

real    63m49.108s
user    53m11.719s
sys    10m29.799s

This is like test #1, restoring 86GB in 578,956 files, but the restore is from a single-version backup, similar to restoring from a traditional "full" backup. The restore is faster, 64 minutes vs 76 minutes, because data is all stored in one version rather than the 1746 incremental versions in test #1. But considering the extreme backup space savings, test #1 still has good restore performance.

This initial backup used 43GB of backup space. The backup for test #1, covering a 6-year period of daily backups for the same drive, uses about 100GB of space. The test #1 backup has a retention policy of -s30d12m, meaning "keep the last 30 days of backups, plus 1 monthly backup for each of the last 12 months", so around 42 versions are being retained. The "forever incremental" strategy has greatly reduced the amount of backup space required: 100GB is less than the 129GB that just 3 full backups of 43GB each would need.