Selftest

Checks backup data to ensure that it is consistent.

$ hb selftest [-c backupdir] [-v level] [-r rev] [--sample n] [--inc incspec]
              [--debug] [--fix] [path, arc names, block ids] ...

This does not compare backup data to its original source in the filesystem, but instead does an integrity check of the backup data. For example, selftest can be used after a backup directory is copied to another file system or recovered from an offsite server to ensure that the transfer was successful and all backup data is intact. Selftest is commonly used after each backup with -v2 (check database only) or with -v4 and --inc to incrementally download and check backup data over a long period of time.

Several levels of progressively intense testing can be performed with -v:

-v0 check that the main database is readable
-v1 check low-level database integrity (B-tree structure, etc.)
-v2 check HB data structures within the database (default level)
-v3 check local archive files
-v4 check local + remote archive files
-v5 verify user file hashes by simulating a restore of each file

If cache-size-limit is set, some archive files may not be stored locally and -v2 is a useful selftest level. It will test the database without downloading any files.
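
For example, a quick database-only check at the default level looks like this (the /hb backup directory is only an illustration; omit -c to use your default backup directory):

$ hb selftest -c /hb -v2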

Recommended -v Levels

  • -v2 is a quick check of the main database. It does not read archive files, so there is much less disk activity and no downloading is required. This is the default check level.

  • -v3 is like -v2 but also checks local archive files. It does not download any data. All local archive data blocks are read, decrypted, decompressed, and a SHA1 hash computed. This SHA1 is compared with the one stored when the backup was created.

  • -v4 is like -v3, but downloads all remote archive files and checks them too. If any blocks are bad, a good copy from another destination will be substituted if available. This is the most practical data verification level because each block of backup data is only verified once.

  • -v5 simulates a restore of user files by calculating a file hash for each user file and comparing it to the hash stored during the backup. This can be time-consuming since every version of every file is decrypted, decompressed, and the hashes for all blocks and all files are re-computed and verified. Backups with a lot of dedup will take longer to verify because each dedup block has to be processed every time it is referenced to generate and check the file hash.
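
For example, the two most thorough checks described above could be run as follows (again, /hb is only an illustrative backup directory):

$ hb selftest -c /hb -v4     # verify every local and remote archive block once
$ hb selftest -c /hb -v5     # simulate a restore of every file version and verify file hashes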

Partial Checking

-r rev checks a specific backup version

arc name ... checks specific arc files like arc.0.0 with -v2 to -v4

pathname ... checks specific files and directories with either -v2 or -v5

block id ... looks up a block id and tests the arc file containing that block

--inc incspec incremental checking that may span a long time period

--sample n tests n random samples from arc files with -v3 or -v4
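
For example (the backup directory, version number, pathname, and block id below are all illustrative; arc.0.0 follows the arc file naming shown above):

$ hb selftest -c /hb -r 120                # check only backup version 120
$ hb selftest -c /hb -v3 arc.0.0           # check one local arc file
$ hb selftest -c /hb -v5 /home/jim/docs    # re-verify file hashes under one directory
$ hb selftest -c /hb 123456                # test the arc file containing block id 123456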

Incremental checking, specified for example as --inc 1d/30d, means that selftest is run every day (the 1d part), perhaps via a cron job, and the entire backup is checked over a 30-day period (the 30d part). Each day, 1/30th of the backup is checked. The -v3, -v4, and -v5 options control the check level, and each level has its own incremental check schedule. For huge backups, it may be necessary to spread checking over a quarter or even longer. The schedule can be changed at any time by changing the time specification.

You may also specify a download limit with -v4 incremental checking. For example, --inc 1d/30d,500MB means to limit the check to 500MB of downloaded backup data. Many storage services have free allowances for downloads but charge for going over the limit. Adding a download limit ensures that incremental selftest doesn’t go over the free allowance. The download limit is always honored, even if it causes a complete cycle to take longer than specified.
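
As a sketch, a daily cron entry implementing this kind of schedule might look like the following (the 3 a.m. run time, /hb backup directory, and log file path are arbitrary choices, and hb is assumed to be on cron's PATH):

0 3 * * * hb selftest -c /hb -v4 --inc 1d/30d,500MB >>/var/log/hb-selftest.log 2>&1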

The --sample n option will download and test n random samples from each arc file instead of downloading and testing the entire arc file. It can be used with -v3 or -v4 and is useful with large backups that would take too long or cost too much to download completely. This is a better test than the dest verify command because some data is actually downloaded and verified with a strong hash to make sure it is available. Sampling only works on destinations that support selective downloads; if --sample is used on a backup with destinations that don't support selective downloads, like rsync, those destinations are not sampled. A reasonable sample size is 1-5, though any sample size is supported. If the sample size is 2 or more, the last block is always sampled to catch truncation errors. The --sample option cannot be used with --fix. If a problem is detected in an arc file with --sample, run selftest again with --fix and include the bad arc file name on the command line. This will do a full selftest and correction of the file listed.
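
For example, to spot-check 2 random blocks from each arc file, then fully recheck and repair a single arc file that reported a problem (arc.0.0 is only an example name, and as noted below, it is best to contact support before using --fix):

$ hb selftest -c /hb -v4 --sample 2
$ hb selftest -c /hb -v4 --fix arc.0.0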

Debug and Corrections

The --debug option displays the logid and pathname for each file as it is verified. The logid is an internal identifier for a unique file and version. This option is not normally needed but may be requested to help diagnose technical problems.
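
If --debug output is requested, it can be added to whatever selftest command is already being run, for example (illustrative backup directory):

$ hb selftest -c /hb -v5 --debug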

The --fix option can correct some errors that selftest detects, but not all. Errors will not normally occur, but there have been times when a bug or other problem caused a selftest error; using this option may fix it. Corrections occur with -v2 or higher.

Selftest errors almost always indicate some kind of program bug, so rather than using the --fix option immediately, it’s best to send a tech support email with the selftest output. Selftest errors are sometimes caused by a bug in selftest, so it could be that there is nothing wrong with the backup. Or, the error could be fixable with a new release rather than letting selftest --fix delete files with errors.

Errors

The selftest command is designed to give a high level of confidence that the backup data is correct. If selftest errors occur, it could be because of a software defect, an unanticipated usage of the software, or in unusual cases, a hardware problem. Problems with software are usually easy to correct in a new software release. Problems with hardware are another matter. Because HashBackup runs nearly all of its data through secure checksums, it can often verify data correctness, whereas most programs cannot. For example, if you copy a file with the Unix cp command, the source file is read into memory and then written to a new file. If a memory error occurs during the copy and a bit is flipped, the new copy will not be identical to the original, but nothing checks for this, and unless you have error-correcting (ECC) memory, no error will be reported.

HashBackup has been used since 2010 to back up a Mac Mini Server used for development. Several VMs run on this machine for testing, as well as 7 copies of a Prime minicomputer emulator. In 2012, a selftest showed these errors:

Verifying file 484900 - 55 GB verified
Error: unable to get block 10444925 for logid 527943: /Users/jim/Documents/Virtual Machines.localized/bsd732.vmwarevm/bsd732-s004.vmdk version 11: getblock: blockid 10444925 hash mismatch
Verifying file 485000 - 65 GB verified
...
Verifying blocks
Error: extra blockid 0 in dedup table
Error: dedup blockid 10444925 hash mismatch
Verified blocks with 2 errors
3 errors

After a lot of head scratching and debugging, the culprit was found: marginally bad RAM. A bit had been flipped in memory (the same bit), two different times, causing these 3 errors. It seemed incredibly unlikely since this machine had been running fine for years, yet everything pointed to this as the cause.

A little research turned up a Google paper on the subject of memory errors: DRAM Errors in the Wild: A Large-Scale Field Study. This paper describes data analysis of memory errors in Google’s server farm, and in the conclusions, states:

Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year.
— Google

On server-class hardware using ECC (error-correcting) memory, a parity error, especially a correctable one, is not so bad. On typical business-class hardware, which often does not even support ECC, a memory error means a bit gets flipped. If you are lucky, the machine will crash. In this case, it corrupted a backup. Because of redundancy in HashBackup, it was possible to determine the bit that was flipped and correct it, and selftest was clean again. Even if it hadn't been possible to find the error, it was isolated to one block of one file, so removing that file would have been an alternate solution. The RAM was replaced and no errors like this have occurred since.

After reading this paper, it seems that hard drive "silent corruption" or "bit rot" problems, where bad data is returned without I/O errors, are more likely to be RAM errors. There are ECC codes on every hard drive sector, so it seems unlikely that bits could change without triggering an I/O error. This example shows that if a bit is flipped in RAM, either before it ever reaches the hard drive on a write or after it is read back from the drive, the end result appears to be "silent disk corruption", though the real culprit is bad RAM.

Another research paper from the University of Wisconsin (link downloads a PDF) details disk and memory fault injection with ZFS, a filesystem that is designed to handle failure. The conclusion is that ZFS handles disk faults well, but like most software, does not work well in the presence of memory faults.

Verifying Backups With Selftest and Mount

There are several ways to verify your backup, each with trade-offs. Here are three common methods:

1. selftest -v4

  • if cache-size-limit is set:

    • doesn’t prefetch arc files (today, but could in the future)

    • downloads and deletes 1 arc file at a time

    • doesn’t use much cache space

  • checks that all data is stored correctly on the remote

  • you trust HB to put the file together correctly at restore time

  • faster than -v5

2. selftest -v5

  • if cache-size-limit is set:

    • prefetches arc files

    • has to download and cache many arc files because of dedup

    • deletes arc files from cache when no longer needed for selftest

    • could use a lot of cache space, but less than mount

  • checks that all data is stored correctly on the remote

  • reconstructs original file and verifies SHA1 hash for each file

  • slower than -v4 because every file of every backup is reconstructed

3. mount + diff or rsync -a (see the sketch after this list)

  • if cache-size-limit is set:

    • doesn’t prefetch

    • keeps every arc file downloaded until unmount

    • uses the most cache space if every file is referenced

  • checks that all data is stored correctly on the remote if every data block is accessed

  • you have to compare the mounted backup with the original files to verify them

  • slower than -v4 and -v5

  • easy to verify just a few files
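
A sketch of the third method on Linux, assuming an example backup directory /hb, an empty mount point /mnt/hbbackup, and that VERSION stands for whatever directory the mounted backup presents for the version being compared (see the mount command's documentation for its exact options and layout):

$ hb mount -c /hb /mnt/hbbackup
$ rsync -an --itemize-changes /home/jim/ /mnt/hbbackup/VERSION/home/jim/    # dry-run compare; lists differences
$ umount /mnt/hbbackup     # or fusermount -u /mnt/hbbackup when finished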