
Selftest

Checks backup data to ensure that it is consistent. 

  $ hb selftest [-c backupdir] [-v level] [-r rev] [--inc incspec] [path or arc names] ...
                [--debug] [--fix]


This does not compare backup data to its original source in the filesystem, but instead does an integrity check of the backup data. 

For example, selftest can be used after a backup directory is copied to another file system or recovered from an offsite server to ensure that the transfer was successful and all backup data is intact.

Another use for selftest is to verify files after a backup with dedup.  With -v5, files are completely reconstructed from the backup data, a new checksum is computed, and selftest ensures it matches the checksum recorded when the files were saved.  This is like a restore of every file in your backup, without writing files to disk.  This selftest will take a while, especially for large backups, but provides added reassurance that your backup is stored correctly.
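
For example, a complete -v5 verification of a backup stored in /mnt/hbbackup might look like this (the backup directory path is only an illustration):

  $ hb selftest -c /mnt/hbbackup -v5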

Several levels of progressively intense testing can be performed with -v:

    -v0  check that the main database is readable
    -v1  check low-level database integrity (B-tree structure, etc.)
    -v2  check HB data structures within the database (default level)
    -v3  check local archive files
    -v4  check local + remote archive files
    -v5  verify user file hashes by simulating a restore of each file
   
If you are using cache-size-limit and not storing all archive files locally, -v2 is a useful selftest level.  It will test the database without downloading any archive files from remote destinations.


Recommended -v levels

  • -v2 is a quick check of the main HashBackup database.  Since it does not read archive files, there is much less disk activity and no downloading is required.  This is the default check level.
  • -v3 checks that data in your local archive files is correct.  All local archive data blocks are read, decrypted, decompressed, and a SHA1 hash computed.  This is compared with the SHA1 hash stored when the backup was created.
  • -v4 is like -v3, but downloads all remote archive files and checks them too.  If any blocks are bad, a good copy from another destination will be substituted if available.
  • -v5 simulates a restore of user files by calculating a file hash for each user file and comparing it to the hash stored during the backup.  This can be time-consuming since every version of every file is decrypted, decompressed, and the hashes for all blocks and all files are re-computed and verified.  Backups with a lot of dedup will take longer to verify because each dedup block has to be processed every time it is referenced to generate and check the file hash.
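
As a rough sketch, the levels above might be run like this; the backup directory path is an illustration, and run times grow quickly at the higher levels:

  $ hb selftest -c /mnt/hbbackup        # default -v2: database checks only
  $ hb selftest -c /mnt/hbbackup -v3    # also read and verify local arc files
  $ hb selftest -c /mnt/hbbackup -v4    # also download and verify remote arc files
  $ hb selftest -c /mnt/hbbackup -v5    # simulate a restore of every file version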

Partial Checking

-r rev           restricts checking to a specific backup version

arc name         an arc filename like arc.0.0 is used to check specific arc files with -v4

pathname         pathnames can be used to check specific files and directories with -v5

--inc incspec    allows incremental checking that may span a long period

Incremental checking, for example, --inc 1d/30d, means that selftest is typically run every day, perhaps via a cron job, and the entire backup should be checked over a 30-day period.  Each day, 1/30th of the backup is checked.  The -v3, -v4, and -v5 options control the check level, and each level has its own incremental check schedule.  For huge backups, it may be necessary to spread checking out over a quarter or even longer.  The schedule can be changed at any time by using a different time specification.

You may also specify a limit with incremental checking.  For example, --inc 1d/30d,500MB means to limit the check to 500MB of backup data.  This is useful for -v4 tests with cache-size-limit set, since archives may have to be downloaded.  Many storage services have free daily download allowances, but charge for exceeding them.  Adding a limit ensures that an incremental selftest doesn't go over the free allowance.  The limit is always honored, even if it causes a complete cycle to take longer than specified.
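
Putting these together, some example partial checks follow; the backup directory, version number, arc file name, and pathname are illustrations only:

  $ hb selftest -c /mnt/hbbackup -r 5                    # check only backup version 5
  $ hb selftest -c /mnt/hbbackup -v4 arc.0.0             # check one specific arc file
  $ hb selftest -c /mnt/hbbackup -v5 /home/jim/docs      # verify specific paths
  $ hb selftest -c /mnt/hbbackup -v4 --inc 1d/30d,500MB  # daily run, 30-day cycle, 500MB cap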


Debug and Corrections

The --debug option will display the logid and pathname for each file as it is verified.  The logid is an internal identifier for a unique file and version.  This option is not normally needed.

The --fix option will correct errors that selftest may detect.  These errors will not normally occur, but there have been times when a bug or other problem has caused a selftest error; using this option may fix it.  Corrections occur with -v2 or higher.
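
For example, if a -v2 selftest reports errors, a run like the following (the backup directory path is illustrative) attempts to correct them, and --debug can be added to show each file as it is verified:

  $ hb selftest -c /mnt/hbbackup -v2 --fix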


Errors

The selftest command is designed to give a high level of confidence that the backup data is correct.  If selftest errors occur, it could be because of a software defect, an unanticipated usage of the software (also a software defect), or in unusual cases, a hardware problem.  Problems with software are usually easy to correct in a new software release.  Problems with hardware are another matter.  Because HashBackup runs nearly all of its data through secure checksums, it can often verify data correctness; most programs cannot.  For example, if you copy a file with the cp command, the source file is read into memory and then written to a new file.  If a memory error occurs and a bit is flipped, the new copy will not be identical, but unless you have error-correcting (ECC) memory, there is no checking to detect this and no error will be reported.

HashBackup has been used since 2010 to back up a Mac Mini Server used for development.  There are several VMs running on this machine for testing, as well as 7 copies of a Prime minicomputer emulator.  In 2012, a selftest showed these errors:

Verifying file 484900 - 55 GB verified
Error: unable to get block 10444925 for logid 527943: /Users/jim/Documents/Virtual Machines.localized/bsd732.vmwarevm/bsd732-s004.vmdk version 11: getblock: blockid 10444925 hash mismatch
Verifying file 485000 - 65 GB verified
...
Verifying blocks
Error: extra blockid 0 in dedup table
Error: dedup blockid 10444925 hash mismatch
Verified blocks with 2 errors
3 errors

After a lot of head scratching and debugging, the culprit was found: marginally bad RAM.  A bit had been flipped in memory (the same bit) two different times, causing these 3 errors.  It seemed incredibly unlikely since this machine had been running fine for years, yet everything pointed to this as the cause.

A little research turned up a paper on the subject of memory errors: DRAM Errors in the Wild: A Large-Scale Field Study.  This paper describes an analysis of memory errors in Google's server farm and, in its conclusions, states:

"Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported.  About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year."

On server-class hardware using ECC (error-correcting) memory, a memory error, especially a correctable one, is not so bad.  On typical business-class hardware, which often does not even support ECC, a memory error means a bit gets flipped.  If you are lucky, the machine will crash.  In this case, it corrupted a backup.  Because of redundancy in HashBackup, it was possible to determine the bit that was flipped, correct it, and selftest was fine again.  Even if it hadn't been possible to find the error, it was isolated to one block of one file, so removing that file would have been an alternate solution.

After reading this paper, it seems that hard drive "silent corruption" or "bit rot" problems, where bad data is returned without I/O errors, are more likely caused by RAM errors.  There are ECC codes on every hard drive sector, so it seems very unlikely that bits could get changed without triggering an I/O error.  This example shows that if a bit is flipped in RAM before it ever reaches the hard drive, the end result appears to be "silent disk corruption", though the real culprit is bad RAM.

Verifying with selftest and mount

There are several ways to verify your backup, and each has trade-offs.  Here are three common approaches:

1. selftest -v4:

- if cache-size-limit is set:
  - doesn't prefetch arc files (today, but could in the future)
  - downloads and deletes 1 arc file at a time
  - doesn't use much cache space
- checks that all data is stored correctly on the remote
- you trust HB to put the file together correctly
- faster than -v5


2. selftest -v5:

- if cache-size-limit is set:
  - prefetches arc files
  - has to download and keep many arc files because of dedup
  - deletes arc files from cache when no longer needed for selftest
  - could use a lot of cache space, but less than mount
- checks that all data is stored correctly on the remote
- reconstructs original file and verifies SHA1 hash for each file
- slower than -v4 because every file of every backup is reconstructed

3. mount + diff or rsync -a:
- if cache-size-limit is set:
  - doesn't prefetch, and never will
  - keeps every arc file ever downloaded until unmount
  - uses the most cache space if every file is referenced
- checks that all data is stored correctly on the remote *if every data block is referenced*
- have to compare with an original to verify files, or compute your own file hash
- slower than -v4 and -v5
- easy to verify just a few files
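
As a sketch of the third method, a backup can be mounted and compared against the live filesystem with rsync.  The mount point, backup directory, and the directory layout under the mount point are assumptions here; see the mount command's documentation for its exact syntax and layout.  The rsync --checksum option forces a content comparison rather than a date/size comparison, and --dry-run prevents any copying:

  $ hb mount -c /mnt/hbbackup /mnt/hbmount
  $ rsync -a --checksum --dry-run --itemize-changes /home/ /mnt/hbmount/latest/home/
  $ umount /mnt/hbmount    # unmount when finished (fusermount -u on some systems)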