Selftest

Checks backup data to ensure that it is consistent. 

  $ hb selftest [-c backupdir] [-v level] [-r rev] [--sample n] [--inc incspec]
                [--debug] [--fix] [path or arc names] ...


Selftest does not compare backup data to its original source in the filesystem; instead, it does an integrity check of the backup data itself.

For example, selftest can be used after a backup directory is copied to another file system or recovered from an offsite server to ensure that the transfer was successful and all backup data is intact.
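For example, to verify a copy of a backup directory, point -c at the copy (the source and destination paths below are just placeholders):

  $ rsync -a /mnt/backup/ /media/usb/backup/      # copy the backup directory
  $ hb selftest -c /media/usb/backup -v3          # check the copy, including its local arc files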

Another use for selftest is to verify files after a backup with dedup.  With -v5, files are completely reconstructed from the backup data, a new checksum is computed, and selftest ensures it matches the checksum recorded when the files were saved.  This is like a restore of every file in your backup, without writing files to disk.  This selftest will take a while, especially for large backups, but it provides added reassurance that your backup is stored correctly.
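For example, assuming the backup directory is /mnt/backup (substitute your own), a full -v5 verification is:

  $ hb selftest -c /mnt/backup -v5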

Several levels of progressively more intensive testing can be performed with -v:

    -v0  check that the main database is readable
    -v1  check low-level database integrity (B-tree structure, etc.)
    -v2  check HB data structures within the database (default level)
    -v3  check local archive files
    -v4  check local + remote archive files
    -v5  verify user file hashes by simulating a restore of each file
   
If cache-size-limit is set, some archive files may not be stored locally and -v2
is a useful selftest level.  It will test the database without downloading any files.
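For example, a quick database-only check (again using /mnt/backup as a placeholder backup directory):

  $ hb selftest -c /mnt/backup -v2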


Recommended -v levels

  • -v2 is a quick check of the main database.  It does not read archive files, so there is much less disk activity and no downloading is required.  This is the default check level.
  • -v3 checks local archive files.  All local archive data blocks are read, decrypted, decompressed, and a SHA1 hash computed.  This SHA1 is compared with the one stored when the backup was created.
  • -v4 is like -v3, but downloads all remote archive files and checks them too.  If any blocks are bad, a good copy from another destination will be substituted if available.
  • -v5 simulates a restore of user files by calculating a file hash for each user file and comparing it to the hash stored during the backup.  This can be time-consuming since every version of every file is decrypted, decompressed, and the hashes for all blocks and all files are re-computed and verified.  Backups with a lot of dedup will take longer to verify because each dedup block has to be processed every time it is referenced to generate and check the file hash.
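For example, the two archive-checking levels (backup directory is a placeholder):

  $ hb selftest -c /mnt/backup -v3      # check local arc files only
  $ hb selftest -c /mnt/backup -v4      # also download and check remote arc files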

Partial Checking

    -r rev         checks a specific backup version
    arc name       checks specific arc files like arc.0.0 with -v2 to -v4
    pathname       checks specific files and directories with -v2 or -v5
    --inc incspec  incremental checking that may span a long period
    --sample n     tests n random samples from arc files with -v3 or -v4
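For example (the version number, second arc file name, and pathname below are placeholders):

  $ hb selftest -c /mnt/backup -v4 -r 5                # check only backup version 5
  $ hb selftest -c /mnt/backup -v3 arc.0.0 arc.0.1     # check specific arc files
  $ hb selftest -c /mnt/backup -v5 /home/jim/docs      # verify specific files and directories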

Incremental checking with --inc 1d/30d, for example, means that selftest is run every day, perhaps via a cron job, and the entire backup is checked over a 30-day period: each day, 1/30th of the backup is checked.  The -v3, -v4, and -v5 options control the check level, and each level has its own incremental check schedule.  For huge backups, it may be necessary to spread checking out over a quarter or even longer.  The schedule can be changed at any time by using a different time specification.
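For example, a nightly cron entry might look like this (the time, hb install path, backup directory, and 30-day cycle are assumptions to adapt to your setup):

  # crontab entry: check 1/30th of the backup at -v4 every night at 2:30
  30 2 * * * /usr/local/bin/hb selftest -c /mnt/backup -v4 --inc 1d/30d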

You may also specify a download limit with incremental checking.  For example, --inc 1d/30d,500MB means to limit the check to 500MB of backup data.  This is useful with -v4 when cache-size-limit is set, because archives may have to be downloaded.  Many storage services have free daily download allowances but charge for exceeding them.  Adding a download limit ensures that incremental selftest doesn't go over the free allowance.  The download limit is always honored, even if it causes a complete cycle to take longer than specified.
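For example, to cap each daily incremental check at 500MB of downloads (backup directory is a placeholder):

  $ hb selftest -c /mnt/backup -v4 --inc 1d/30d,500MB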

The --sample n option will download and test n random samples from each arc file instead of downloading and testing the entire arc file.  This can be used with -v3 or -v4.  It is useful with large backups that would take too long or be too costly to download completely.  This is a better test than the dest verify command because some data is actually downloaded and verified with a strong hash to make sure it is available.  Sampling only works on destinations that support selective downloads: B2, S3 and compatibles, WebDAV, and FTP.  If the --sample option is used on a backup that has destinations that don't support selective downloads, like rsync, those destinations are not sampled.  A reasonable sample size is 1-5, though any sample size is supported.  The --sample option cannot be used with --fix.  If a problem is detected in an arc file with --sample, run selftest again with --fix and the arc file name on the command line.  This will do a full selftest and correction of the file listed.
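For example, a spot-check of 3 random samples per arc file, followed by a full re-check of one suspect arc file, might look like this (arc.0.0 is a placeholder name; see the --fix notes below before correcting anything):

  $ hb selftest -c /mnt/backup -v4 --sample 3       # spot-check 3 random samples per arc file
  $ hb selftest -c /mnt/backup -v4 --fix arc.0.0    # full check and correction of one arc file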

Debug and Corrections

The --debug option will display the logid and pathname for each file as it is verified.  The logid is an internal identifier for a unique file and version.  This option is not normally used.
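For example, to trace which file is being verified during a -v5 run (backup directory is a placeholder):

  $ hb selftest -c /mnt/backup -v5 --debug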

The --fix option can correct some errors that selftest may detect.  These errors will not normally occur, but there have been times when a bug or other problem has caused a selftest error; using this option may fix it.  Corrections occur with -v2 or higher.

IMPORTANT: Selftest errors almost always indicate some kind of program bug, so rather than using the --fix option immediately, it's best to send a tech support email with the selftest output.  Selftest errors are sometimes caused by a bug in selftest, so it could be that there is nothing wrong with the backup.  Or, the error could be fixable with a new release rather than letting selftest --fix delete files with errors.


Errors

The selftest command is designed to give a high level of confidence that the backup data is correct.  If selftest errors occur, it could be because of a software defect, an unanticipated usage of the software (also a software defect), or in unusual cases, a hardware problem.  Problems with software are usually easy to correct in a new software release.  Problems with hardware are another matter.  Because HashBackup runs nearly all of its data through secure checksums, it can often verify data correctness, whereas most programs cannot.  For example, if you copy a file with the cp command, the source file is read into memory and then written to a new file.  If a memory error flips a bit, the new copy will not be identical to the original, but there is no checking to detect this, and no error will be reported unless you have error-correcting (ECC) memory.

HashBackup has been used since 2010 to back up a Mac Mini Server used for development.  There are several VMs running on this machine for testing, as well as 7 copies of a Prime minicomputer emulator.  In 2012, a selftest showed these errors:

Verifying file 484900 - 55 GB verified
Error: unable to get block 10444925 for logid 527943: /Users/jim/Documents/Virtual Machines.localized/bsd732.vmwarevm/bsd732-s004.vmdk version 11: getblock: blockid 10444925 hash mismatch
Verifying file 485000 - 65 GB verified
...
Verifying blocks
Error: extra blockid 0 in dedup table
Error: dedup blockid 10444925 hash mismatch
Verified blocks with 2 errors
3 errors

After a lot of head scratching and debugging, the culprit was found: marginally bad RAM.  A bit had been flipped in memory (the same bit), two different times, causing these 3 errors.  It seemed incredibly unlikely since this machine had been running fine for years, yet everything pointed to this as the cause.

A little research turned up a paper on the subject of memory errors: DRAM Errors in the Wild: A Large-Scale Field Study.  This paper describes data analysis of memory errors in Google's server farm, and in the conclusions, states:

"Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported.
About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year.
"

On server-class hardware using ECC (error-correcting) memory, a memory error, especially a correctable one, is not so bad.  On typical business-class hardware, which often does not even support ECC, a memory error means a bit gets flipped silently.  If you are lucky, the machine will crash.  In this case, it corrupted a backup.  Because of redundancy in HashBackup, it was possible to determine the bit that was flipped, correct it, and selftest then passed.  Even if it hadn't been possible to find the error, it was isolated to one block of one file, so removing that file would have been an alternative solution.  The RAM was replaced and no errors like this have occurred again.

After reading this paper, it seems that hard drive "silent corruption" or "bit rot" problems, where bad data is returned without I/O errors, are more likely RAM errors.  There are ECC codes on every hard drive sector, so it seems very unlikely that bits could get changed without triggering an I/O error.  This example shows that if a bit is flipped in RAM before it ever reaches the hard drive, the end result appears to be "silent disk corruption", though the real culprit is bad RAM.

Verifying Backups With Selftest and Mount

There are several ways to verify your backup, each with trade-offs.  Here are three common approaches:

1. selftest -v4:

- if cache-size-limit is set:
  - doesn't prefetch arc files (today, but could in the future)
  - downloads and deletes 1 arc file at a time
  - doesn't use much cache space
- checks that all data is stored correctly on the remote
- you trust HB to put the file together correctly at restore time
- faster than -v5


2. selftest -v5:

- if cache-size-limit is set:
  - prefetches arc files
  - has to download and cache many arc files because of dedup
  - deletes arc files from cache when no longer needed for selftest
  - could use a lot of cache space, but less than mount
- checks that all data is stored correctly on the remote
- reconstructs original file and verifies SHA1 hash for each file
- slower than -v4 because every file of every backup is reconstructed

3. mount + diff or rsync -a:

- if cache-size-limit is set:
  - doesn't prefetch
  - keeps every arc file downloaded until unmount
  - uses the most cache space if every file is referenced
- checks that all data is stored correctly on the remote if every data block is accessed
- have to compare mount with the original to verify files
- slower than -v4 and -v5
- easy to verify just a few files
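A rough sketch of option 3, assuming a backup of /home/jim and a mount point at /mnt/hbmnt (both placeholders); run hb mount in a separate terminal if it stays in the foreground, and adjust the rsync path to match the directory layout you see under the mount point (version directories, a latest entry, etc.):

  $ hb mount -c /mnt/backup /mnt/hbmnt                  # mount the backup as a read-only filesystem
  $ rsync -anc /home/jim/ /mnt/hbmnt/latest/home/jim/   # dry-run, checksum-based compare with the original
  $ umount /mnt/hbmnt                                   # unmount when finished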