
Selftest

Checks all backup data to ensure that it is consistent. 

  $ hb selftest [-c backupdir] [-v level] [--logid n] [--debug] [--fix]

This does not compare backup data to its original source in the filesystem, but instead does a self-contained integrity check of the backup data.  Several levels of testing can be performed:

    -v0  verify the main database is readable
    -v1  check database, don't read file blocks or archives
    -v2  check database, traverse file blocks, don't read archives
    -v3  v2 + read all archive blocks once
    -v4  v3 + decrypt and decompress all data, verify all block hashes
    -v5  v4 + verify all file hashes
    -v9  the default testing level, equivalent to -v5

For example, selftest can be used after a backup directory is copied to another file system or recovered from an offsite server to ensure that the transfer was successful and all backup data is intact.
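
For scripting such a post-transfer check, here is a minimal Python sketch that assembles a selftest command line from the options in the synopsis above.  The helper name `selftest_command` and the example path are hypothetical, and the sketch only builds the argument list; how `hb` reports failures to a calling script is not covered on this page, so running the command and interpreting its result is left to the caller.

```python
from typing import List, Optional

# Levels listed in the table above.
VALID_LEVELS = {0, 1, 2, 3, 4, 5, 9}


def selftest_command(backupdir: Optional[str] = None,
                     level: Optional[int] = None,
                     fix: bool = False) -> List[str]:
    """Build an argument list for `hb selftest` (hypothetical helper)."""
    cmd = ["hb", "selftest"]
    if backupdir is not None:
        cmd += ["-c", backupdir]
    if level is not None:
        if level not in VALID_LEVELS:
            raise ValueError(f"unsupported -v level: {level}")
        cmd.append(f"-v{level}")
    if fix:
        cmd.append("--fix")
    return cmd


# Example: check a copied backup directory (the path is made up).
print(" ".join(selftest_command("/mnt/offsite/backup", level=4)))
```

The resulting list can be passed to `subprocess.run` if the check should be launched from Python.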

Another use for selftest is to verify files after a backup with dedup.  With a full selftest, all files saved in each version are completely reconstructed from the backup data, a new checksum is computed, and selftest ensures it matches the checksum recorded when the files were saved.  This is like a restore of every file in your backup, without writing files to disk.  This selftest will take a while, especially for large backups, but is added reassurance that your backup is stored correctly.
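
As a rough illustration of that full check, the Python sketch below rebuilds a file from stored blocks, verifies each block hash, and compares the whole-file hash with the one recorded when the file was saved.  It is not HashBackup's implementation or storage format: the in-memory dictionary store, the 4-byte block size, the use of SHA-256, and all function names are assumptions made for the example.

```python
import hashlib
from typing import Dict, List


def save_file(data: bytes, store: Dict[str, bytes], block_size: int = 4) -> dict:
    """Split data into blocks, store each block once (dedup), record hashes."""
    block_ids: List[str] = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        block_id = hashlib.sha256(block).hexdigest()  # assumed hash function
        store.setdefault(block_id, block)             # identical blocks stored once
        block_ids.append(block_id)
    return {"blocks": block_ids, "file_hash": hashlib.sha256(data).hexdigest()}


def verify_file(record: dict, store: Dict[str, bytes]) -> bool:
    """Rebuild the file in memory only and re-check block and file hashes."""
    file_hasher = hashlib.sha256()
    for block_id in record["blocks"]:
        block = store.get(block_id)
        if block is None or hashlib.sha256(block).hexdigest() != block_id:
            return False                              # missing or corrupt block
        file_hasher.update(block)
    return file_hasher.hexdigest() == record["file_hash"]


store: Dict[str, bytes] = {}
record = save_file(b"the same data, the same data", store)
print(verify_file(record, store))  # True
```

Nothing is written to disk during verification, which mirrors the "restore without writing files" description above.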



Recommended -v levels

  • -v2 is useful as a quick check of the main HashBackup database.  Since it does not read archive files, there is much less disk activity required.
  • -v3 is useful to verify that your archive files contain all of the data they are supposed to contain.  Archive data blocks are only read once, so there is less disk activity than a full selftest; data is neither decompressed nor decrypted, so there is less CPU activity than a full selftest
  • -v4 also processes each block only once, but decompresses, decrypts, and verifies the block hash for each block
  • -v9 (or -v omitted) is a full selftest that can be used to verify your backup either periodically, after an unusual event, or after an especially important backup.  This can be quite time-consuming since every version of every file is decrypted, decompressed, and the checksums for all blocks and all files are re-computed and verified.  Backups with a lot of dedup will take longer to verify because each dedup block has to be processed once for every time it is referenced.
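
To make the dedup cost concrete, the Python sketch below counts how many blocks are processed when each block is handled once (as with -v3 and -v4) versus once per reference (as in a full selftest).  The file names and block ids are invented for the example, and the counting is only a model of the behavior described above, not a measurement of HashBackup.

```python
from typing import Dict, List, Tuple


def block_work(files: Dict[str, List[str]]) -> Tuple[int, int]:
    """Return (blocks processed once each, blocks processed per reference)."""
    references = [block for blocks in files.values() for block in blocks]
    return len(set(references)), len(references)


# Three versions of a hypothetical VM image that share most of their blocks.
files = {
    "vm.img version 1": ["a", "b", "c"],
    "vm.img version 2": ["a", "b", "d"],
    "vm.img version 3": ["a", "b", "e"],
}
once_each, per_reference = block_work(files)
print(once_each, per_reference)  # 5 9
```

The more heavily versions share blocks, the wider the gap between the two counts becomes.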

Debug and Corrections

The --debug option will display the logid and pathname for each file as it is verified.  The logid is an internal identification for a unique file and version number.  This option is not normally used.

The --logid option is used to run selftest on a single file.  This option is not normally used.

The --fix option will correct minor errors that selftest may detect.  These errors will not normally occur, but there have been times where an unusual bug has caused a selftest error; using this option may fix it.  Corrections occur with -v2 or higher.  With this option, selftest will not stop after 100 errors as it usually does.


Errors

The selftest command is designed to give a high level of confidence that the backup data is correct.  If selftest errors occur, it could be because of a software defect, an unanticipated usage of the software (also a software defect), or in unusual cases, a hardware problem.  Problems with software are usually easy to correct in a new software release.  Problems with hardware are another matter.  Because HashBackup runs nearly all of its data through secure checksums, it can often verify data correctness; most programs do not.  For example, if you copy a file with the cp command, the source file is read into memory then written to a new file.  If a memory parity error occurs and a bit is flipped, the new copy will not be identical, but there is no checking to detect this error and no error will occur, assuming you don't have error-correcting memory.
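
The cp example can be simulated with a short Python sketch.  A single flipped bit goes unnoticed by a plain copy, while a copy that compares checksums catches it.  The functions, the injected fault, and the choice of SHA-256 are assumptions for illustration; in this simplified model the checksum is computed from the intact source before the fault is injected.

```python
import hashlib
from typing import Optional


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def naive_copy(data: bytes, fault_bit: Optional[int] = None) -> bytes:
    """Model of a plain copy: read into memory, write out, no checking."""
    in_memory = data
    if fault_bit is not None:
        in_memory = flip_bit(in_memory, fault_bit)  # simulated RAM error
    return in_memory


def checked_copy(data: bytes, fault_bit: Optional[int] = None) -> bytes:
    """Same copy, but the result is compared against the source checksum."""
    expected = hashlib.sha256(data).hexdigest()
    copy = naive_copy(data, fault_bit)
    if hashlib.sha256(copy).hexdigest() != expected:
        raise ValueError("copy does not match source checksum")
    return copy


source = b"backup data"
damaged = naive_copy(source, fault_bit=3)
print(damaged == source)  # False, yet naive_copy reported no error
```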

HashBackup has been used since 2010 to back up a Mac Mini Server used for development.  There are several VMs running on this machine for testing, as well as 7 copies of a Prime minicomputer emulator.  Recently (2012), I ran a selftest and saw these errors:

    Verifying file 484900 - 55 GB verified
    Error: unable to get block 10444925 for logid 527943: /Users/jim/Documents/Virtual Machines.localized/bsd732.vmwarevm/bsd732-s004.vmdk version 11: getblock: blockid 10444925 hash mismatch
    Verifying file 485000 - 65 GB verified
    ...
    Verifying blocks
    Error: extra blockid 0 in dedup table
    Error: dedup blockid 10444925 hash mismatch
    Verified blocks with 2 errors
    3 errors

After a lot of head scratching and debugging, I found the culprit: I had marginally bad RAM.  A bit had been flipped in memory, two different times, causing these 3 errors.  It seemed incredibly unlikely since this machine has been running fine for years, yet everything pointed to this as the cause.

Doing a little research, I found a paper on the subject of memory errors: "DRAM Errors in the Wild: A Large-Scale Field Study".  This paper analyzes memory errors in Google's server farm and states in its conclusions:

"Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported.  About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year."

On server-class hardware using ECC (error-correcting) memory, a parity error, especially a correctable one, is not so bad.  On typical business-class hardware, which often does not even support ECC, a memory error means a bit getting flipped.  If you are lucky, it will crash the machine.  I wasn't lucky, and it corrupted my backup.  Because of some redundancy in HashBackup, I was able to find the flipped bit and correct it, and selftest was fine again.  Even if I hadn't been able to find the error, it was isolated to one block of one file, so removing that file would have been an alternative solution.

After reading this paper, I now believe that hard drive "silent corruption" or "bit rot" problems, where bad data is returned without I/O errors, are actually RAM errors.  There are ECC codes on every hard drive sector, so I never understood how bits could get changed without triggering an I/O error.  This example shows that if the bit is flipped in RAM, before it ever reaches the hard drive, the end result appears to be "silent disk corruption", though the real culprit is bad RAM.