Checks all backup data to ensure that it is consistent.

$ hb selftest [-c backupdir] [-v level] [--logid n] [--debug] [--fix]

This does not compare backup data to its original source in the filesystem, but instead does a self-contained integrity check of the backup data. Several levels of testing can be performed:

-v0  verify the main database is readable
-v1  check database, don't read file blocks or archives
-v2  check database, traverse file blocks, don't read archives
-v3  v2 + read all archive blocks once
-v4  v3 + decrypt and decompress all data, verify all block hashes
-v5  v4 + verify all file hashes
-v9  the default testing level, equivalent to v5
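As a rough illustration of how the options above fit together, the sketch below wraps selftest in a small shell function. The function name check_backup is hypothetical and not part of HashBackup. It also assumes that hb returns a non-zero exit status when selftest finds errors, which this page does not state.

```shell
# Hypothetical wrapper, not part of HashBackup.
# Usage: check_backup BACKUPDIR [LEVEL]
check_backup() {
    backupdir=$1
    level=${2:-5}   # -v5 also verifies all file hashes (see the level list)

    # ASSUMPTION: hb exits non-zero when selftest finds errors.
    if hb selftest -c "$backupdir" -v"$level"; then
        echo "selftest -v$level passed: $backupdir"
    else
        echo "selftest -v$level reported errors: $backupdir" >&2
        return 1
    fi
}
```

Calling check_backup with a backup directory and 2 as the second argument would run the quicker -v2 check; omitting the level runs -v5.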
Selftest is useful, for example, after a backup directory is copied to another file system or recovered from an offsite server, to confirm that the transfer succeeded and all backup data is intact. Another use for selftest is to verify files after a backup with dedup.

With a full selftest, every file saved in each version is completely reconstructed from the backup data, a new checksum is computed, and selftest ensures it matches the checksum recorded when the file was saved. This is like restoring every file in your backup without writing any files to disk. A full selftest takes a while, especially for large backups, but gives added reassurance that your backup is stored correctly.

Recommended -v levels
Debug and Corrections

The --debug option will display the logid and pathname for each file as it is verified. The logid is an internal identifier for a unique file and version number. This option is not normally used.

The --logid option is used to run selftest on a single file. This option is not normally used.

The --fix option will correct minor errors that selftest may detect. These errors will not normally occur, but there have been times when an unusual bug has caused a selftest error; using this option may fix it. Corrections occur with -v2 or higher. With this option, selftest will not stop after 100 errors as it usually does.

Errors

The selftest command is designed to give a high level of confidence that the backup data is correct. If selftest errors occur, it could be because of a software defect, an unanticipated usage of the software (also a software defect), or in unusual cases, a hardware problem. Problems with software are usually easy to correct in a new software release. Problems with hardware are another matter.

Because HashBackup runs nearly all of its data through secure checksums, it can often verify data correctness; most programs do not. For example, if you copy a file with the cp command, the source file is read into memory and then written to a new file. If a memory parity error occurs and a bit is flipped, the new copy will not be identical to the source, but nothing checks for this and no error will be reported, assuming you don't have error-correcting memory.

HashBackup has been used since 2010 to back up a Mac Mini Server used for development. There are several VMs running on this machine for testing, as well as 7 copies of a Prime minicomputer emulator.
Recently (2012), I ran a selftest and saw these errors:

    Verifying file 484900 - 55 GB verified
    Error: unable to get block 10444925 for logid 527943: /Users/jim/Documents/Virtual Machines.localized/bsd732.vmwarevm/bsd732-s004.vmdk version 11: getblock: blockid 10444925 hash mismatch
    Verifying file 485000 - 65 GB verified
    ...
    Verifying blocks
    Error: extra blockid 0 in dedup table
    Error: dedup blockid 10444925 hash mismatch
    Verified blocks with 2 errors
    3 errors

After a lot of head scratching and debugging, I found the culprit: I had marginally bad RAM. A bit had been flipped in memory, two different times, causing these 3 errors. It seemed incredibly unlikely since this machine has been running fine for years, yet everything pointed to this as the cause.

Doing a little research, I found a paper on the subject of memory errors: DRAM Errors in the Wild: A Large-Scale Field Study. This paper describes data analysis of memory errors in Google's server farm, and in the conclusions, states:

"Conclusion 1: We found the incidence of memory errors and the range of error rates across different DIMMs to be much higher than previously reported. About a third of machines and over 8% of DIMMs in our fleet saw at least one correctable error per year."

On server-class hardware using ECC (error-correcting) memory, a parity error, especially a correctable one, is not so bad. On typical business-class hardware, which often does not even support ECC, a memory error means a bit gets flipped. If you are lucky, it will crash the machine. I wasn't lucky, and it corrupted my backup.

Because of some redundancy in HashBackup, I was lucky that I found the bit that was flipped, corrected it, and selftest was fine again. Even if I hadn't been able to find the error, it was isolated to one block of one file, so removing that file would have been an alternate solution.
After reading this paper, I now believe that hard drive "silent corruption" or "bit rot" problems, where bad data is returned without I/O errors, are actually RAM errors. There are ECC codes on every hard drive sector, so I never understood how bits could get changed without triggering an I/O error. This example shows that if the bit is flipped in RAM, before it ever reaches the hard drive, the end result appears to be "silent disk corruption", though the real culprit is bad RAM.
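The failure mode described above can be reproduced in miniature. The following is only a simulation, not how HashBackup works internally: the "bit flip" is injected deliberately with dd, the file names are made up, and cksum (a POSIX CRC, not a secure hash) stands in for the secure checksums HashBackup uses, purely for portability.

```shell
# Simulation only: all file names here are hypothetical.
workdir=$(mktemp -d)
printf 'important backup data\n' > "$workdir/source"

# Record a checksum at "save" time.
saved=$(cksum < "$workdir/source")

# A plain copy succeeds and reports no error.
cp "$workdir/source" "$workdir/copy"

# Mimic a RAM bit flip that happened before the data reached the disk:
# overwrite the first byte 'i' (0x69) with 'h' (0x68), a one-bit change.
printf 'h' | dd of="$workdir/copy" bs=1 seek=0 conv=notrunc 2>/dev/null

# Reading the copy raises no I/O error; only recomputing the checksum
# and comparing it with the saved one exposes the damage.
now=$(cksum < "$workdir/copy")
if [ "$saved" = "$now" ]; then
    result="copy verified"
else
    result="checksum mismatch: copy is corrupt"
fi
echo "$result"   # prints "checksum mismatch: copy is corrupt"

rm -rf "$workdir"
```

The disk stored exactly what it was given, so nothing at the I/O level complains. The mismatch only shows up because a checksum was recorded before the corruption and compared afterwards.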