Security Details

HashBackup (HB) is a Unix backup program with these design goals:

  • easy to use

  • accurate backups and restores

  • space-efficient storage of backup data

  • bandwidth-efficient network transfers

  • local and/or remote backup storage

  • local backup storage for fast restores

  • remote backup storage for disaster recovery

  • remote storage using common protocols: ftp, ssh, rsync, S3, imap, etc.

  • backup data sent directly to user’s remote storage

  • ability to use "dumb"/passive remote storage

  • use of untrusted remote storage with client-side encryption and private keys

  • authentication of restored data

This note was originally written for a paid security assessment of HashBackup. It outlines HB’s security procedures in detail, with the goal of enabling others to perform their own security assessments.

Edit History

2021-12-01: periodic review, format for new site, minor edits
2020-06-04: periodic review, minor edits for clarity
2017-04-14: periodic review, minor changes for clarity
2016-03-10: updated text to match previously listed changes
2016-01-05: whole file hash changed to SHA1
2014-09-30: encrypted dest.db is copied directly to remotes
2013-10-24: Since Jan 2013, dest.conf can be stored in hb.db
2012-12-13: use /dev/random + /dev/urandom for keys because of long hangs on Linux VMs
2012-11-08: removed public CRC from arc files
2012-09-05: edits before release
2012-05-24: misc typos

Overview

HB reads a file system, or selected files, and saves file data, file metadata, and backup metadata to a local backup directory. Data is stored primarily in two places: an encrypted database named hb.db stores metadata, and archive files store encrypted file data.

During backups, HB breaks files into blocks, suppresses redundant blocks, compresses and encrypts blocks, and stores blocks in arc files. Archive files are named arc.v.n, where v is the backup number (version) and n is a sequence number within the backup. Archive files are 100MB by default, but this is configurable. So a backup of 500MB will create arc.0.0 - arc.0.4, each ~100MB. The next backup creates arc.1.0, etc. If remote destinations are configured in dest.conf (a text file), arc files are sent offsite to all remotes during the backup. Local arc files may be deleted during the backup if cache-size-limit is set, the cache becomes full, and the arc files have been sent to all destinations.
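The arc.v.n naming scheme above can be illustrated with a short sketch (the function name and ceiling arithmetic are illustrative, not HB’s actual code):

```python
def arc_names(version, backup_bytes, arc_size=100 * 2**20):
    """Return the arc.v.n file names a backup of backup_bytes would
    produce at the default 100MB archive size.
    (Illustrative sketch; not HB's actual code.)"""
    count = max(1, -(-backup_bytes // arc_size))  # ceiling division
    return ["arc.%d.%d" % (version, n) for n in range(count)]

# A 500MB backup in version 0 yields five ~100MB archives:
print(arc_names(0, 500 * 2**20))  # ['arc.0.0', 'arc.0.1', 'arc.0.2', 'arc.0.3', 'arc.0.4']
```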

File and backup metadata is stored in the hb.db database. File metadata includes Unix permission bits, ACLs, file dates and so on. Backup metadata includes cryptographic hashes for files & blocks, file & block relationships, and block & archive file relationships.

hb.db accumulates historical information for all backups: it knows which blocks are needed to reconstruct version 0 of a file, and which blocks are needed to reconstruct the most recent version. After a backup, the local backup directory has an updated hb.db database and incremental update files are sent to all remote destinations with changes to hb.db.

There are other utilities to remove files from the backup, list files, perform retention, mount a backup as a Unix filesystem, etc.

Security Overview

The security goal of HashBackup is secure remote storage of backup data to untrusted storage providers. Securing local backup data is not a primary goal: if someone has access to a computer, there are many avenues to monitor and attack it, from logging keystrokes to intercepting shared library calls, installing MITM device drivers, and so on.

Some data in HB’s local backup directory stays local, for example, the dedup table and the key in key.conf. Other data may be sent offsite, with the goal of being able to reconstruct the backup in a disaster situation: the computer is stolen, fire, flood, or the hard disk containing the local backup crashes.

The security goals then are to:

  • send encrypted backup data to remote storage

  • prevent original data access via remote storage without the key

  • retrieve the backup data

  • ensure the data was not modified while on remote storage

  • ensure that restored files are identical to the originals

Different sites will have different security needs. For example, "remote" storage may be a company-owned FTP server connected over a LAN. Insecure protocols like FTP are supported because the convenience they provide may be more important to a particular customer than the protocol’s security issues. HashBackup supports secure transport methods, but also supports insecure but convenient transport methods.

Remote Data

There are several types of data sent to remote storage:

  1. hb.db, an encrypted database, must be sent since it contains all metadata. It is potentially a large file. To avoid sending the whole file after every backup, hb.db is sent offsite as incrementals. The recover command downloads these incrementals to regenerate the original hb.db file if the local backup directory is lost, i.e., a disaster recovery situation.

  2. arc files (encrypted, compressed file data blocks) are sent offsite

  3. an encrypted database, dest.db, maintains lists of HB’s files that are stored on each destination. Its purpose is to avoid needing to "list" the files on each remote, because sometimes this is slow, difficult, or expensive. When a file is sent to or deleted from a remote, an entry is made in dest.db. dest.db is sent to every remote after each backup. If the local backup directory is lost, dest.db is fetched first, to tell HashBackup which other files need to be fetched and where they are located (which remote has them).

Important Local Files

Two files are never sent offsite, but contain important information that is needed to recover the backup directory from a remote site:

key.conf contains the main key. hb init creates this key, by default using the system random number generators (/dev/random & /dev/urandom) to get a 256-bit key. This is then used to derive the rest of the keys. The key.conf file is never sent offsite. The key may be protected with a passphrase. The permissions are set to read-only for the user running init when the key file is created.

dest.conf is a text file containing remote configuration information. For example, connecting to a remote FTP server requires a host name, port number, user id, and password. Each remote destination will have a section in dest.conf with parameters needed to connect to the remote. dest.conf is not encrypted, so remote account passwords and access tokens are in the clear. It is not possible to use hash codes for passwords in dest.conf, because passwords must be supplied to remote services for validation. Users are encouraged to put very restrictive system access permissions on dest.conf.

For more security, the "hb dest load" command can be used to store the dest.conf file into the encrypted hb.db file, and then the dest.conf file is deleted. For best security, hb.db should also be protected with an admin passphrase and/or main key passphrase to prevent a local user from recovering remote credentials.

Copies of the key.conf and dest.conf text files must be stored separately and/or printed, to be used for disaster recovery (loss of the local backup directory).

Random Numbers

HB uses random numbers to generate:

  • keys

  • AES initialization vectors (IV)

  • AES-CBC padding

To generate keys, HB first tries to read as much of the key as possible from /dev/random, with a non-blocking read. If this read is unable to return a complete key because the entropy pool is depleted, the rest of the key is read from /dev/urandom. While /dev/random is preferred for key generation, it often blocks for very long times on Linux VMs, especially right after startup, to the point of being unusable. On BSD machines, /dev/random and /dev/urandom are usually equivalent and never block.
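The read-from-/dev/random-then-top-up-from-/dev/urandom strategy described above might be sketched like this (a hypothetical illustration, not HB’s actual code):

```python
import os

def read_key_bytes(n):
    """Read as much of an n-byte key as possible from /dev/random
    with non-blocking reads, then fill the remainder from
    /dev/urandom. (Illustrative sketch of the fallback strategy
    described above; not HB's actual code.)"""
    key = b""
    try:
        fd = os.open("/dev/random", os.O_RDONLY | os.O_NONBLOCK)
        try:
            while len(key) < n:
                try:
                    chunk = os.read(fd, n - len(key))
                except BlockingIOError:   # entropy pool depleted
                    break
                if not chunk:
                    break
                key += chunk
        finally:
            os.close(fd)
    except OSError:
        pass  # no /dev/random; fall through to /dev/urandom
    if len(key) < n:
        key += os.urandom(n - len(key))
    return key

key = read_key_bytes(32)   # 256-bit key
```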

The other use of random data is for AES IVs and CBC padding. Up to 32 bytes may be required for each backup data block. This can require a significant amount of random data during backup operations, and might deplete the system’s entropy pools. /dev/urandom is a non-blocking version of /dev/random, but using it also might deplete the system’s entropy pools during backups, especially on Linux VMs.

To prevent system entropy pool depletion, HB uses AES-128 in OFB mode to generate cryptographically secure random numbers. The RNG is seeded with 48 bytes of data from /dev/urandom:

  • 16 bytes for the AES-128 key

  • 16 bytes for the AES-128 IV

  • 16 bytes for the plaintext data to encrypt

References:

Hellekalek, P. and Wegenkittl, S., "Empirical Evidence Concerning AES" (using AES as a CSPRNG). See section 4, Findings, first 3 paragraphs (RNG mode). https://dl.acm.org/doi/pdf/10.1145/945511.945515
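Under the scheme above, an AES-128-OFB generator can be sketched with the pyca/cryptography library (the class and method names are illustrative, not HB’s):

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

class AesOfbRng:
    """CSPRNG sketch: AES-128 in OFB mode, seeded with 48 bytes from
    /dev/urandom (16-byte key + 16-byte IV + 16-byte plaintext), as
    described above. Illustrative only; not HB's actual code."""
    def __init__(self):
        seed = os.urandom(48)
        key, iv, self._pt = seed[:16], seed[16:32], seed[32:48]
        self._enc = Cipher(algorithms.AES(key), modes.OFB(iv)).encryptor()

    def read(self, n):
        out = b""
        while len(out) < n:
            # In OFB mode the keystream does not depend on the plaintext,
            # so re-encrypting the same block yields a pseudorandom stream.
            out += self._enc.update(self._pt)
        return out[:n]

rng = AesOfbRng()
block_random = rng.read(32)   # e.g., 16-byte IV + up to 16 bytes of padding
```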

Procedures Walkthrough

The main sequence of events to use HB is:

  1. create a backup directory: hb init -c backupdir; only once

  2. make a series of backups: hb backup -c backupdir /home

  3. perform retention, remove files, list files, etc. (utilities) as needed

  4. rekey if needed

  5. restore files

  6. recover the backup from a remote (disaster recovery)

Init Walkthrough

For init command hb init -c backupdir:

  1. create backupdir if it doesn’t exist
    if it does exist, raise an error if it is not empty
    set protection to 700 (owner RWD, group none, others none)
    Here’s an example backupdir created by hb init:

    $ ls -ld hb
    drwx------  18 jim  staff  612 May 14 19:05 hb
  2. create key.conf:
    not used directly as an encryption key;
    used to derive other keys (see below)
    not stored anywhere except in key.conf
    no hashes of it are stored
    never sent offsite
    protected 400 (owner read, group none, others none)

    Here is a summary of key creation options. It is explained in more detail later in its own section.

    1. the default key is a 256-bit random value read from the system random number generators (see above, key generation). This is hex-encoded and written to key.conf as groups of 4 hex digits to make it easier to transcribe the key. Here’s an example key.conf:

      $ ls -l hb/key.conf
      -r--------  1 jim  staff  334 Feb 24 16:37 hb/key.conf
      $ cat hb/key.conf
      # HashBackup Key File - DO NOT EDIT!
      Version 1
      Build 1481
      Created Wed Feb 24 16:37:56 2016 1456349876.07
      Host Darwin | mb | 10.8.0 | Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 | i386
      Keyfrom random
      Key 8907 0c37 0d9a 8852 807c 1f26 b179 3b15 a994 c165 0313 5cbd c9a3 a500 5cd3 9b3a
    2. hb init also accepts a -k option, which allows the user to set their own key. This is potentially less secure, and both the documentation and HB itself warn about it. The option lets users set a key they can remember without copying the key.conf file to safe places. For example, hb init -c backupdir -k jim would write the key 'jim' to key.conf.

    3. to accommodate users' requests for no encryption, -k '' (two single quotes) creates a null key. Backups are still encrypted, but there is no key to remember. Users are warned that anyone with access to the encrypted backup data can recover the original, unencrypted data.

    4. a passphrase can be added to the key with the -p option. -p ask (add "ask" after -p) means to read the passphrase from the keyboard on every HB command. A passphrase secures backup data: 1) for users in hosted or managed environments, like a VPS; 2) when the backup directory is on USB thumb drives; 3) when the backup directory is on mounted storage like Google Drive, Amazon Cloud Drive, Dropbox, etc.

    5. the -p env (add "env" after -p) option means HB will read the passphrase from the environment variable HBPASS. This is more convenient than -p ask because the user can set the environment variable once in their login session. But it is less secure because any program the user runs can also read the passphrase. Users can add a command to their .profile startup file to set the environment variable, but this is potentially less secure and not recommended unless the user’s home directory is encrypted by the OS. The protection on .profile should be set to 0400 (user read, group none, others none).

  3. using the data in key.conf and procedures described in KEY CREATION, a single 256-bit value is generated, called key. Then:

    • an encryption key k1 is created by sha256(salt1 + key)

    • an authentication key k2 is created by sha256(salt2 + key)

    • salt1 and salt2 are distinct random constants built into the HB executable. The purpose of salting the key is to obfuscate the actual encryption key so that the local hb.db file cannot be opened outside of HB and accidentally modified, causing bizarre bugs.

    • hb.db is encrypted with AES-128, using the first 16 bytes of k1 (k1[0:16]) as the actual key

    • k2 is used with HMAC-SHA1 to authenticate remote files

  4. after generating the actual key, hb.db is created. It is encrypted with AES-128 in OFB mode, with a unique IV for each database page. The IV is a 4-byte page number plus 12 random bytes from an RC4-based random number generator, seeded from /dev/urandom. hb.db is created with SQLite Encryption Edition, a paid version of SQLite created by the authors of SQLite.

  5. an initial AES-128 backup key is generated using the system random number generators (see above, key generation) and stored in hb.db. This key will be used to encrypt 64GB of backup data (2^32 blocks of 16 bytes each), then a new backup key is added to the list. This is recommended for AES in CBC mode to prevent using the same key "too long".
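The key derivation in step 3 can be sketched as follows; the salt values shown are placeholders, since HB’s real salts are random constants compiled into the executable:

```python
import hashlib

# Placeholder salts: in HB these are random constants built into
# the executable, not the values shown here.
ENC_SALT  = b"example-encryption-salt"
AUTH_SALT = b"example-authentication-salt"

def derive_keys(master_key):
    """Derive the encryption key k1 and authentication key k2 from
    the 256-bit master key, per step 3 above. hb.db uses only the
    first 16 bytes of k1 as its AES-128 key."""
    k1 = hashlib.sha256(ENC_SALT + master_key).digest()
    k2 = hashlib.sha256(AUTH_SALT + master_key).digest()
    return k1[:16], k2   # AES-128 key for hb.db, HMAC-SHA1 key
```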

Key Creation

hb init creates key.conf using the -k and -p options. There are 8 possible combinations, some being more secure than others. The less secure options tend to be more convenient to the user. All options are described here from most secure to least secure. Plus signs are used for "pros" while dash signs are used for "cons".

  1. no -k, -p ask
    Store 'ask' + 256-bit random in key.conf
    Read passphrase from keyboard
    Stretch passphrase + random with pbkdf2

    User:
    - must backup key.conf
    - must remember passphrase
    - can't automate HB commands (like in a cron job)
    Security: 5
    + access to key.conf is insufficient
    + random key cannot be guessed
    + passphrase is not stored
  2. no -k, no -p (default)
    Store 256-bit random key in key.conf
    No stretch needed

    User:
    - must backup key.conf
    + nothing to remember
    + can automate HB commands
    Security: 4
    - access to key.conf gives all access
    + random key cannot be guessed
  3. no -k, -p env
    Store 'env' + 256-bit random in key.conf
    Read passphrase from environment variable
    Stretch passphrase + random with pbkdf2

    User:
    - must backup key.conf
    - must remember or store passphrase
    + can automate HB commands
    Security: 4
    - other programs have access to env vars
    - user might store key in .profile file
    + random key cannot be guessed
  4. -k jim -p ask
    Store 'ask' + jim in key.conf
    Stretch passphrase with 'jim' as salt

    User:
    + easier to reconstruct key.conf manually
    - can't automate HB commands
    Security: 3
    - salt is not random
    + salt is not HB constant
    + password is not stored
  5. -k '' -p ask
    Store only 'ask' in key.conf
    Stretch passphrase w/HB constant salt

    User:
    + doesn't need to backup key.conf
    - must remember passphrase
    + easy to reconstruct key.conf
    - can't automate HB commands
    Security: 2
    - constant salt is embedded in HB program
    - constant salt = possible table attack
    + password is not stored
  6. -k '' -p env
    Store only 'env' in key.conf
    Stretch passphrase w/HB constant salt

    User:
    + doesn't need to store key.conf
    - must remember or store passphrase
    + easy to reconstruct key.conf
    + can automate HB commands
    Security: 1
    - other programs have access to env vars
    - might set env var in .profile file
    - constant salt is embedded in HB program
    - constant salt = possible table attack
    + there is a key!
  7. -k jim
    Store 'jim' in key.conf
    Stretch key w/HB constant salt

    User:
    ? may need to store key.conf for complex keys
    ? may be easy to reconstruct key.conf for simple keys
    + can automate HB commands
    Security: 1
    - access to key.conf gives all access
    - constant salt is embedded in HB program
    - constant salt -> possible table attack
    + there is a key!
  8. -k ''
    Store nothing in key.conf

    User:
    + doesn't need to store key.conf
    + nothing to remember
    + can automate HB commands
    Security: 0
    - no security if backup data is accessible

After the key.conf file is created with one of these combinations of -k and -p, the actual encryption keys are generated. The procedure is:

  • the key.conf file is read

  • for passphrase type 'ask', the user is prompted for a passphrase

  • for passphrase type 'env', the HBPASS shell variable is read

  • for ask and env, there may be a salt in the key.conf file. If there is, it is used. If not, a constant built into HB is used instead. This constant was generated from /dev/random.

  • for passphrases (-p) and user-specified keys (-k jim), pbkdf2 is used to stretch the passphrase and/or key, with many thousands of iterations. For a random key with a passphrase, the 256-bit random value from key.conf serves as the salt; for -k jim -p ask/env, the salt is only 3 bytes (jim).

  • for random keys with no passphrase (no -k or -p option), pbkdf2 is unnecessary and not used.

  • the result is a 256-bit 'master key' that all actual keys are derived from (see 3. in prior hb init section)
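The stretching step can be sketched with Python’s standard library. The PRF (SHA-256 here) and the iteration count are assumptions for illustration; the document only says pbkdf2 with "many thousand iterations":

```python
import hashlib

def stretch(passphrase, salt, iterations=100_000):
    """PBKDF2 stretch of a passphrase (and/or user key) into a
    256-bit master key. PRF and iteration count are illustrative
    assumptions, not HB's published parameters."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode(),
                               salt, iterations, dklen=32)

# e.g., -k jim -p ask: the passphrase is stretched with 'jim' as salt
master_key = stretch("my passphrase", b"jim")
```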

Backup Walkthrough

The backup function reads through the filesystem, breaks files into chunks, stores metadata in hb.db, stores file data in HB archive files, and optionally sends files offsite. Block creation is described here; sending files offsite is described later.

  1. Split the file to be backed up into blocks of either fixed or variable size, depending on the file type and options set.

  2. For each block, compute SHA1(data).

  3. Look up the SHA1 hash in the dedup table. If found, increment the block’s reference count without writing it to the archive file.

    NOTE: the dedup table is a local file and never sent offsite. But before blindly using results of this dedup lookup, the hb.db database is consulted to verify this block’s SHA1 hash matches the value in the dedup table. If not, the dedup lookup fails. This prevents the situation where someone hacks the local dedup table, or the dedup table logic has a bug, which could cause backups to be scrambled. This would be detected on restore and selftest, because file hashes would not match, but would not be detected at backup time without this extra check.

  4. For a new SHA1 hash not seen before, data is written to the archive file. The steps are:

    1. compress the data block

    2. get 256-bits (32 bytes) from the HB RNG

    3. use 128 bits (16 bytes) as the AES IV

    4. use the rest for CBC padding

    5. encrypt with AES-128-CBC

    6. write the IV and encrypted block to the archive file

  5. record the new block metadata in hb.db

Sending Archives Offsite

Archive files are sent offsite as is, without modification. The transport method (ftp, rsync, etc.) is used to transfer the files, and an entry is logged in dest.db on success. Details about the archive file format are discussed later. The integrity protection for archive data is the SHA1 hash stored in hb.db for each block, and the SHA1 hash for each file backed up, also stored in hb.db.

Sending hb.db Offsite

Compared to archive files, it’s much harder to send hb.db offsite because it is one large file that grows as backups are added. To avoid using more bandwidth on every backup, hb.db is sent incrementally to remotes (only modified data is sent). These increments are numbered hb.db.0, hb.db.1, etc. The original hb.db is recovered by combining these increments.

Each increment is compressed and encrypted with AES-128, and contains several hashes / HMACs:

  1. public SHA1 digest of the entire increment file

  2. HMAC-SHA1 digest of this SHA1 digest, keyed with the k2 authentication key

  3. HMAC-SHA1 digest of the original hb.db file, keyed with the k2 authentication key.

The first public SHA1 digest allows anyone to verify the integrity of the hb.db.n file.

Since the public SHA1 digest is over the entire hb.db.n file, it is possible to change the file data and update this digest. The private HMAC-SHA1 digest makes this impossible without knowing the authentication key.

It would still be possible for someone on the remote side to copy hb.db.1 over hb.db.0. This would verify with both the SHA1 and HMAC-SHA1, so another HMAC-SHA1 is added - this one over the original hb.db file. In addition to catching the copy problem, it also may catch software bugs, for example, if an increment is applied improperly, in the wrong order, twice or perhaps not at all.

After the hb.db.n increment file is created, it is uploaded to remotes like archive files.
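The three digests attached to each increment can be sketched as:

```python
import hashlib, hmac

def increment_digests(increment_data, full_hbdb_data, k2):
    """Sketch of the three digests on each hb.db.n increment:
    1. a public SHA1 of the increment file itself,
    2. an HMAC-SHA1 of that digest, keyed with k2,
    3. an HMAC-SHA1 of the full original hb.db, keyed with k2
       (catches a stale hb.db.n copied over another, or increments
       applied out of order)."""
    public_sha1  = hashlib.sha1(increment_data).digest()
    hmac_of_sha1 = hmac.new(k2, public_sha1, hashlib.sha1).digest()
    hmac_of_db   = hmac.new(k2, full_hbdb_data, hashlib.sha1).digest()
    return public_sha1, hmac_of_sha1, hmac_of_db
```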

Sending dest.db Offsite

The dest.db file is a manifest of all files on each remote.

The information in dest.db is not particularly sensitive. It tells which archive files and hb.db.n increments are on which remotes, the size of these files, and the dates they were transferred. It is encrypted like the hb.db file, with AES-128, using the same key as hb.db.

Like archive files, the dest.db file is copied directly to remote destinations. After the copy, more I/O may occur to the local dest.db, so the local and remote dest.db may not always match exactly.

Restoring Files

Restoring a file uses local backup data if available, or downloads arc files from remote destinations if necessary.

During restore, hb.db is used to get a file’s original metadata (permissions, etc) and a list of blocks that make up the file data.

For each block in the file:

  • read the block from an archive file

  • the block is decrypted using the backup key stored in hb.db

  • the block is unpadded if necessary

  • it is decompressed if necessary

  • a SHA1 is computed and compared with the block’s original SHA1 in hb.db

  • if different, an error occurs and the restore aborts for this file

  • otherwise, the block is written to disk and the SHA1 file hash is updated

  • restore continues with the remaining blocks

As a precaution against a padding oracle attack (which doesn’t strictly apply to HB because the key is present locally, but it is better to be cautious), no specific error is raised when depadding or decompression fails. Instead, the only error reported is "block hash mismatch".

After all blocks have been written to the restored file, the restored file’s SHA1 hash is compared with the original SHA1 hash created during backup. Because this is a file hash, it can detect errors such as restore bugs, dedup problems, and so on. For example, in the very unlikely event there is a SHA1 block hash collision during dedup, this file hash will detect it, although at that point nothing can be done to fix it.

Recovering A Backup Directory

If the local backup directory is lost because the hard drive crashes, the computer is stolen, a fire occurs, etc., the only copy of the backup is on remote destinations. To recover it requires:

  • the key.conf file

  • the dest.conf file (connection parameters for remotes)

If a -p 'ask' or 'env' passphrase was used, then the user needs to know the passphrase too.

Recover connects to a destination chosen by the user and downloads dest.db. This contains a list of all files backed up at the destination. The hb.db.n increments are downloaded, verified, and applied to re-create the original hb.db database. The archive files may or may not be downloaded, depending on options used. When the local backup directory has been re-created, backups and restores can occur again.

Details for recovering each file follow.

Recovering dest.db

The dest.db file is encrypted like hb.db, but is a relatively small file so is sent and retrieved "as is", like arc files.

It would be possible to attack this on the remote by saving a dest.db and re-installing it every day on the remote, ignoring uploads of newer versions. Or, someone administering the remote storage could simply delete all of the backup data. These types of attacks are possible whenever remote storage is used, and it is hard or impossible to protect against this kind of remote manipulation. One solution is to send backup data to multiple remote sites, so that if one misbehaves, others are still available.

Another option might be to display, log, or email some kind of HMAC fingerprint for the dest.db and hb.db files after every backup. Recover could display its calculated fingerprint and the user could decide whether they match. This could be implemented at any time, but for now, it seems like overkill and doubtful that users would follow this kind of procedure, so it would add little extra security.

Generating hb.db from hb.db.N Increments

This is basically the reverse process of generating an increment. Before processing an hb.db.n increment file, the SHA1 hash and HMAC of the file is verified. The recovery aborts if there is an error in either of these.

Data from the increment is decrypted, decompressed, and stored in hb.db.

After all increments have been restored, an HMAC is taken over the entire hb.db file. This must match the 2nd HMAC recorded in the last increment. If it doesn’t, an error is displayed and the recover aborts.

Rekey: Generating A New key.conf

HB has a rekey operation to generate a new key.conf key. In its default operation, hb rekey -c backupdir will generate a new 256-bit random key, follow the procedure described above under 'init' for generating keys, then re-encrypt the hb.db and dest.db files. This re-encryption occurs within a database transaction, so it’s an all-or-nothing operation. A commit makes the changes permanent, the old key file is renamed key.conf.orig, and the new key.conf file is installed. Rekey has special handling for interrupts occurring at any time during the operation, rolling it back.

No archive data is modified on a rekey operation. Archive blocks are encrypted with backup keys, which are stored in hb.db, not key.conf. These are not changed on a rekey operation. The goal of rekey is that anyone with an old copy of key.conf can no longer access the backup after the rekey.

Archive Files

Archive files are a collection of encrypted blocks.

Before any data block is used from an archive file, a SHA1 hash is computed from the decrypted, decompressed data, and then compared to the SHA1 hash originally stored in hb.db for that block. This acts like an HMAC: an attacker changing archive data would also have to update the SHA1 in hb.db, and to do that, would need the key.

An attacker with write access to an archive on a remote could also just delete it, causing data loss. This can be mitigated by copying backups to multiple destinations. They cannot cause incorrect data to be restored because of the block and file hashes stored in hb.db.