                        HashBackup Security Details
                                March 10, 2016

2012-05-24: misc typos
2012-09-05: edits before release
2012-11-08: removed public CRC
2012-12-13: use /dev/random + /dev/urandom for keys
            because of long hangs on Linux VMs
2013-10-24: Since Jan 2013, dest.conf can be stored in hb.db
2014-09-30: encrypted dest.db is copied directly to remotes
2016-01-05: whole file hash changed to sha1
2016-03-10: updated text to match previously listed changes

HashBackup (HB) is a Unix backup program with these design goals:

- easy to use
- accurate backups and restores
- space-efficient storage of backup data
- bandwidth-efficient network transfers
- local and/or remote backup storage
- local backup storage for fast restores
- remote backup storage for disaster recovery
- remote storage using common protocols: ftp, ssh, rsync, S3, imap, etc.
- backup data sent directly to user's remote storage
- ability to use "dumb"/passive remote storage
- use of untrusted remote storage with client-side encryption and private keys
- authentication of restored data

This note describes HB's security procedures in detail, with the goal
of enabling others to do a security assessment.


OVERVIEW
--------

HB reads a file system, or selected files, and saves file data, file
metadata, and backup metadata to a local backup directory.  Data is
stored primarily in two places: an encrypted database named hb.db
stores metadata, and archive files store encrypted file data.

During backups, HB breaks files into blocks, suppresses redundant
blocks, compresses and encrypts blocks, and stores blocks in arc
files.  Archive files are named arc.v.n, where v is the backup number
(version) and n is a sequence number within the backup.  Archive files
are 100MB by default, but this is configurable.  So a backup of 500MB
will create arc.0.0 - arc.0.4, each ~100MB.  The next backup creates
arc.1.0, etc.
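
For example, after those two backups the local backup directory might
contain a listing like this (hypothetical):

   $ ls backupdir
   arc.0.0  arc.0.1  arc.0.2  arc.0.3  arc.0.4  arc.1.0
   dest.conf  dest.db  hb.db  key.conf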

File and backup metadata is stored in the hb.db database.  File
metadata includes Unix permission bits, ACLs, file dates and so on.
Backup metadata includes cryptographic hashes for blocks, which blocks
belong to which files, and which archive file contains a block.

hb.db accumulates historical information for all backups: it knows
which blocks are needed to reconstruct version 0 of a file, and which
blocks are needed to reconstruct the most recent version.

After a backup is finished, the local backup directory has an updated
hb.db database and new archive files containing new blocks.  If remote
destinations are configured in dest.conf (a text file), this new data
is sent offsite to all remotes.

There are other utilities to remove files from the backup, list files,
perform retention, mount a backup as if it is a Unix filesystem, etc.


SECURITY OVERVIEW
-----------------

The security goal of HashBackup is secure remote storage of backup
data to untrusted storage providers.  Securing local backup data is
not a primary goal: if someone has access to a computer, there are
many avenues to monitor and attack it, from logging keystrokes to
intercepting shared library calls, installing MITM device drivers, and
so on.

Some data in HB's local backup directory stays local, for example, the
dedup information and the main key in key.conf.  Other data may be
sent offsite, with the goal of being able to reconstruct the backup
in a disaster situation: the computer is stolen, fire, flood, or the
hard disk containing the local backup crashes.

The main security goals then are to:

- send backup data offsite
- retrieve the backup data
- ensure the data was not modified while on remote storage
- ensure that restored files are identical to the original

Different sites will have different security needs.  For example,
"remote" storage may be a company-owned FTP server connected over a
LAN.  Insecure protocols like FTP are supported because the
convenience they provide may be more important to a particular
customer than the protocol's security issues.  HashBackup provides
secure transport methods, but also supports insecure but convenient
transport methods.


REMOTE DATA
-----------

There are several types of data sent to remote storage:

1. hb.db, an encrypted database, must be sent, since it contains all
metadata.  It is potentially a large file, and grows larger with each
backup.  To avoid sending the whole file after every backup, hb.db is
sent offsite as incremental "patches".  Applying the increments in
order reconstructs the original hb.db file.

2. arc files (encrypted, compressed file data blocks) are sent offsite

3. an encrypted database, dest.db, maintains lists of HB's files that
are stored on each destination.  Its primary purpose is to avoid
needing to "list" the files on each remote, because sometimes this is
difficult, e.g., with FTP.  Whenever a file is sent to or deleted from a
remote, an entry is made in dest.db.  dest.db is sent to every remote
after each backup.  If the local backup directory is lost, dest.db is
fetched first, to tell us which other files we need to fetch and where
they are located (which remote has them).


IMPORTANT LOCAL FILES
---------------------

Two files are never sent offsite, but contain important information
that is needed to recover the backup directory from a remote site:

key.conf contains the main key.  hb init creates this key, by default
using the system random number generators (/dev/random & /dev/urandom)
to get a 256-bit key.  This is then used to derive the rest of the
keys.  The key.conf file is never sent offsite.  The key may be
protected with a passphrase.

dest.conf is a text file containing remote configuration information.
For example, connecting to a remote FTP server requires a host name,
port number, user id, and password.  Each remote destination will have
a section in dest.conf with parameters needed to connect to the
remote.  dest.conf is not encrypted, so remote account passwords and
access tokens are in the clear.  It is not possible to use hash codes
for passwords in dest.conf, because passwords must be supplied to
remote services for validation.  Users are encouraged to put very
restrictive access permissions on dest.conf.
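
For illustration only, an FTP destination entry in dest.conf might look
something like this; the keyword names here are illustrative and may not
match the actual dest.conf syntax (see the destination setup
documentation), but note that the password is stored in the clear:

   destname officeftp
   type ftp
   host ftp.example.com
   userid backupuser
   password s3cret
   dir /backups/host1

Restrictive permissions, for example chmod 600 dest.conf, limit who can
read these credentials on the local machine.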

For more security, the "hb dest load" command can be used to import
the dest.conf file into the encrypted hb.db file, then the dest.conf
file is deleted.  For best security, hb.db should also be protected
with an admin passphrase and/or main key passphrase to prevent a local
user from recovering remote passwords.

Copies of the key.conf and dest.conf text files must be stored
separately and/or printed, to be used for emergency recovery.


HB RANDOM NUMBER GENERATOR
--------------------------

HB uses random numbers to generate:
- keys
- AES initialization vectors (IV)
- AES-CBC padding

To generate keys, HB first tries to read as much of the key as
possible from /dev/random, with a non-blocking read.  If this read is
unable to return a complete key because the entropy pool is depleted,
the rest of the key is read from /dev/urandom.  While /dev/random is
preferred for key generation, it often blocks for very long times on
Linux VMs, especially right after startup, to the point of being
unusable.  On BSD machines, /dev/random and /dev/urandom are usually
equivalent and never block.
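
As a rough sketch of this strategy (not HB's actual code; Python is used
here only for illustration):

   import os

   def read_key_bytes(nbytes=32):
       # Read as much of the key as possible from /dev/random without
       # blocking, then fill the remainder from /dev/urandom.
       buf = b''
       try:
           fd = os.open('/dev/random', os.O_RDONLY | os.O_NONBLOCK)
       except OSError:
           fd = None
       if fd is not None:
           try:
               while len(buf) < nbytes:
                   chunk = os.read(fd, nbytes - len(buf))
                   if not chunk:
                       break
                   buf += chunk
           except OSError:          # EAGAIN: entropy pool depleted
               pass
           finally:
               os.close(fd)
       if len(buf) < nbytes:
           buf += os.urandom(nbytes - len(buf))
       return buf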

The other use of random data is for AES IVs and CBC padding.  Up to 32
bytes may be required for each backup data block.  This can require
a significant amount of random data during backup operations, and
might deplete the system's entropy pools.  /dev/urandom is a
non-blocking version of /dev/random, but using it also might deplete
the system's entropy pools during backups, especially on Linux VMs.

To prevent system entropy pool depletion, HB uses AES-128 in OFB mode
to generate cryptographically secure random numbers.  The RNG is
seeded with 48 bytes of data from /dev/urandom:

- 16 bytes for the AES-128 key
- 16 bytes for the AES-128 IV
- 16 bytes for the plaintext data to encrypt
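
A minimal sketch of such a generator, assuming the pycryptodome library
(HB's actual implementation may differ):

   import os
   from Crypto.Cipher import AES    # pycryptodome, assumed for illustration

   def make_rng(seed48):
       # seed48: 48 bytes from /dev/urandom = AES key | IV | plaintext block
       key, iv, block = seed48[:16], seed48[16:32], seed48[32:48]
       cipher = AES.new(key, AES.MODE_OFB, iv)
       def rand_bytes(n):
           out = b''
           while len(out) < n:
               out += cipher.encrypt(block)   # OFB keystream output
           return out[:n]
       return rand_bytes

   rng = make_rng(os.urandom(48))
   iv_and_padding = rng(32)         # e.g. 32 bytes per backup data block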

References:

Analysis of Linux RNG:
http://www.pinkas.net/PAPERS/gpr06.pdf

Using AES as a CSRNG:
http://random.mat.sbg.ac.at/publics/ftp/pub/publications/peter/aes_sub.ps


PROCEDURES WALKTHROUGH
----------------------

The main sequence of events to use HB is:

1. create a backup directory: hb init -c backupdir; only once

2. make a series of backups: hb backup -c backupdir /home

3. do file retention, remove files, list, etc. (utilities) as needed

4. rekey if needed

5. restore files

6. recover the backup from a remote (disaster recovery)


INIT WALKTHROUGH: hb init -c backupdir
--------------------------------------

1. create backupdir if it doesn't exist
   if it does exist, raise an error if it is not empty
   set protection to 700 (owner read/write/search, group none, others none)
   Here's an example backupdir:

   $ ls -ld hb
   drwx------  18 jim  staff  612 May 14 19:05 hb


2. create key.conf:
   - the string in key.conf is not used directly as an encryption key;
     it is used to derive other keys (see later)
   - it is not stored anywhere except in key.conf
   - hashes of it are not stored
   - it is never sent offsite
   - protected 400 (owner read, no group or other access)
  
   Here is a summary of key creation options.  It is explained in more
   detail later in its own section.
  
   a. the default key is a 256-bit random value read from the system
   random number generators (see above, key generation).  This is
   hex-encoded and written to key.conf as groups of 4 hex digits, to
   make it easier to transcribe the key.  Here's an example key.conf:

   $ ls -l hb/key.conf
   -r--------  1 jim  staff  334 Feb 24 16:37 hb/key.conf

   $ cat hb/key.conf
   # HashBackup Key File - DO NOT EDIT!
   Version 1
   Build 1481
   Created Wed Feb 24 16:37:56 2016 1456349876.07
   Host Darwin | mb | 10.8.0 | Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 | i386
   Keyfrom random
   Key 8907 0c37 0d9a 8852 807c 1f26 b179 3b15 a994 c165 0313 5cbd c9a3 a500 5cd3 9b3a

   b. hb init also accepts a -k option.  This allows the user to set
   their own key.  It is potentially less secure, and the instructions
   and hb warn about this.  This option lets users set a key they will
   be able to remember, without having to copy the key.conf file to
   safe places.  For example, hb init -c backupdir -k jim would write
   the key 'jim' to key.conf.

   c. users requested "no encryption"; to accommodate that, -k '' can
   be used to create a null key.  Everything is still encrypted (see
   key section), but there is no key to remember.  Users are warned
   that anyone with a copy of their backup data can read it with any
   copy of HB.

   d. for users in hosted or managed environments, like a VPS, a
   passphrase can be added to the key with the -p option.  -p 'ask'
   means to read the passphrase from the keyboard on every hb command.
   A passphrase also secures backup data when the entire backup
   directory is stored on USB thumb drives, or mounted storage like
   Google Drive, Amazon Cloud Drive, Dropbox, etc.

   e. the -p 'env' option means HB will read the passphrase from the
   environment variable HBPASS.  This is more convenient than -p ask,
   because the user can set the environment variable once in their
   login session.  But it is less secure because any program the user
   runs can also read the passphrase.  Users can add a command to their
   .profile startup file to set the environment variable, but this is
   potentially less secure and not recommended unless the user's home
   directory is encrypted by the OS.  The protection on .profile should
   be set to 0400 (owner read, no group or other access).


3. by using the data in key.conf and procedures described in KEY
   CREATION, a single 256-bit value is generated, called key.  Then:

   - an encryption key k1 is created by sha256(salt1 + key)

   - an authentication key k2 is created by sha256(salt2 + key)

   - the salts (salt1, salt2) are random constants built into the HB
     executable.  The purpose of salting the key is to obfuscate the
     actual encryption key, so that the database cannot be opened
     outside of hb and accidentally modified, causing bizarre bugs.
     (A sketch of this derivation follows this walkthrough.)

   - hb.db is encrypted with an actual AES-128 key of k1[0:16]

   - k2 is used with HMAC-SHA1 to authenticate remote files

4. after generating actual keys, hb.db is created.  It is encrypted
   with AES-128 in OFB mode, with a unique IV for each database page.
   The IV is a 4-byte page number plus 12 random bytes from an
   RC4-based random number generator, seeded from /dev/urandom.

5. an initial AES-128 backup key is generated using the system random
   number generators (see above, key generation) and stored in hb.db.
   This key will be used to encrypt 64GB of backup data (2^32 blocks
   of 16 bytes each), then a new backup key is added to the list.
   This is recommended for AES in CBC mode to prevent using the same
   key "too long".


KEY CREATION DETAILS
--------------------

hb init creates key.conf by combining the -k and -p options.  There
are 8 possible combinations, some being more secure than others.  The
less secure options also tend to be more convenient to the user.  They
are described here from most secure to least secure.  Dash signs are
used for "cons", while plus signs are used for "pros".

1. no -k, -p ask
   Store 'ask' + 256-bit random in key.conf
   Read passphrase from keyboard
   Stretch passphrase + random with pbkdf2

   User:
   - Must backup key.conf
   - Must remember passphrase
   - Can't automate hb commands (like in a cron job)

   Security: 5
   + access to key.conf is insufficient
   + random key cannot be guessed
   + passphrase is not stored

2. no -k, no -p (default)
   Store 256-bit random key in key.conf
   No stretch needed

   User:
   - Must backup key.conf
   + Nothing to remember
   + Can automate hb commands

   Security: 4
   - access to key.conf gives all access
   + random key cannot be guessed

3. no -k, -p env
   Store 'env' + 256-bit random in key.conf
   Read passphrase from environment variable
   Stretch passphrase + random with pbkdf2

   User:
   - Must backup key.conf
   - Must remember or store passphrase
   + Can automate hb commands

   Security: 4
   - other programs have access to env vars
   - might set env var in .profile file
   + random key cannot be guessed

4. -k jim -p ask
   Store 'ask' + jim in key.conf
   Stretch passphrase with 'jim' as salt

   User:
   + Easier to reconstruct key.conf manually
   - Can't automate hb commands

   Security: 3
   - salt is not random
   + salt is not hb constant
   + password is not stored

5. -k '' -p ask
   Store only 'ask' in key.conf
   Stretch passphrase w/hb constant salt

   User:
   + Doesn't need to backup key.conf
   - Must remember passphrase
   + Easy to reconstruct key.conf
   - Can't automate hb commands

   Security: 2
   - constant salt is embedded in hb program
   - constant salt = possible table attack
   + password is not stored

6. -k '' -p env
   Store only 'env' in key.conf
   Stretch passphrase w/hb constant salt

   User:
   + Doesn't need to store key.conf
   - Must remember or store passphrase
   + Easy to reconstruct key.conf
   + Can automate hb commands

   Security: 1
   - other programs have access to env vars
   - might set env var in .profile file
   - constant salt is embedded in hb program
   - constant salt = possible table attack
   + there is a key!

7. -k jim
   Store 'jim' in key.conf
   Stretch key w/hb constant salt

   User:
   ? May need to store key.conf (depends on chosen key length)
   ? May be easy to reconstruct key.conf (depends on chosen key length)
   + Can automate hb commands

   Security: 1
   - access to key.conf gives all access
   - constant salt is embedded in hb program
   - constant salt -> possible table attack
   + there is a key!

8. -k ''
   Store nothing in key.conf

   User:
   + Doesn't need to store key.conf
   + Nothing to remember
   + Can automate hb commands

   Security: 0
   - no security if backup data is accessible

After the key.conf file is created with one of these combinations of
-k and -p, the actual encryption keys are generated.  The procedure
is:

   - the key.conf file is read

   - for passphrase type 'ask', the user is prompted for a passphrase

   - for passphrase type 'env', the HBPASS shell variable is read

   - for ask and env, there may be a salt in the key.conf file.  If
     there is, it is used.  If not, a constant built into hb is used
     instead.  This constant was generated from /dev/random.

   - for passphrases (-p) and user-specified keys (-k jim), pbkdf2 is
     used to stretch the passphrase and/or key.  A 256-bit salt is
     used, with many thousand iterations.  For -k jim -p ask/env, the
     salt would only be 3 bytes (jim).

   - for random keys with no passphrase (no -k or -p option), pbkdf2
     is unnecessary and not used.

   - the result is a 256-bit 'master key' that all actual keys are
     derived from (see 3. in prior hb init section)
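
For illustration, the stretching step above might be sketched like
this; the PRF and iteration count are assumptions ("many thousand
iterations" is all that is specified):

   import hashlib

   def stretch(passphrase, salt, iterations=100000):
       # Derive the 256-bit master key from a passphrase and/or user key.
       # The PRF (SHA256) and iteration count here are assumptions.
       return hashlib.pbkdf2_hmac('sha256', passphrase, salt,
                                  iterations, dklen=32)

   # e.g. for "-k jim -p ask", the salt is b'jim' and the passphrase
   # comes from the keyboard; for a random key with -p ask, the salt is
   # the 256-bit random value stored in key.conf.
   master_key = stretch(b'passphrase typed by user', b'jim')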


BACKUP WALKTHROUGH: hb backup -c backupdir /
--------------------------------------------

The backup function reads through the filesystem, breaks files into
chunks, stores metadata in hb.db, stores file data into HB archive
files, and optionally sends files offsite.  We'll discuss block
creation here; sending offsite is described later.

1. A file is being backed up.  It is split into blocks of either fixed
   or variable size, depending on the file type and options set.

2. For each block, compute sha1(data).

3. Look up the sha1 in the dedup table.  If found, we refcount this block
   and are done - nothing is written to the archive file.

   NOTE: the dedup table is a local file and never sent offsite.  But
   before blindly using results of this dedup lookup, the hb.db
   database is consulted to make sure this block's sha1 hash matches
   the value in the dedup table.  If not, the dedup lookup fails.
   This prevents the situation where someone hacks the local dedup
   table, or the dedup table logic has a bug, which could cause
   backups to be scrambled.  This situation would be detected on
   restore and selftest, because *file* hashes would not match, but
   would not be detected at backup time without this extra check.

4. For a new sha1 not seen before, data is written to the archive
   file.  The steps are:
   a. compress the data block
   b. get 256-bits (32 bytes) from the hb RNG
   c. use 128 bits (16 bytes) as the AES IV
   d. use the rest for CBC padding
   e. encrypt with AES-128-CBC
   f. write the IV and encrypted block to the archive file

5. record the new block metadata in hb.db
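
A condensed sketch of steps 2 and 4 (illustrative only, assuming
pycryptodome; the compression algorithm and exact padding layout are
assumptions, not HB's documented format):

   import hashlib, zlib
   from Crypto.Cipher import AES    # pycryptodome, assumed for illustration

   def backup_block(backup_key, data, rng):
       # rng(n) is the AES-OFB generator described earlier
       block_sha1 = hashlib.sha1(data).digest()     # step 2, kept in hb.db
       comp = zlib.compress(data)                   # 4a. compress (assumed)
       rand = rng(32)                               # 4b. 32 random bytes
       iv = rand[:16]                               # 4c. AES IV
       padlen = 16 - len(comp) % 16                 # 4d. pad to 16 bytes
       comp += rand[16:16 + padlen - 1] + bytes([padlen])  # layout assumed
       ct = AES.new(backup_key, AES.MODE_CBC, iv).encrypt(comp)  # 4e.
       return block_sha1, iv + ct                   # 4f. written to arc file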


SENDING ARCHIVES OFFSITE
------------------------

When archive files are sent offsite, nothing special is done.  The
transport method (ftp, rsync, etc) is used directly to transfer the
archives and an entry is logged in dest.db on success.  Details about
the archive file format are discussed later.  The main protection for
archive data is the sha1 hash stored in hb.db for each block, and the
sha1 hash for each file, also stored in hb.db.


SENDING HB.DB OFFSITE
---------------------

It's much trickier to send hb.db offsite, because it is a single,
potentially large file that grows over time as the backup grows.  To
avoid using more bandwidth on every backup, hb.db is sent to remotes
incrementally (only modified data is sent).  These increments are
numbered hb.db.0,
hb.db.1, etc.  By combining these increments, the original hb.db can
be constructed.

Each increment is compressed and encrypted with AES-128, and contains
several hashes / HMACs:

 1. public SHA1 digest of the entire increment file

 2. HMAC-SHA1 digest of this SHA1 digest, keyed with the k2
    authentication key

 3. HMAC-SHA1 digest of the original hb.db file, keyed with the k2
    authentication key.

The first public SHA1 digest allows anyone to verify the integrity of
the hb.db.n file.

Since the public SHA1 digest is over the entire hb.db.n file, it is
possible to change the file data and update this digest.  The private
HMAC-SHA1 digest makes this impossible without knowing the
authentication key.

It would still be possible for someone on the remote side to copy
hb.db.1 over hb.db.0.  This would verify with both the SHA1 and
HMAC-SHA1, so another HMAC-SHA1 is added - this one over the original
hb.db file.  In addition to catching the copy problem, it also may
catch software bugs, for example, if an increment is applied
improperly, in the wrong order, twice or perhaps not at all.
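
Using Python's standard hashlib and hmac modules, the three check
values above can be sketched as follows (names are illustrative):

   import hashlib, hmac

   def increment_check_values(increment_bytes, hbdb_bytes, k2):
       public_sha1 = hashlib.sha1(increment_bytes).digest()            # 1.
       keyed_sha1 = hmac.new(k2, public_sha1, hashlib.sha1).digest()   # 2.
       hbdb_hmac = hmac.new(k2, hbdb_bytes, hashlib.sha1).digest()     # 3.
       return public_sha1, keyed_sha1, hbdb_hmac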

After the hb.db.n increment file is created, it is uploaded to the
remote just like archive files, using the configured transport method.


SENDING DEST.DB OFFSITE
-----------------------

The dest.db file is a manifest of all files sent or waiting to be sent
to each remote.

The information in dest.db is not particularly sensitive.  It tells
which archive files and hb.db.n increments are on which remotes, the
size of these files, and the dates they were transferred.  It is
encrypted like the hb.db file, with AES-128, using the same key as
hb.db.

Like archive files, the dest.db file is copied directly to remote
destinations.  After the copy, more I/O may occur to the local
dest.db, so the local and remote dest.db may not always match exactly.


RESTORING FILES
---------------

Restoring a file uses local backup data if available, or downloads
arc files from remote destinations if necessary.

During restore, hb.db is used to get a file's original metadata
(permissions, etc) and a list of blocks that make up the file data.

For each block in the file:

- the block is read from an archive file
- the block is decrypted using the backup key stored in hb.db
- the block is unpadded if necessary
- it is decompressed if necessary
- a SHA1 is computed and compared with the block's original SHA1 in hb.db
- if different, an error occurs and the restore aborts for this file
- otherwise, the block is written to disk and a SHA1 file hash is updated
- restore continues with the remaining blocks
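
A sketch of these per-block checks (illustrative only, assuming
pycryptodome; the unpadding layout mirrors the backup sketch earlier
and is an assumption).  Note how depad and decompress errors are folded
into the hash check, as explained below:

   import hashlib, zlib
   from Crypto.Cipher import AES    # pycryptodome, assumed for illustration

   def restore_block(backup_key, iv_and_ct, expected_sha1):
       iv, ct = iv_and_ct[:16], iv_and_ct[16:]
       try:
           pt = AES.new(backup_key, AES.MODE_CBC, iv).decrypt(ct)
           pt = pt[:-pt[-1]]         # strip CBC padding (layout assumed)
           data = zlib.decompress(pt)
       except Exception:
           data = b''                # depad/decompress errors not reported
       if hashlib.sha1(data).digest() != expected_sha1:
           raise ValueError("block hash mismatch")
       return data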

As a precaution against a padding oracle attack (which doesn't exactly
apply to HB because the key is present, but we're being cautious), no
error is raised if an error occurs during depadding or decompression.
Instead, the only error reported is "block hash mismatch".

After all blocks have been written to the restored file, the file SHA1
hash is compared with the original SHA1 hash created during backup.
Because this is a file hash, it can detect errors such as restore
bugs, dedup problems, and so on.  For example, in the very unlikely
event there is a SHA1 block hash collision during dedup, this file
hash will detect it, although at that point nothing can be done to
fix it.


RECOVERING A BACKUP DIRECTORY
-----------------------------

If the local backup directory is lost because the hard drive crashes,
the computer is stolen, a fire occurs, etc., the only copy of the
backup is on remote destinations.  To retrieve one of these requires:

- the key.conf file
- the dest.conf file (connection parameters for remotes)

If a -p 'ask' or 'env' passphrase was used, then the user needs to
know the passphrase too.

Recover connects to a destination chosen by the user and downloads
dest.db.  This contains a list of all files backed up at the
destination.  The hb.db.n increments and all archive files are
downloaded.  The hb.db.n increments are verified and applied to
re-create the original hb.db database.  When the local backup
directory has been re-created, backups and restores can occur again.

Details for recovering each file follow.


RECOVERING DEST.DB
------------------

The dest.db file is encrypted like hb.db, but it is a relatively small
file, so it is sent and retrieved "as is", like arc files.

It would be possible to attack this on the remote by saving a dest.db
and re-installing it every day on the remote, ignoring uploads of
newer versions.  One option to detect this might be to display, log,
or email some kind of HMAC fingerprint after every backup that covers
the dest.db and hb.db files.  Recover could display its calculated
fingerprint and the user could decide whether they match.  This could
be implemented at any time, but for now, it seems like overkill and
doubtful that users would follow this kind of procedure, so it would
add little extra security.


GENERATING HB.DB FROM HB.DB.n INCREMENTS
----------------------------------------

This is basically the reverse process of generating an increment.
Before processing an hb.db.n increment file, the SHA1 hash and HMAC of
the file are verified.  The recovery aborts if there is an error in
either of these.

Data from the increment is decrypted, decompressed, and stored in
hb.db.

After all data has been restored from this increment, an HMAC is taken
over the entire hb.db file.  This should match the 2nd HMAC recorded
in each increment.  If it doesn't, an error is displayed and the
recover aborts.  This lets the user know which increment is in error.
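
A sketch of this verification loop, reusing the check values from the
earlier increment sketch; apply_patch is a caller-supplied stand-in for
HB's actual decrypt/decompress/apply step, which is not described here:

   import hashlib, hmac

   def rebuild_hbdb(increments, k2, apply_patch):
       # increments: (public_sha1, keyed_sha1, hbdb_hmac, data) tuples in
       # order hb.db.0, hb.db.1, ...  Names are illustrative.
       hbdb = b''
       for public_sha1, keyed_sha1, hbdb_hmac, data in increments:
           if hashlib.sha1(data).digest() != public_sha1:
               raise ValueError("increment SHA1 mismatch")
           if hmac.new(k2, public_sha1, hashlib.sha1).digest() != keyed_sha1:
               raise ValueError("increment HMAC mismatch")
           hbdb = apply_patch(hbdb, data)
           # the HMAC over the whole hb.db catches misordered or reused
           # increments
           if hmac.new(k2, hbdb, hashlib.sha1).digest() != hbdb_hmac:
               raise ValueError("hb.db HMAC mismatch after this increment")
       return hbdb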


REKEY: generating a new key.conf key
------------------------------------

HB has a rekey operation, to generate a new key.conf key.  In its
default operation, hb rekey -c backupdir will generate a new 256-bit
random key, follow the procedure described above under 'init' for
generating keys, then re-encrypt the hb.db and dest.db files.  This
re-encrypt occurs within a database transaction, so it's an
all-or-nothing operation.  A commit makes the changes permanent, the
old key file is renamed key.conf.orig, and the new key.conf file is
installed.  Rekey has special handling for interrupts occurring any
time during the rekey operation.

No archive data is modified on a rekey operation.  Archive blocks are
encrypted with backup keys, which are stored in hb.db, not key.conf.
These are not changed on a rekey operation.  The goal of rekey is that
anyone with an old copy of key.conf can no longer access the backup
after the rekey.


ARCHIVE FILE DETAILS
--------------------

Archive files are basically a list of encrypted blocks.

Before any data block is used from an archive file, a SHA1 hash is
computed from the decrypted, decompressed data, and then compared to
the SHA1 hash originally stored in hb.db for that block.  This acts
like an HMAC: an attacker changing archive data would also have to
update the SHA1 in hb.db, and to do that, would need the key.

An attacker with write access to an archive on a remote could also
just delete it, causing data loss.  But there does not seem to be any
way to cause incorrect data to be restored because of the block and
file hashes stored in hb.db.
