Exclude Rules

HashBackup uses a text file, inex.conf, located in the backup directory, to list files and directories to exclude from the backup. It’s typical to exclude temporary files, unimportant files, cache directories, database files, VM memory image files, and swap or paging files. It may seem surprising that database files are not backed up, but to get a usable backup, a database is usually dumped with a database utility while a transaction read lock is held, and then the dump is backed up rather than the actual database files.

Exclude File Format

The exclude file is a simple text file with each line being:

  • a blank line which is ignored

  • a comment line beginning with #, also ignored

  • a rule line with a rule type and rule pattern

The rule type can be abbreviated to any prefix of its full name, from a single letter up to the whole name, so the exclude rule type can be written as e, ex, exc, and so on. There are 4 rule types, explained below. The rule pattern follows the rule type and describes a filename or pathname to be excluded. Every rule line must have both a rule type and a rule pattern, with one exception: a rule line beginning with / is shorthand for the x rule type. This allows absolute pathnames, perhaps from the find command, to be used directly as exclude rules.
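
The line format above can be illustrated with a short parser. This is a minimal Python sketch, not HashBackup's own parser: the names parse_rule_line and RULE_TYPES are hypothetical, and it assumes that an abbreviation is any leading prefix of a full rule type name and that whitespace around a line is not part of the pattern.

```python
from typing import Optional, Tuple

# Full rule type names as listed in this document.
RULE_TYPES = ("xclude", "glob", "regex", "exclude")


def parse_rule_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (rule_type, pattern), or None for blank and comment lines."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    if text.startswith("/"):
        # Shorthand: a bare absolute pathname is treated as an x rule.
        return ("xclude", text)
    parts = text.split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"rule needs a type and a pattern: {line!r}")
    abbrev, pattern = parts
    # Assumption: any prefix of a full name selects that rule type.
    matches = [name for name in RULE_TYPES if name.startswith(abbrev)]
    if len(matches) != 1:
        raise ValueError(f"unknown rule type: {abbrev!r}")
    return (matches[0], pattern)


print(parse_rule_line("g /home/*/.cache/"))   # ('glob', '/home/*/.cache/')
print(parse_rule_line("/home/jim/abc*def"))   # ('xclude', '/home/jim/abc*def')
```

Because xclude and exclude start with different letters, every prefix in this sketch selects exactly one rule type.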

HashBackup is very efficient with exclude wildcards. Exclude files containing thousands of rules, even with wildcards, will not cause performance issues. Sites often use automated tools to generate exclude rules or share exclude files between multiple users, so an exclude file can become quite large.

Initial Exclude File

When a backup directory is created with hb init, a default rule file is created. This file is system-specific for the Mac and Linux and contains 10-15 rules that apply to those systems. If the exclude file is deleted, the backup command will install a new default rule file. For no exclusions, use an empty rule file or one with only comments.

Example Exclude File (Linux):

g /home/*/.cache/
g /home/*/.gvfs
g *.vmem
x /proc/
x /tmp/
x /var/tmp/

Filenames vs Directory Names

The rule pattern that follows the rule type can describe either filenames or directories. If there is no slash in the pattern, it will exclude a file or directory anywhere in the filesystem. If there is only a trailing slash, the pattern excludes that directory’s contents anywhere in the filesystem.

Rule Types

There are 4 rule types plus one shortcut rule type:

  • x or xclude or /

  • g or glob

  • r or regex

  • e or exclude (for compatibility)

x xclude Rules

The x rule type is for non-wildcard patterns. Any characters are allowed in the pattern that are legal in pathnames, without special meaning, so a star in the pattern matches a star in the pathname.

A pattern without slashes matches anywhere in the filesystem, for example:

x abc.tmp

matches all abc.tmp files and directories anywhere.

An x pattern ending in slash excludes a directory’s contents. Sometimes it is useful to back up a directory so that it is re-created on restore, but it is not necessary to back up the directory contents. If a directory pattern ends with /, then the directory itself is saved but not the contents. If a pattern matches a directory and doesn’t end with /, then neither the directory nor its contents are saved. For example:

x /home/jim/.cache/
x /home/jim/.cache

The first rule will save an empty .cache directory without its contents, and if /home/jim is restored, an empty .cache directory will be created. The second rule will not save anything about the .cache directory, and if /home/jim is restored, there will be no .cache directory.

Absolute pathnames beginning with a slash can be used as shorthand for the x rule, without a rule type. This is convenient when the Unix find command is used to build exclude files. For example, note the missing x in:

/home/jim/abc
/home/jim/def
/home/jim/abc*def

The last rule is not a wildcard but matches the file named abc*def in jim’s directory.
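
The x rule behavior described in this section can be modeled in a few lines. The following is a rough Python sketch under stated assumptions, not HashBackup's implementation: the function name x_rule_excludes is hypothetical, paths are assumed to be absolute, a pattern containing a slash is treated as an absolute pathname, and the sketch cannot tell whether a path is a file or a directory.

```python
def x_rule_excludes(pattern: str, path: str) -> bool:
    """Return True if a literal x pattern would exclude the absolute path."""
    contents_only = pattern.endswith("/")   # trailing slash: keep the directory
    target = pattern.rstrip("/")
    parts = path.strip("/").split("/")
    # Check the path itself and every directory above it.
    for i in range(1, len(parts) + 1):
        prefix = "/" + "/".join(parts[:i])
        is_self = i == len(parts)
        if "/" in target:
            hit = prefix == target           # absolute pathname, compared literally
        else:
            hit = parts[i - 1] == target     # bare name matches anywhere
        if hit and not (contents_only and is_self):
            return True
    return False


print(x_rule_excludes("/home/jim/.cache/", "/home/jim/.cache"))      # False: directory kept
print(x_rule_excludes("/home/jim/.cache/", "/home/jim/.cache/a/b"))  # True: contents dropped
print(x_rule_excludes("/home/jim/.cache", "/home/jim/.cache"))       # True: directory dropped
print(x_rule_excludes("/home/jim/abc*def", "/home/jim/abcXdef"))     # False: star is literal
```

The comparison uses plain string equality throughout, which is why a star in the pattern only matches a star in the pathname.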

g glob Rules

The g rule type is the most commonly used and allows wildcards in the pattern, making it possible to exclude a file for all users, exclude all temporary files, all picture files, etc. There are several wildcards in glob rules:

  • ? matches any single character

  • * matches zero or more characters except slash

  • ** matches zero or more characters including slash

  • /**/ matches zero or more directories

  • [abc] matches one character either a, b, or c

  • [!abc] matches one character not a, b, nor c

  • [a-zA-J] matches one character a-z or A-J

As with x rules, glob patterns without slashes match anywhere, for example:

g *.tmp
g abc.xyz

These rules would exclude any file or directory with a .tmp extension anywhere in the filesystem and exclude the file or directory abc.xyz anywhere in the filesystem. An x rule could have been used for the 2nd rule. The main reason to use an x rule is that the pathname contains characters that would otherwise act as wildcards, such as star, bracket, or question mark characters.

A typical use of glob patterns is to match for all users. For example, on Linux:

g /home/*/.cache/
g /home/**/.cache/

The first rule excludes the .cache directory contents for all users. It only applies to .cache directories directly beneath the users' home directories. The 2nd rule excludes the contents of .cache directories anywhere under /home, including /home/.cache and /home/jim/app/.cache.

To exclude all .cache directory contents anywhere, the rules below are all equivalent. Obviously the simplest is the best, but the examples help explain (maybe!) what is going on:

g .cache/
g .cache/*
g */.cache/
g */.cache/*
g .cache/**
g **/.cache/
g **/.cache/**
g /**/.cache/**
g /**/**/.cache/**

Sometimes it is necessary to use glob rules with filenames that contain the special wildcard characters *, ? and [. This is possible by quoting or escaping the special character with brackets. For example:

g /home/*/abc[*]def
g /home/*/abc[?]def
g /home/*/abc[[]def

These patterns match the files abc*def, abc?def, and abc[def in users' home directories.
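
One way to build intuition for the glob wildcards is to translate them into regular expressions. The sketch below is an illustration only and is not how HashBackup is known to implement g rules: the names glob_to_regex and glob_matches are hypothetical, ? is assumed not to match a slash, and only absolute patterns are compared against a complete pathname. The "matches anywhere" and trailing-slash behaviors are not modeled.

```python
import re


def glob_to_regex(pattern: str) -> str:
    """Translate the g-rule wildcards listed above into a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if pattern.startswith("/**/", i):
            out.append("/(?:.*/)?")      # zero or more whole directories
            i += 4
        elif pattern.startswith("**", i):
            out.append(".*")             # any characters, slashes included
            i += 2
        elif ch == "*":
            out.append("[^/]*")          # any characters except slash
            i += 1
        elif ch == "?":
            out.append("[^/]")           # one character except slash (assumed)
            i += 1
        elif ch == "[":
            start = i + 2 if pattern.startswith("!", i + 1) else i + 1
            end = pattern.find("]", start + 1)   # a class needs at least one member
            if end == -1:
                out.append(re.escape(ch))        # unmatched bracket is literal
                i += 1
            else:
                # Escape characters that are special inside a regex class.
                body = re.sub(r"([\\\[\]^])", r"\\\1", pattern[start:end])
                negate = "^" if start == i + 2 else ""
                out.append("[" + negate + body + "]")
                i = end + 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def glob_matches(pattern: str, path: str) -> bool:
    """Compare an absolute g pattern with a complete absolute pathname."""
    return re.fullmatch(glob_to_regex(pattern), path) is not None


print(glob_matches("/home/*/.cache", "/home/jim/app/.cache"))    # False
print(glob_matches("/home/**/.cache", "/home/jim/app/.cache"))   # True
print(glob_matches("/home/*/abc[*]def", "/home/jim/abc*def"))    # True
```

The bracket forms [*], [?], and [[] become one-character classes, so they match only the literal character inside them.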

r regex Rules

Regular expression patterns allow more complex matching. HashBackup uses the Python 2.7 regular expression engine, so almost anything that is legal there is legal in an exclude file. The special characters are:

. matches one character
^ matches the beginning of the pathname
$ matches the end of the pathname
* matches zero or more of the previous character or group
+ matches one or more of the previous character or group
? makes the previous character or group optional
[abc] matches one character a, b, or c
[^abc] matches one character not a, b, nor c
[a-zA-J] matches one character a-z or A-J
\ next character has no special meaning, so \* matches a star
\N (N a digit) numbered backreferences such as \1 are not supported
abc|def matches either abc or def
( begins a group, so (abc)? matches nothing or abc
) ends a group
{m} matches m repeats of the previous character or group
{m,n} matches m to n repeats of the previous character or group

There are more, but these are the most commonly used.

HashBackup uses Python’s match function, so regular expressions always have to match the beginning of a pathname. It’s as if there is an implied ^ at the beginning of each regular expression, and a ^ can be used at the beginning without changing the meaning of the pattern. Pathnames are always absolute and begin with /. Since regex patterns must always match the beginning of the pathname, it’s important that r patterns either begin with a slash or have a wildcard at the beginning that will match the slash. Examples:

r .cache/              will match only /cache/ (. means any character!)
r */.cache/            error, no character to repeat before *
r .*/.cache/           matches .cache, acache, bcache, etc contents anywhere
r .*/\.cache/          matches .cache contents anywhere (backslash nonwild periods!)
r .*/\.cache           exclude anything beginning with .cache anywhere (forgot $)
r .*/\.cache$          exclude .cache directory anywhere ($ is important here!)
r /home/*/\.cache/     matches only /home/.cache/ contents (forgot the . before *, so /* repeats the slash)
r /home/.*/\.cache/    matches .cache contents anywhere under /home, like g **
r /home/[^/]+/\.cache/ matches .cache contents in user directories, like g *
r (/abc|/def)ghi       matches /abcghi, /defghi, /abcghix/yz... and /defghitu/v...
r /(abc|def)ghi        same as above (again, no $ so a full match isn't required)
r /(abc|def)ghi$       matches only /abcghi and /defghi because of the $

Regular expressions do not have to match the complete pathname unless $ is used. If a regex needs to match the whole pathname, use $ at the end of the pattern. Some examples above show the unintended consequences of omitting the trailing $. Patterns ending in a slash do not usually use $ since they are used to exclude directory contents and it doesn’t matter what comes after the /. If a regex pattern doesn’t end in either $ or /, it’s likely wrong.

Regular expressions are powerful but can be confusing, so test them carefully. The -v4 option to hb backup shows all excluded paths and is useful for testing exclude rules. Another way to test regex rules is with Python. When the rule matches, an object is returned. When the rule doesn’t match, another prompt is displayed. The example below shows matching with and without a trailing dollar sign. The pattern is on the left, the pathname is on the right.

$ python
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.match(r'/(abc|def)ghi', '/abcdef')   <- this does not match
>>> re.match(r'/(abc|def)ghi', '/abcghi')
<_sre.SRE_Match object at 0x101b99d50>      <- this is a match
>>> re.match(r'/(abc|def)ghi', '/abcghixyzzy')
<_sre.SRE_Match object at 0x101b99dc8>
>>> re.match(r'/(abc|def)ghi$', '/abcghixyzzy')
>>>
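
The same kind of check can be scripted so that a set of patterns is tested in one run. This is a small sketch that assumes Python 3 rather than the Python 2.7 engine mentioned above; for the simple constructs used here the two are expected to behave the same. The helper name excluded and the sample pathnames are made up for illustration.

```python
import re


def excluded(pattern: str, path: str) -> bool:
    """Apply re.match, which only matches at the start of the pathname."""
    return re.match(pattern, path) is not None


# (pattern, pathname, expected result), based on the r rule examples above.
checks = [
    (r".cache/", "/cache/x", True),                  # . matches the leading slash
    (r".cache/", "/home/jim/.cache/x", False),
    (r".*/.cache/", "/home/jim/acache/x", True),     # unescaped . matches "a"
    (r".*/\.cache/", "/home/jim/acache/x", False),
    (r".*/\.cache", "/home/jim/.cache-old", True),   # no $ so a prefix is enough
    (r".*/\.cache$", "/home/jim/.cache-old", False),
    (r"/home/[^/]+/\.cache/", "/home/jim/.cache/x", True),
    (r"/home/[^/]+/\.cache/", "/home/jim/app/.cache/x", False),
    (r"/(abc|def)ghi", "/abcghixyzzy", True),
    (r"/(abc|def)ghi$", "/abcghixyzzy", False),
]

for pattern, path, expected in checks:
    assert excluded(pattern, path) == expected, (pattern, path)

# A leading * has nothing to repeat, so the pattern is rejected outright.
try:
    re.compile(r"*/.cache/")
    star_error = None
except re.error as exc:
    star_error = str(exc)
print("invalid pattern:", star_error)
```

If any expectation is wrong, the failing assertion reports the pattern and pathname involved.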

e exclude Rules

The e rule is an obsolete form of the g rule and is only for compatibility with earlier releases. e rules are very similar to g rules except for the handling of wildcards: e rule * and ? wildcards match across slashes whereas g rule * and ? do not. This is somewhat unexpected behavior, so in some cases a file of e rules can be changed as-is to g rules to get the expected behavior. Stars in g rules are faster than stars in e rules because they stop at a slash.

For the rules:

e /home/*/.cache/
g /home/*/.cache/
g /home/**/.cache/

The first rule matches .cache contents anywhere under /home user directories. It matches /home/jim/.cache/, /home/jim/app/.cache/, but not /home/.cache/. The 2nd rule matches .cache contents directly under users' directories. The 3rd rule matches .cache contents anywhere under /home, even /home/.cache/, because /**/ can match zero directories whereas /*/ matches one directory in g rules or one or more directories in e rules.
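
The difference between the three rules can be reproduced with hand-written regular expressions. The translations below are assumptions chosen to mirror the description above (an e star as `.*`, a g star as `[^/]*`, and `/**/` as an optional run of directories); they are not taken from HashBackup itself, and the sample pathnames are made up.

```python
import re

# Assumed regex stand-ins for the three rules shown above.
E_STAR = re.compile(r"/home/.*/\.cache/")             # e rule: * crosses slashes
G_STAR = re.compile(r"/home/[^/]*/\.cache/")          # g rule: * stops at a slash
G_DOUBLESTAR = re.compile(r"/home/(?:.*/)?\.cache/")  # g rule: /**/ is zero or more dirs

paths = [
    "/home/jim/.cache/f",
    "/home/jim/app/.cache/f",
    "/home/.cache/f",
]

results = {
    path: (
        E_STAR.match(path) is not None,
        G_STAR.match(path) is not None,
        G_DOUBLESTAR.match(path) is not None,
    )
    for path in paths
}

for path, (e_hit, g_hit, gg_hit) in results.items():
    print(f"{path:26} e*={e_hit!s:5} g*={g_hit!s:5} g**={gg_hit}")
```

Only the double-star form matches /home/.cache/f, and only the g single star rejects the nested /home/jim/app/.cache/f.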

Testing Rules

The Count command is an easy way to test exclude rules without having to do an actual backup. The --ex option shows paths that are excluded as well as the total number of paths included and excluded.

Performance Tips

  • ALWAYS exclude browser and other cache directories

  • exclude directories with hashed filenames and high turnover

  • fewer wildcards are faster

  • non-star wildcards are faster

  • g stars are faster than e stars

  • g doublestars are equivalent to e stars

  • g doublestars count as 1 star wildcard

  • more non-wildcard characters are faster

  • beginning with / and no wildcards is fastest

  • an x or / pattern with full pathname is fastest

  • a g or e pattern with no wildcards is equivalent to an x pattern

  • use an x pattern rather than escaping all wildcards in a g pattern

  • excluding extensions with g/e *.ext is fast

Exclude rules are evaluated during backups with Python’s regular expression engine, and a rule with more than three star wildcards may not be handled efficiently. Such a rule works, but it may affect backup performance and should be tested carefully. A g rule double star counts as one wildcard.
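
A simple check can flag g rules that exceed the three-star guideline before they reach a backup. This is a hypothetical helper, not part of HashBackup: the name count_star_wildcards is made up, a run of adjacent stars is counted once to reflect the double-star rule, and bracket expressions are removed first on the assumption that a star inside brackets is a literal.

```python
import re


def count_star_wildcards(pattern: str) -> int:
    """Count star wildcards in a g pattern; a run such as ** counts once."""
    # Drop bracket expressions so an escaped star like [*] is not counted.
    without_classes = re.sub(r"\[[^\]]+\]", "", pattern)
    return len(re.findall(r"\*+", without_classes))


for rule in ["/home/*/.cache/", "/**/**/.cache/**", "*/*/*/*.tmp"]:
    stars = count_star_wildcards(rule)
    note = "review for performance" if stars > 3 else "ok"
    print(f"{rule:20} stars={stars} {note}")
```

Of the three sample patterns, only the last has more than three star wildcards.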