Browser Caches

Disk-based web browser caches present several unique challenges to backup programs and should be excluded whenever possible, especially in a file-server environment where hundreds or thousands of users' browser caches are present. Many of these backup challenges apply to any large collection of small files, but some are unique to browser caches:

  • cache entries are often stored in thousands of small, individual files rather than a large database of cache data

  • the default browser cache for a single user often holds tens of thousands of files. For example, the default Firefox browser cache on a sample Mac OS X laptop contained 25K files

  • backing up 1000 1K files is slower than backing up a 1MB file because many filesystem calls are required for each file backed up, so the number of system calls per byte backed up is high. This is even more noticeable in an NFS or CIFS/SMB network server environment because each system call may result in a network operation

  • small files cause slowdowns because HashBackup has to perform many database operations for each file backed up, i.e., the number of database operations per byte backed up is also high

  • the percentage of wasted space for small files is much larger. A filesystem with a typical 4K block size averages 2K wasted space per file since the last block on average will be only half full. For a 1MB file, this waste is 0.2% (2K/1M), but for a 15K file it is over 13% (2K/15K), or about 68 times higher. This high percentage of wasted disk space for small files translates into lower disk I/O efficiency

  • many more disk seeks are typically required to back up thousands of small files, and disk seeks lower throughput

  • hash codes are often used for browser cache entry filenames, for example, the SHA1 hash of a URL. A SHA1 hash is 20 bytes, but is usually hex-encoded to 40 bytes when used as a filename. Now there are not only thousands of files to back up, but each has a long 40-byte filename. The average filename on an OS X filesystem is 15 bytes, so these filenames are nearly 3x longer. Filenames are indexed for quick retrieval inside HashBackup. Long filenames create more index pages in the backup database, leading to higher RAM requirements for the database cache

  • when data is changed in ordinary files, the pathname stays the same. But if content-based hash codes are used for filenames, every change results in a new, unique pathname that has to be tracked in the backup

  • even though browser cache entries may be expired (deleted) from the browser cache stored in the filesystem, these entries will usually live much longer in the backup, depending on the deleted file retention policy. A 30-day deleted file retention policy is probably a minimum. Instead of having "just" tens of thousands of browser cache entries for a single user, inside the backup there can easily be 30x this, with most files being marked deleted

  • browser cache directories are often structured as multi-level trees using parts of the hash code filename. For example, if the cache entry filename is 23188a7e802bbdb5546ebd69b35c58bb3f8dc278, it might be stored as cache/2/3/188a7e802bbdb5546ebd69b35c58bb3f8dc278. With two levels of 1-character directory names like this, a single user’s browser cache results in 256 directories to be backed up (16 x 16). If the cache is structured with 2-character directory names, it results in 64K directories to back up (256 x 256)

  • most filesystems try to organize files on disk so they are close to their parent directory. But directories of a browser cache tree may be more scattered around the disk, causing extra seeks at backup time

  • even if a user has not signed in (so the cache is unmodified), thousands of filesystem and database operations are required on every backup to verify that no cache files have changed

  • skipping unmodified files very quickly is what makes incremental backups fast. In a typical filesystem, the modified file rate is often less than 1% of all files. But browser caches tend to turn over rapidly: it would not be unusual to modify 50-100% of the cache entry files in one day of active use, and this clearly will slow down incremental backups

  • and finally, browser cache data has extremely low value and is not worth backing up even if it were easy. It’s especially not worth it given that it has unique challenges that usually cause backup performance problems.
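The arithmetic behind two of the points above (wasted space in the last block, and the number of directories in a hash-sharded cache tree) can be checked with a short Python sketch. The function names are made up for illustration, and the "half a block wasted per file" figure is the same simplifying assumption used in the text. Computed from unrounded values, the waste ratio between a 15K file and a 1MB file comes out near 68.

```python
KB = 1024


def slack_fraction(file_size: int, block_size: int = 4 * KB) -> float:
    """Expected fraction of wasted space, assuming the last block of a
    file is half full on average (the assumption used in the text)."""
    return (block_size / 2) / file_size


def sharded_path(name: str, levels: int = 2, chars: int = 1) -> str:
    """Hypothetical cache layout: use the leading characters of the hash
    filename as directory names, `chars` characters per level."""
    parts = [name[i * chars:(i + 1) * chars] for i in range(levels)]
    return "/".join(["cache", *parts, name[levels * chars:]])


def leaf_dirs(levels: int, chars: int) -> int:
    """Number of lowest-level directories when every hex prefix is in use."""
    return (16 ** chars) ** levels


big = slack_fraction(1024 * KB)   # 1MB file
small = slack_fraction(15 * KB)   # 15K file
print(f"1MB file: {big:.2%} wasted")
print(f"15K file: {small:.2%} wasted")
print(f"ratio:    {small / big:.0f}x")

print(sharded_path("23188a7e802bbdb5546ebd69b35c58bb3f8dc278"))
print(leaf_dirs(levels=2, chars=1))   # 16 x 16
print(leaf_dirs(levels=2, chars=2))   # 256 x 256
```

Only the lowest-level directories are counted here, matching the 16 x 16 figure above; the intermediate directories add slightly to the total.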

In short, make sure to exclude all browser caches from backups. This can be done in the inex.conf file. For example, the rule:

ex /home/*/.cache/

will exclude the contents of /home/jim/.cache, /home/jeff/.cache, etc. The trailing slash causes the directory itself to get backed up but not its contents. After a restore, users will have an empty .cache directory. Without the trailing slash, the .cache directory itself would also be excluded and would be missing after a restore.
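Before adding an exclude rule, it can be useful to see how many files it would remove from the backup. The following Python sketch is not part of HashBackup. It expands a shell-style pattern with Python's own glob module, which is only assumed to approximate how the rule above matches, and counts the files and bytes under each matching directory. The function name cache_stats and the default pattern are illustrative.

```python
import glob
import os


def cache_stats(pattern="/home/*/.cache"):
    """Yield (directory, file_count, total_bytes) for each directory
    matching the glob pattern."""
    for cache_dir in sorted(glob.glob(pattern)):
        if not os.path.isdir(cache_dir):
            continue
        count = total = 0
        for root, _dirs, files in os.walk(cache_dir):
            for name in files:
                try:
                    # lstat: do not follow symlinks out of the cache tree
                    size = os.lstat(os.path.join(root, name)).st_size
                except OSError:
                    continue  # entry vanished mid-scan; caches churn
                count += 1
                total += size
        yield cache_dir, count, total


if __name__ == "__main__":
    for directory, count, total in cache_stats():
        print(f"{directory}: {count} files, {total / 1024:.0f}K")
```

Running it as a user who can read every home directory gives a per-user count of the files the rule would keep out of the backup.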

Avoiding Server-based Browser Caches

Server-based browser caches cause headaches for backup programs, and likely cause problems for the server itself for many of the same reasons. Here are several options to avoid storing browser caches on a file server:

  • most browsers have a disk-based cache and a memory cache. The disk-based cache can be disabled and the memory cache enlarged, often without any noticeable performance change. This distributes the browser cache load back to the client rather than the file server

  • users may have their home directory on a file server, and this may be the default location of the disk-based browser cache. But the location of the cache can be changed to a directory local to the client, again distributing the load back to clients rather than the file server

  • a site or department-wide caching server can be deployed to provide web caching services to users after the file-server-based caches have been turned off. This removes the load from the file server. The caching server should still be backed up to preserve its configuration, but the cache itself is not backed up.

Required Browser Cache Backups

If there is a policy requiring the backup of users' browser caches, it will likely cause performance problems. One option that might help is to exclude the browser cache directories from the normal HashBackup backup and use another tool like tar to create one large file for each user that consolidates their browser cache data. Then back that up in the normal system backup. The advantages are:

  • this tar file can be created on a different schedule, maybe only once per week instead of every day

  • the tar file will have the same filename every day, solving the long filename and unique (hash code) filename problems

  • many consolidating tars can be run in parallel to improve scalability on large servers with thousands of users

  • the consolidating tar backup can be performed only for users that have recently logged in

  • the consolidating backup can use special, cache-specific information. For example, there is usually a cache index file that is changed whenever the cache is changed. If the cache index file has not changed, it’s not necessary to scan the browser cache tree or generate a new consolidating backup

If attempting this consolidating tar browser cache backup idea, be aware that:

  • the consolidating backup should not be compressed, because that will defeat HashBackup’s dedup capabilities

  • dedup will work to an acceptable degree on the consolidated tar file, but because browser entries are typically very small and the average dedup block size is 48K, dedup will not be as effective as it is with individual files

  • an additional tar step will be required to restore a user’s browser cache if a user’s home directory is restored
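The consolidating step described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a HashBackup feature: the function name and arguments are hypothetical, and the location of the cache index file depends on the browser, so it is passed in by the caller. The tar is written uncompressed, always to the same path, and is skipped when the index file is not newer than the existing tar.

```python
import os
import tarfile


def consolidate_cache(cache_dir, index_file, tar_path):
    """Write cache_dir into one uncompressed tar file at tar_path.

    Returns True if a new tar was written, or False if the existing tar
    is at least as new as the cache index file. tar_path must be outside
    cache_dir so the archive does not include itself.
    """
    if os.path.exists(tar_path) and os.path.exists(index_file):
        if os.path.getmtime(index_file) <= os.path.getmtime(tar_path):
            return False  # index unchanged: no need to scan the tree

    tmp_path = tar_path + ".tmp"
    # Mode "w" writes a plain tar with no compression, so the backup
    # program's dedup still has a chance to find matching blocks.
    with tarfile.open(tmp_path, "w") as tar:
        tar.add(cache_dir, arcname=os.path.basename(cache_dir.rstrip("/")))
    # Rename into place so the backup never sees a half-written tar.
    os.replace(tmp_path, tar_path)
    return True
```

A wrapper could call this once per recently active user, possibly from several worker processes, with the cache directories themselves excluded from the normal backup. A production version would also need to tolerate cache files that disappear while the tar is being written. Restoring a cache then means extracting the tar after the home directory has been restored.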

If the browser cache is being backed up for forensic purposes, this can often be accomplished by only backing up the cache index rather than the cache data itself, or by using a caching proxy server and backing up its log files.