DoC Computing Support Group


How to ensure your vital data is safely backed-up

Philosophy

CSG's standard model for reliable data storage is straightforward:

  • The data on desktop and laptop computers is considered expendable. Hence the local disks on these machines must not contain the only copy of any vital data, but only transient working copies; if the machine is lost, stolen, or fails, then it should be possible for the owner to restore the data previously stored on it, or reconstruct it programmatically, from some other reliable source.

  • All data considered vital should be kept (or mirrored) on central CSG network file-storage that we back up for you; see the next section for details.

If you choose to step beyond this model, you are responsible for backing up data yourself.

(Please note that these classifications have no bearing on data confidentiality. Whilst important, protecting data from unauthorised access is a topic outside the scope of this document.)

What we do back-up

The only filesystems that we back up are the following:

  • Your network home-directory -- i.e. H:\ as accessed from CSG Windows machines, /homes/$USERNAME as accessed from CSG Linux machines.

  • Other network-accessible file-storage areas, i.e. most shared volumes of the form /vol/NAME on Linux, e.g. /vol/dse, /vol/lesc and /vol/www. A few /vol areas, such as /vol/bitbucket, have been designated as expendable and so are not backed-up, and there are a small number of other special cases that have case-specific backup arrangements. Most shared volumes are backed up to both online and tape backups, though a few are so big that we only do weekly tape backups.

You can check this secure page to see your personal backup report. It shows when we last successfully backed up your home directory, and whether or not we currently back up your favourite shared volume (and if so, when we last did so successfully), to:

  • Both copies of the nightly on-line backups, /vol/recover1 and /vol/recover2.

  • Weekly tape backups.

So, if you store your vital data in your home directory, or a shared volume, and check that we are backing it up, your data is our responsibility and should be safe.

Things which are not backed-up

  • Laptop hard-drives.
  • Desktop hard-drives, even when they're running a standard CSG installation. (This includes /data on CSG Linux machines.)

  • Server hard-drives, even when they're running a standard CSG installation.

  • Network file-stores which have been designated as expendable, e.g. /vol/bitbucket.

Implications for users of laptops and custom desktop installations

All CSG Linux installations use, by default, your main network file-store (i.e. /homes/$USERNAME) as your local home-directory. As a result, anything you store there is backed-up. Similarly, CSG Windows installations provide access to your main home-directory via drive H:\.

However, data that you choose to store on your desktop computer's hard disk is not backed up. In particular, if you've reinstalled your desktop with your own custom build, or are using your own laptop, the machine will not use any of these network shares by default -- meaning that you need to organise some suitable backup mechanism for any vital data you're working on. The rest of this document describes ways of performing such backups.

Backup strategies for standalone / custom machines

Mounting and working out of an existing reliable network file-store

If the computer you're using is constantly connected to the DoC network, then the ideal strategy for keeping your data safe is to mount some suitable network file-store onto your machine, and then conduct your work out of that area directly.

This may not be practical for some high-performance computing applications, where raw I/O performance or storage capacity become a factor; however, it is by far the simplest option.
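
If you take this approach, it is worth confirming that your working directory really does live on the network file-store, rather than on the local disk underneath a mount that has failed. The following is a minimal Python sketch of such a check, not a CSG-provided tool; the function names are our own, and on an automounted system the actual mount point may differ from the directory you expect.

```python
import os


def mount_point(path):
    """Return the mount point of the filesystem that contains path."""
    path = os.path.realpath(path)
    # Walk up towards the root until we reach a directory that is a mount point.
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path


def is_under(path, expected_mount):
    """True if expected_mount is mounted and path lives on that filesystem."""
    return (os.path.ismount(expected_mount)
            and mount_point(path) == os.path.realpath(expected_mount))
```

For example, is_under(os.getcwd(), "/homes/alice") would be expected to return False if the (hypothetical) mount /homes/alice were missing and you were in fact writing to the local disk.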

Copying data to a durable network file-store

If the computer you're using is Internet-connected, but for some reason mounting a DoC network file-store directly is not possible (for example, if you're outside of the College firewall, or if you're only network-connected intermittently) then the next best thing would be to periodically copy your working data-set to one of the CSG network file-stores. It's really important that you set your backups to run automatically, e.g. via cron; in our experience, backup systems which aren't automated tend to be neglected.

Tools which you could consider using include:

  • rsync over ssh, preferably using pre-installed public-keys for authentication.

  • If you store your data in a revision-control system which supports distributed operation, such as git or (at a stretch) subversion, then such systems natively support various modes of networked operation, and can make backups a great deal easier. Store the main repository on one of our backed-up network-accessible filesystems.

  • Multi-computer synchronisation tools like unison.

  • Cloud synchronisation tools like Dropbox (available as a standard feature on our current Linux desktop builds).
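
As an illustration of the first option, here is a minimal Python sketch that wraps rsync over ssh so that it can be run unattended. The local directory, login and host name (shell.example.org) are hypothetical placeholders, not real CSG settings; the rsync options used (-a, -z, -e ssh, --delete) are standard. The script prints the command by default and only runs it when given --run, which is our own convention for this sketch.

```python
#!/usr/bin/env python3
"""Copy a local working directory to a network file-store using rsync over ssh."""
import shlex
import subprocess
import sys

# Illustrative assumptions only: substitute your own directory, login and host.
# The trailing slash on LOCAL_DIR tells rsync to copy the directory's contents.
LOCAL_DIR = "/home/alice/thesis/"
REMOTE = "alice@shell.example.org:/homes/alice/backups/thesis/"


def build_rsync_command(local_dir, remote, delete=False):
    """Return the rsync argument list for a one-way copy from local_dir to remote."""
    cmd = ["rsync", "-az", "-e", "ssh"]
    if delete:
        # Also remove remote files that no longer exist locally.
        cmd.append("--delete")
    return cmd + [local_dir, remote]


def main(argv):
    cmd = build_rsync_command(LOCAL_DIR, REMOTE)
    if "--run" not in argv:
        # Default to a dry run, so the command can be inspected before first use.
        print(shlex.join(cmd))
        return 0
    # With public-key authentication in place, this needs no password prompt.
    return subprocess.run(cmd).returncode


if __name__ == "__main__":
    status = main(sys.argv[1:])
    if status != 0:
        sys.exit(status)
```

Assuming the script were saved as /home/alice/bin/push_backup.py (again a hypothetical path), a crontab entry such as the following would run it every night at 02:00:

    0 2 * * * /usr/bin/python3 /home/alice/bin/push_backup.py --run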

Copying data to separate media

A not-uncommon strategy for standalone hosts is for users to manually copy their vital data to USB memory sticks or removable hard-drives. Whilst simple and straightforward -- and when done carefully, effective -- there are several hazards to avoid:

  • USB memory sticks are relatively easy to lose, which can cause data confidentiality issues.
  • Each backup run requires manual actions by the owner of the machine. (See the comment about the importance of automated backups, above.)
  • It is possible for the computer being backed-up to corrupt or destroy both its own live data and its backups whilst the drive is plugged in. For example, we've seen cases where a user has accidentally overwritten the live data-set with an old backup, or where a laptop in the middle of a backup run crashed, resulting in both the live data-set and the backup being lost.

As a result, while this method is useful for providing an extra level of protection in addition to some other backup mechanism, we generally advise against its exclusive use.

If you do decide to use this method, we suggest taking the following precautions:

  • Use more than one removable device, and rotate your backups around them in sequence.
  • Use a script or program to back up data of interest, rather than using manual copy instructions, so as to avoid mistakes and to ensure that everything of interest is preserved. (Apple's Time Machine software may be suitable for this purpose for those running Mac OS X.)
  • If possible, keep multiple historical versions of your backed-up data on your removable device(s), in a manner similar to CSG's online backup system. This can be helpful should you discover that you need a historical, rather than a current, version of a file. Some existing revision-control tools such as git may be particularly suitable for this purpose.
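
The last two precautions can be sketched in a few lines of Python. This is an illustrative example under our own assumptions, not a description of how CSG's online backup system works: each run copies the source directory into a new timestamped snapshot directory on the removable device and then deletes all but the newest few snapshots. The function name, the snapshot naming scheme and the ".partial" suffix are all hypothetical choices.

```python
import shutil
import time
from pathlib import Path


def snapshot(source, backup_root, keep=5, stamp=None):
    """Copy source into a new timestamped directory under backup_root.

    Only the newest `keep` snapshots of this source are retained.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    source = Path(source)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)

    if stamp is None:
        stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"{source.name}-{stamp}"

    # Copy under a temporary name first, so that a crash part-way through
    # never leaves behind something that looks like a complete snapshot.
    partial = target.with_name(target.name + ".partial")
    shutil.copytree(source, partial)
    partial.rename(target)

    # Timestamped names sort chronologically, so the oldest come first.
    snapshots = sorted(
        p for p in backup_root.iterdir()
        if p.is_dir()
        and p.name.startswith(source.name + "-")
        and not p.name.endswith(".partial")
    )
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

For example, snapshot("/home/alice/thesis", "/media/usb1/backups", keep=5) would add one snapshot to a (hypothetical) stick mounted at /media/usb1. Because existing snapshots are never overwritten, an old backup cannot silently replace a newer one; the script only ever reads from the live data-set.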

Emailing important data to yourself

Whilst not suitable for large amounts of data, a common and effective strategy is for you to email copies of documents and other comparatively small files to yourself -- either to your Imperial email account, or to some third-party email provider, such as Gmail. Whilst there are significant dangers with using this strategy for highly sensitive documents -- you should not consider email to be secure from tampering or eavesdropping -- this can be an immensely effective strategy for backing up highly valuable documents, such as a PhD thesis.

In fact, whilst we take a lot of care when looking after your data -- keeping multiple copies online and offline, keeping offline copies in a secure, fire-resistant safe, etc. -- there are some rare classes of event which, while highly unlikely, would destroy all of the copies of your data held by CSG but would not affect an agency like Google, which has an international distributed storage infrastructure.
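
If you want to automate emailing a file to yourself, the following minimal Python sketch uses only the standard library (email.message, mimetypes and smtplib). The address, the SMTP host name and the use of STARTTLS on port 587 are assumptions for illustration; check your own mail provider's settings before relying on them.

```python
import mimetypes
import smtplib
from email.message import EmailMessage
from pathlib import Path


def build_backup_email(address, attachment_path):
    """Build a message from address to itself with the given file attached."""
    path = Path(attachment_path)
    msg = EmailMessage()
    msg["From"] = address
    msg["To"] = address
    msg["Subject"] = f"Backup: {path.name}"
    msg.set_content(f"Backup copy of {path.name} attached.")

    # Guess the MIME type from the file name; fall back to a generic binary type.
    ctype, encoding = mimetypes.guess_type(path.name)
    if ctype is None or encoding is not None:
        ctype = "application/octet-stream"
    maintype, subtype = ctype.split("/", 1)
    msg.add_attachment(path.read_bytes(), maintype=maintype,
                       subtype=subtype, filename=path.name)
    return msg


def send(msg, smtp_host, smtp_port=587, username=None, password=None):
    """Send msg via an SMTP server; host, port and login are assumptions."""
    with smtplib.SMTP(smtp_host, smtp_port) as server:
        server.starttls()
        if username:
            server.login(username, password)
        server.send_message(msg)
```

A typical call would be send(build_backup_email("alice@example.org", "thesis.pdf"), "smtp.example.org"), where both the address and the host are placeholders.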

Publish it on the Internet

If the data you wish to preserve is not secret, then you should definitely consider publishing it on the world-wide web, e.g. via the Departmental personal web hosting services, or using the Departmental Research Publications (Pubs) service. There are a number of organisations, such as Google and archive.org, who index the entire visible web and archive it.

Additionally, if your publications are interesting, then they are likely to be mirrored by other interested parties on the Internet regardless.

As Linus Torvalds famously once wrote:

"Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)"

(That said, please ensure that you have all of the necessary rights to publish the material of interest on the Internet -- and ensure that, as always, you comply with the appropriate regulations!)

 
 

guides/file-storage/backup-strategy (last edited 2011-10-26 15:08:29 by dwm)