## page was renamed from guides/file-storage/standalone
= How to ensure your vital data is safely backed-up =

== Philosophy ==
CSG's standard model for reliable data storage is straightforward:

 * The data on desktop and laptop computers is considered '''expendable'''. Hence the local disks on these machines '''must not''' contain the '''only''' copy of any '''vital''' data, but only transient working copies; if the machine is lost, stolen, or fails, then it should be possible for the owner to restore the data previously stored on it, or to reconstruct it programmatically, from some other reliable source.
 * All data considered '''vital''' should be kept (or mirrored) on central CSG [[services/file-storage|network file-storage]] that we back up for you; see the next section for details.

If you choose to step beyond this model, you are responsible for backing up data yourself.

(Please note that these classifications have no bearing on data ''confidentiality''. Whilst important, protecting data from unauthorised access is a topic outside the scope of this document.)

== What we do back-up ==
The '''only''' filesystems that we back up are the following:

 * Your network home-directory -- i.e. `H:\` as accessed from CSG Windows machines, `/homes/$USERNAME` as accessed from CSG Linux machines.
 * Other network-accessible file-storage areas, i.e. most shared volumes of the form `/vol/NAME` on Linux, e.g. `/vol/dse`, `/vol/lesc` and `/vol/www`. A few `/vol` areas, such as `/vol/bitbucket`, have been designated as '''expendable''' and so are not backed-up, and there are a small number of other special cases that have case-specific backup arrangements. Most shared volumes are backed up to both online and tape backups, though a few are so big that we only do weekly tape backups.

By checking [[https://www.doc.ic.ac.uk/csg-res/dynamic/secure/showbackups|your personal backup report]] (a secure page), you can see when we last successfully backed up your home directory, and whether or not we currently back up your favourite shared volume (and if so, when we last did so successfully), to each of the following:

 * Both copies of the nightly on-line backups, `/vol/recover1` and `/vol/recover2`.
 * Weekly tape backups.

So, if you store your vital data in your home directory, or a shared volume, and check that we are backing it up, your data is our responsibility and should be safe.

== Things which are not backed-up ==
 * Laptop hard-drives.
 * Desktop hard-drives, ''even when'' they're running a standard CSG installation. (This includes `/data` on CSG Linux machines.)
 * Server hard-drives, ''even when'' they're running a standard CSG installation.
 * Network file-stores which have been designated as ''expendable'', e.g. `/vol/bitbucket`.

== Implications for users of laptops and custom desktop installations ==
All CSG Linux installations use, by default, your main network file-store as your local home-directory (i.e. `/homes/$USERNAME`). As a result, anything you store there is backed-up. Similarly, CSG Windows installations provide access to your main home-directory via drive `H:\`.

'''However''', if you choose to store data on your desktop computer's hard disk, that data is not backed up; and if you've reinstalled your desktop with your own custom build, or are using your own laptop, the machine will not use any of these network shares by default -- meaning that '''you need to organise some suitable backup mechanism''' for any vital data you're working on.
The rest of this document describes ways of performing such backups.

== Backup strategies for standalone / custom machines ==
=== Mounting and working out of an existing reliable network file-store ===
If the computer you're using is constantly connected to the DoC network, then the ideal strategy for keeping your data safe is to ''mount'' some suitable network file-store onto your machine, and then conduct your work out of that area directly. This may not be practical for some high-performance computing applications, where raw I/O performance or storage capacity becomes a factor; however, it is by far the simplest option.

=== Copying data to a durable network file-store ===
If the computer you're using is Internet-connected, but for some reason mounting a DoC network file-store directly is not possible (for example, if you're outside of the College firewall, or if you're only network-connected intermittently) then the next best thing would be to periodically copy your working data-set to one of the CSG network file-stores.

It's ''really important'' that you set your backups to run automatically, e.g. via `cron`; in our experience, backup systems which ''aren't'' automated tend to be neglected.

Tools which you could consider using include:

 * `rsync` over `ssh`, preferably using pre-installed public-keys for authentication.
 * A [[guides/version-control|revision-control]] system which supports distributed operation, such as `git` or (at a stretch) `subversion`. These tools natively support various modes of networked operation, and can make backups a great deal easier. Store the main repository on one of our backed-up network-accessible filesystems.
 * Multi-computer synchronisation tools like `unison`.
 * Cloud synchronisation tools like [[https://www.dropbox.com/|Dropbox]] (available as a standard feature on our current Linux desktop builds).
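
As an illustration of the `rsync`-over-`ssh` option, here is a minimal sketch of a backup script that could be run from `cron`. It is not a CSG-supplied tool: the function names, the script location and the remote host are hypothetical, and it assumes that `rsync` and `ssh` are installed and that public-key authentication to the remote host is already set up.

```python
import subprocess
import sys
from pathlib import Path


def build_rsync_command(source: Path, remote: str, dest: str) -> list[str]:
    """Build an rsync-over-ssh command copying the contents of `source`
    into `dest` on `remote` (a hypothetical USER@HOST string)."""
    return [
        "rsync",
        "--archive",   # recurse, preserving permissions, times and symlinks
        "--compress",  # compress data in transit
        "-e", "ssh",   # use ssh as the transport
        f"{source}/",  # trailing slash: copy the directory's contents
        f"{remote}:{dest}/",
    ]


def run_backup(source: Path, remote: str, dest: str) -> None:
    """Run the copy, raising CalledProcessError if rsync reports failure."""
    subprocess.run(build_rsync_command(source, remote, dest), check=True)


# Hypothetical crontab entry to run this nightly at 02:00:
#   0 2 * * * python3 $HOME/bin/backup.py $HOME/work USER@HOST /homes/USER/backups/work
if __name__ == "__main__" and len(sys.argv) == 4:
    run_backup(Path(sys.argv[1]), sys.argv[2], sys.argv[3])
```

This sketch deliberately omits rsync's `--delete` option, so files removed locally remain on the file-store; whether that is desirable depends on your data. Because the script exits with an error when `rsync` fails, `cron` can be configured to email you about failed runs.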

=== Copying data to separate media ===
A not-uncommon strategy for standalone hosts is for users to manually copy their vital data to USB memory sticks or removable hard-drives. Whilst simple and straightforward -- and when done carefully, effective -- there are several hazards to avoid:

 * USB memory sticks are relatively easy to lose, which can cause data confidentiality issues.
 * Each backup run requires manual actions by the owner of the machine. (See the comment about the importance of automated backups, above.)
 * It is possible for the computer being backed-up to corrupt or destroy both its own live data ''and'' its backups whilst the drive is plugged in. For example, we've seen cases where a user has accidentally overwritten the live data-set with an old backup, or where a laptop in the middle of a backup run crashed, resulting in both the live data-set ''and'' the backup being lost.

As a result, while this method is useful for providing an ''extra'' level of protection in addition to some other backup mechanism, we generally advise against its exclusive use.

If you do decide to use this method, we suggest taking the following precautions:

 * Use more than one removable device, and rotate your backups around them in sequence.
 * Use a script or program to back up data of interest, rather than using manual copy instructions, so as to avoid mistakes and to ensure that everything of interest is preserved. (Apple's Time Machine software may be suitable for this purpose for those running MacOS X.)
 * If possible, keep multiple historical versions of your backed-up data on your removable device(s), in a manner similar to CSG's online backup system. This can be helpful should you discover that you need a historical, rather than a current, version of a file. Some existing revision-control tools such as `git` may be particularly suitable for this purpose.
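
The precautions above (scripted copies, and several historical versions per device) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `snapshot` function, the `snapshot-` directory prefix and the default of five retained versions are all hypothetical choices, not a description of how CSG's own backups work.

```python
import shutil
from datetime import datetime
from pathlib import Path


def snapshot(source: Path, backup_root: Path, keep: int = 5) -> Path:
    """Copy `source` into a new timestamped directory under `backup_root`,
    then remove the oldest snapshots so that at most `keep` remain."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    # Zero-padded timestamps sort chronologically when sorted by name.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    target = backup_root / f"snapshot-{stamp}"
    # copytree refuses to write into an existing directory, so an earlier
    # snapshot is never overwritten by a later run.
    shutil.copytree(source, target)
    snapshots = sorted(p for p in backup_root.glob("snapshot-*") if p.is_dir())
    for old in snapshots[:-keep]:
        shutil.rmtree(old)
    return target
```

To rotate between devices, call `snapshot` with a different `backup_root` each time, e.g. the (hypothetical) mount points of two separate USB drives on alternate days. Because each run writes to a fresh directory and never touches the live data, a crash mid-run should at worst leave one incomplete snapshot.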

=== Emailing important data to yourself ===
Whilst not suitable for large amounts of data, a common and effective strategy is for you to email copies of documents and other comparatively small files to yourself -- either to your Imperial email account, or to some third-party email provider, such as GMail.

Whilst there are ''significant'' dangers with using this strategy for highly sensitive documents -- you ''should not'' consider email to be secure from tampering or eavesdropping -- this can be an immensely effective strategy for backing up highly valuable documents, such as a PhD thesis. In fact, whilst we take a lot of care when looking after your data -- keeping multiple copies online and offline, keeping offline copies in a secure, fire-resistant safe, etc. -- there are some rare classes of event which, while highly unlikely, would destroy all of the copies of your data held by CSG but would not affect an agency like Google, which has an international distributed storage infrastructure.

=== Publish it on the Internet ===
If the data you wish to preserve is not secret, then you should ''definitely'' consider publishing it on the world-wide web, e.g. via the Departmental [[guides/web/personal|personal web hosting services]], or using the Departmental [[https://pubs.doc.ic.ac.uk/|Research Publications (Pubs)]] service. There are a number of organisations, such as Google and [[http://www.archive.org/|archive.org]], who index the ''entire'' visible web and archive it. Additionally, if your publications are interesting, then they are likely to be mirrored by other interested parties on the Internet ''regardless''.

As [[http://en.wikiquote.org/wiki/Linus_Torvalds|Linus Torvalds]] famously once [[http://groups.google.com/groups?selm=Pine.LNX.3.91.960720095713.20645F-100000@linux.cs.Helsinki.FI|wrote]]:

 ''"Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)"''

(That said, please ensure that you have all of the necessary rights to publish the material of interest on the Internet -- and ensure that, as always, you comply with the appropriate [[regulations|regulations]]!)