I work at the Wellcome Trust Sanger Institute, where we do a lot of processing of raw sequence data, particularly from the Next Gen sequencing machines. In any given month we can have 320Tb of data sitting around on our sequencing farm. The run data then gets tarballed up and moved off into repositories.
Currently, Gouying and I have been working on a MySQL database to store the QC data from the runs, so that we can get rid of the tarballs (each around 60Gb) and allow easier access to the data. Note that this is not the DNA sequence itself; that is being stored elsewhere. The expected annual insert into this database is likely to be around 50Tb at current Illumina machine levels (and we are planning to get more!).
Roger and others have been working on storing the images for the runs longer term. This has meant moving them into an Oracle database, which means that we should have them for perhaps a year rather than around 3 weeks. That is around 100 (tiles) * 36 (cycles) * 2 (paired-end run) * 8 (lanes on a chip) = 57,600 TIFF images per run.
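The multiplication above can be sketched in a few lines of Python (the factors are the ones from the run description; anything else here is just for illustration):

```python
# Images produced by one Illumina run, from the factors above.
tiles = 100   # tiles per lane
cycles = 36   # sequencing cycles (read length)
paired = 2    # paired-end run
lanes = 8     # lanes on a chip

images_per_run = tiles * cycles * paired * lanes
print(images_per_run)  # 57600 TIFF images per run
```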
Speaking with Ed at work, we discussed what was mentioned at our last town meeting. Ed used to work for IBM, and he talked me through a bit of how processors have developed and why they are now likely at the limit of their capability, hence the move to multi-core machines. He raised a good question at the town meeting: is it cheaper to store the data from the runs long term, or just to rerun the sequencing?
At the moment, it costs 2% of the cost of a run to keep the data on disk, so that answers the question. Or does it?
Linus Torvalds has been quoted as saying that disk is cheap, and yes, upgrading your desktop/laptop to 4x the memory is pretty cheap. However, you still need to ensure you have space for it. And one thing is certainly true: space isn't cheap. All those disks have a physical footprint (and we could very soon run out of space in the data centre).
They also have to be managed, and people aren't cheap. We have a very good systems team, who are on 24hr callout, but that costs money.
So it is very much a trade-off. Resequencing is very expensive at the moment and storage of the data is cheap, but if next gen sequencing costs come down, it may become cheaper just to store a plate of DNA and resequence it every time you want data from it, rather than storing 4Tb+ of data long term.
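That trade-off can be put as a rough break-even sketch: storing wins while the cumulative storage cost stays below the cost of resequencing on demand. The run cost below is a hypothetical placeholder; the 2% annual storage figure is the one quoted above.

```python
def cheaper_to_store(run_cost, storage_cost_per_year, years, accesses):
    """Rough comparison: keep the data on disk vs resequence on demand.

    run_cost              -- cost of one sequencing run (hypothetical figure)
    storage_cost_per_year -- annual cost of keeping the run's data on disk
    years                 -- retention period in years
    accesses              -- how many times the data is needed in that period
    """
    total_storage = storage_cost_per_year * years
    total_resequencing = run_cost * accesses
    return total_storage < total_resequencing

# With storage at 2% of a run's cost per year, storage wins over a decade
# even if the data is only ever needed once (run cost of 10000 assumed):
print(cheaper_to_store(10000, 0.02 * 10000, 10, 1))  # True: 2000 < 10000
```

If sequencing costs fall far enough that the annual storage cost approaches the cost of a run, the comparison flips and resequencing becomes the cheaper archive.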
This may be a long way off, but if the $1000 genome becomes a reality, then I think that it may change. Is this a good thing? We shall see.