Saturday 18 October 2008

Disk is cheap?

I work at the Wellcome Trust Sanger Institute, where we do a lot of processing of raw sequence data, particularly from the Next Gen sequencing machines. Over any given month we can have around 320TB of data sitting on our sequencing farm. The run data then gets tarballed up and moved off into repositories.

Gouying and I are currently working on a MySQL database to store the QC data from the runs, so that we can get rid of the tarballs (each around 60GB) and allow easier access to the data. Note that this is not the DNA sequence itself; that is being stored elsewhere. The expected annual insert into this database is likely to be around 50TB at current Illumina machine levels (and we are planning to get more machines!).

Roger and others have been working on storing the images from the runs longer term. This has meant moving them into an Oracle database, which means that we should have them for perhaps a year, rather than around 3 weeks. That is around 100 (tiles) * 36 (cycles) * 2 (run pair) * 8 (lanes on a chip) = 57,600 TIFF images per run.
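
Just to put a rough number on the volume that implies, here is a quick back-of-the-envelope sketch; the per-image size is a guess purely for illustration, not a measured figure:

```python
# Back-of-the-envelope image count per run.
tiles, cycles, run_pair, lanes = 100, 36, 2, 8
images_per_run = tiles * cycles * run_pair * lanes
print(images_per_run)                                 # 57,600 TIFF images per run

# Hypothetical average TIFF size, purely for illustration.
assumed_mb_per_image = 0.5
print(images_per_run * assumed_mb_per_image / 1024)   # ~28 GB of images per run
```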

Speaking with Ed at work, we discussed what was mentioned at our last town meeting. Ed used to work for IBM, and he talked me through a bit of how processors have developed and why they are now probably at the limit of their capability, hence the move to multi-core machines. He therefore raised a good question at the town meeting: is it cheaper to store the data from the runs long term, or just to rerun the sequencing?

At the moment, it costs 2% of the cost of a run to keep the data on disk, so that answers the question. Or does it?

Linus Torvalds has been quoted as saying that disk is cheap, and yes, upgrading your desktop or laptop to 4x the memory is pretty cheap. However, you still need to ensure you have space for it, and one thing is certainly the case: space isn't cheap. All those disks have a physical footprint (and we could very soon run out of space in the data centre).

They also have to be managed, and people aren't cheap. We have a very good systems team on 24-hour callout, but that costs money.

So it is very much a trade-off. Resequencing is very expensive at the moment and storing the data is cheap, but if next-gen sequencing costs come down, it may well become preferable just to store a plate of DNA and resequence it every time you want data from it, rather than keep 4TB+ of data in long-term storage.
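
As a very rough sketch of how that trade-off might play out: only the "storage is 2% of a run" figure comes from above; treating it as a per-year cost, and the rate at which sequencing gets cheaper, are made-up assumptions purely for illustration.

```python
# Toy comparison: keep the run data on disk vs. resequence on demand.
run_cost = 100.0                    # arbitrary units: cost of one run today
storage_per_year = 0.02 * run_cost  # ~2% of a run, assumed to be a per-year cost
halving_years = 2.0                 # assumed: sequencing cost halves every 2 years

for years in (1, 3, 5, 10):
    keep_on_disk = storage_per_year * years
    resequence = run_cost * 0.5 ** (years / halving_years)
    better = "store" if keep_on_disk < resequence else "resequence"
    print(f"{years:2d} yrs: store {keep_on_disk:5.1f} vs resequence {resequence:5.1f} -> {better}")
```

With these illustrative numbers, storing wins for the first few years, but once sequencing gets cheap enough it flips the other way.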

This may be a long way off, but if the $1000 genome becomes a reality, then I think the balance may change. Is this a good thing? We shall see.

6 comments:

RobHu said...

Don't forget that the cost of storage reduces over time as well.

One of the biggest costs for storage is going to be the capital cost of the datacenter, but you've already built that.

Hard disks continue to get bigger for the same price, but the rate at which their size increases is slowing. So that's bad.

But on the other hand, Solid State Disks are now ramping up. You can get 128 and 256GB SSDs now. In 5 years, terabyte or multi-terabyte SSDs will be the standard, I would have thought.

One of the big advantages of SSDs (there are many) is that they use less power than hard disks (well, there is some debate about this - a spun-down disk may use less, but under the kind of usage pattern you have they'd generally use less power). This is relevant because one of the limitations on the datacenter (and on the cost of the datacenter) is power / cooling.

RobHu said...

Also - a big chunk of the datacenter is currently unused (as in it has no equipment in it at all).

Unknown said...

It is true that the data centre has space, but part of that is kept as a 'sandbox' for installing new hardware before putting it into main use.

Part of this was the debate over whether, over a very long term, disk (incl. management and footprint) actually is as cheap as it sounds. As you say, the cost of storage does reduce over time, but I think in the very long term we may find it is a different matter.

Also, there is an environmental cost to updating hardware (and making some of it redundant), which also has an effect when it comes to changing the hard disk that something is stored on.

RobHu said...

When I say there is space I mean there are empty rooms that have no racks or equipment of any kind in them at present.

In terms of the future of storage - the relevant issue seems to be whether the cost / density of storage (as a first-order approximation, the size of disk you get for £x) is improving faster or slower than the amount of data you want to store is growing, relative to your budget. If for x pounds and y space the amount of storage you can buy doubles every three years, but the amount of data you need stored triples every three years, then there is a problem. Otherwise you're OK.
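
A quick sketch with those illustrative rates (doubling storage vs. tripling data, starting from an exactly matched position):

```python
# Storage per pound doubles every 3 years; data to keep triples every 3 years.
storage, data = 1.0, 1.0
for year in range(0, 16, 3):
    status = "OK" if storage >= data else "problem"
    print(f"year {year:2d}: storage x{storage:6.1f}  data x{data:6.1f}  -> {status}")
    storage *= 2.0
    data *= 3.0
```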

In terms of the environment - I don't know how things work at Sanger, but generally at EBI our equipment has a life span of 3-5 years. After that point it's out of warranty / support, and probably isn't designed to work for that long anyway. I think it's just part of how these things work that everything has to be replaced every 2-5 years.

RobHu said...

Another thing that's maybe worth considering...

There is a cost (and possibly a technology) difference between storing the data online, and storing it offline.

Disks in trays in the datacenter are going to be more expensive (in terms of TCO) than disks that live in the datacenter for a bit, then get swapped out and get stored somewhere (or possibly tape).

Obviously then the data isn't 'online', so you need a few hours to a day or so for a techie to go and find the disk / tape in storage and restore it onto the live system (which might mean just plugging it in). Whether that's acceptable or not will depend on how urgently you need access to older data.

Unknown said...

All quite valid points.

I think we generally want our data to remain online indefinitely, which is something that will add to the problem.

Anyway, it's an interesting discussion point, however you look at it :)