Saturday 18 October 2008

Disk is cheap?

I work at the Wellcome Trust Sanger Institute, where we do a lot of processing of raw sequence data, particularly from the Next Gen sequencing machines. In any given month we can have 320TB of data sitting around on our sequencing farm. The run data then gets tarballed up and moved off into repositories.

Currently, Gouying and I have been working on a MySQL database to store the QC data from the runs, so that we can get rid of the tarballs (each around 60GB) and allow easier access to the data. Note that this is not the DNA sequence; that is being stored elsewhere. The expected annual insert into this database is likely to be around 50TB at current Illumina machine levels (and we are planning to get more machines!).

Roger and others have been working on storing the images for the runs longer term. This has meant moving them into an Oracle database, which means that we should have them for perhaps a year, rather than around 3 weeks. That works out at around 100 (tiles) * 36 (cycles) * 2 (run pair) * 8 (lanes on a chip) = 57,600 TIFF images per run.

Speaking with Ed at work, we discussed what was mentioned at our last town meeting. Ed used to work for IBM, and he talked me through a bit of how processors have developed and why they are now likely to be at the limit of their capability, hence the move to multi-core machines. He raised a good question at the town meeting: is it cheaper to store the data from the runs long term, or just to rerun the sequencing?

At the moment, keeping the data on disk costs around 2% of the cost of a run, so that answers the question. Or does it?

Linus Torvalds has been quoted as saying that disk is cheap, and yes, upgrading your desktop/laptop to 4x the storage is pretty cheap. However, you still need to ensure you have space for it, and one thing is certainly the case: space isn't cheap. All those disks have a physical footprint (and we could very soon run out of space in the data centre).

They also have to be managed, and people aren't cheap. We have a very good systems team on 24hr callout, but that costs money.

So, it is very much a trade-off. Resequencing is very expensive at the moment and storing the data is cheap, but if next gen sequencing costs come down, it may well make more sense just to store a plate of DNA and resequence it every time you want data from it, rather than storing 4TB+ of data long term.

This may be a long way off, but if the $1000 genome becomes a reality, then I think that it may change. Is this a good thing? We shall see.

Wednesday 8 October 2008

Ajax - Web 2.0 Primer course.

So last week I went on a four day course to find out more about using Ajax to make Web 2.0 sites. The course was actually two four-day courses squeezed into one four-day slot.

For me, the first course was pretty simple; it was supposedly the prerequisite course for the Ajax primer. We covered the Javascript basics, but in order to move on, the course instructor assumed knowledge of HTML and CSS. For me this was OK, but some people didn't seem too up on using CSS selectors rather than inline markup. However, this did 'get' us through the first course in a day.

We then jumped to the Ajax course. As I have blogged before, earlier this year I bought Pragmatic Ajax: A Web 2.0 Primer (www.pragprog.com). Most of the second course was actually summed up in this book (although I did learn more about some best practices). We went from making a cross-browser compatible XHR object, to starting our own library, and then on to some examples using jQuery, Prototype and Scriptaculous.
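
The cross-browser XHR factory is the classic starting point. This is a minimal sketch of the standard pattern from memory, not the course's actual code:

    // Sketch of a cross-browser XHR factory (the standard pattern,
    // not the course's own code).
    function createXHR() {
        if (window.XMLHttpRequest) {
            // Firefox, Safari, Opera, IE7+
            return new XMLHttpRequest();
        } else if (window.ActiveXObject) {
            // Older IE
            return new ActiveXObject("Microsoft.XMLHTTP");
        }
        return null; // no Ajax support at all
    }

    // Typical asynchronous usage
    var xhr = createXHR();
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            alert(xhr.responseText);
        }
    };
    xhr.open("GET", "/some/url", true);
    xhr.send(null);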

The course was from Learning Tree, and the instructor was very knowledgeable, showing us some other books that we might want to read, primarily Douglas Crockford - JavaScript: The Good Parts (www.oreilly.com) and John Resig - Pro JavaScript Techniques (Apress).

The thing that interested me the most was probably the thing that had the least time spent on it. For my toy app, YADB, I wanted to be able to drag and drop cards from either a card list or an inventory list into a deck container. We were shown how to do this using Scriptaculous (quite fortunate, considering YADB is Ruby on Rails).
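
From memory, it boils down to something like the sketch below; the 'card' class and 'deck' id are made up for illustration, they aren't YADB's actual markup:

    // Scriptaculous drag and drop sketch - element names are invented
    // for illustration, not taken from YADB.
    document.observe('dom:loaded', function () {
        // Make every card draggable; revert if not dropped on a target
        $$('.card').each(function (card) {
            new Draggable(card, { revert: true });
        });

        // The deck container accepts anything with the 'card' class
        Droppables.add('deck', {
            accept: 'card',
            onDrop: function (card) {
                $('deck').insert(card); // move the dropped card into the deck
            }
        });
    });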

Work-wise, it was good to be shown exactly how to black-box functions, and how to pass JSON and objects around.
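
By black-boxing I mean the closure/module pattern that Crockford pushes: hide the state inside a function and expose only a small interface. A generic example of my own, not the course's:

    // Generic black-boxing example: the state lives in a closure and
    // only the returned methods can touch it.
    var counter = (function () {
        var count = 0; // private - invisible outside this function

        return {
            increment: function () { count += 1; return count; },
            toJSON:    function () { return '{"count": ' + count + '}'; }
        };
    }());

    counter.increment();            // 1
    counter.increment();            // 2
    var payload = counter.toJSON(); // '{"count": 2}' - a JSON string to pass around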

There were two disappointing features of the course: the example code snippets did not follow what 'Best Practices Guy' was saying, sometimes even on the same slide, and some of the exercises on the computer didn't match the code in the book.

Also, through no fault of the instructor, somewhere along the line nobody had explained to Learning Tree that most people at my site primarily use Perl, so some of us were a bit stuck when it came to modifying some of the server-side code, as we had little or no experience of the three technologies they had chosen (JSP, PHP and .NET).

Overall, a good course that could have been better, but worth going on.

I must contact Learning Tree about doing the two module tests for them, though.

As an aside to this, why is everything PHP? I had a couple of email links come through about upcoming web app jobs, and everyone wants PHP. I could be a bit biased, but certainly a lot of people I know think that PHP is probably one of the worst technologies to come along for a while. One thing that has been blogged about on Perl Buzz is that you can't file a bug report against any version older than the current release, which seems a bit pointless to me. However, I am not going to go into it now; I clearly just need to learn it. Better hit the bookshelves!