Friday, 27 March 2009

Splitting a Project

The group I am in has a project called sanger-pipeline. Early in the life of the group it was quite small: essentially just a few wrappers around the Illumina Analysis pipeline, so that it hooked into our tracking system, plus scripts to manage the movement of images off the individual Genome Analysers to our Farm.

However, as with most things, it got bigger...

and bigger...

and bigger until it was managing not only the wrappers (which were being maintained by someone in another group) and the movement of images, but also processing other files, loading compressed versions of the images into other databases, deciding when it had everything it expected, deciding exactly which outdated files to delete and when...and managing an Apache webservice!

So, over the last 2 weeks, I have been splitting the project up. This was not a trivial thing to do, as many modules were used in multiple places, but split it I did, into the following:

sanger-pipeline - this now only has wrapper scripts and modules relating to hooking the Illumina Analysis pipeline in to our system.

instrument_handling - this holds the scripts and modules relating to the smooth running of the GAs and the mirroring of their data, plus our controlcentre script, whose primary use is to run these as a special user (although it does do some other things).

data_handling - all these scripts and modules are responsible for managing data on the farm (once it has been mirrored), including checks to ensure that data needing long-term storage is where it should be.

sflogin_web_apps - the management of the Apache server and the scripts that we run on the farm (instead of the usual webblades) to give access to files and directories located there.

At this time, the 4 projects are still linked, inasmuch as the top-level package namespace for the modules is srpipe:: (which means we need to be careful, further down the line, not to accidentally create two things which will conflict) and many modules use modules from some of the other projects (to keep the code as DRY as possible).
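One way to guard against that kind of namespace clash is a small check that scans every project for module files and reports any package name defined in more than one of them. The following is only a rough sketch in Python, not something we actually run: it assumes each project keeps its modules under a lib/ directory (for example lib/srpipe/Foo.pm), and the function names are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def module_name(lib_dir: Path, pm_file: Path) -> str:
    """Turn <lib_dir>/srpipe/Foo/Bar.pm into the package name srpipe::Foo::Bar."""
    return "::".join(pm_file.relative_to(lib_dir).with_suffix("").parts)


def find_conflicts(project_dirs):
    """Return {package name: [projects]} for packages defined in more than one project."""
    owners = defaultdict(set)
    for project in map(Path, project_dirs):
        lib_dir = project / "lib"  # assumed layout: <project>/lib/srpipe/...
        if not lib_dir.is_dir():
            continue  # a project without a lib/ directory defines no modules
        for pm_file in lib_dir.rglob("*.pm"):
            owners[module_name(lib_dir, pm_file)].add(project.name)
    return {name: sorted(projs) for name, projs in owners.items() if len(projs) > 1}


if __name__ == "__main__":
    # The four project names from this post, assumed to be checked out side by side.
    projects = ["sanger-pipeline", "instrument_handling", "data_handling", "sflogin_web_apps"]
    for name, projs in sorted(find_conflicts(projects).items()):
        print(f"{name} is defined in: {', '.join(projs)}")
```

Run from the directory holding the four checkouts, this would list any package that two projects both define; an empty output means no conflicts were found under the assumed layout.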

However, these slight drawbacks are minor compared to the ability now to apply patches and new code to one project without accidentally deploying a broken development version of something else (which was the tricky thing when the wrapper scripts were managed by someone outside the group responsible for, say, the mirroring of data).

So hopefully, now we have a much more stable codebase, and will be able to keep things running much more smoothly (or at least, develop much more fluidly).

The test coverage needs improvement, but this is something that we can now work on further. A fair amount of code had no tests at all beyond one which checked that it compiled, so a revamp of the test suite was certainly in order. With it, we can write tests that ensure the functionality doesn't change where code is reused in different projects.

Exciting times or screams ahead, who knows, but at least with hopefully more manageable project sizes, we should be able to aim for exciting...