Tuesday, 29 December 2009

Pragmatic Thinking and Learning

I have just finished reading Pragmatic Thinking and Learning: Refactor Your Wetware by Andy Hunt. This is a very interesting book, well worth a read by anyone who is interested in:

a) different ways of learning new skills,
b) the way the mind works,
c) the whole left-side/right-side brain debate, or
d) just fancying a change from an ordinary software development textbook.

As I was reading it, a few things came to light that I did 'right' and 'wrong' last year.

Learning Moose and Erlang.

Last year I learnt Moose, and didn't learn Erlang. I set out to do both, but didn't manage it. Why? I am capable: I managed to learn to use OO Perl, Ruby and Rails in three months, so why not Moose and Erlang?

Simple: I had no opportunity to play with Erlang, and I had no defined targets (such as 'I want to write in Erlang using concurrency by ...').

However, with Moose, it was completely different. I wanted to write a pluggable pipeline system using a modern OO Perl framework which others would be able to use easily. I had time to play as I learnt the basics, and then to start building the structure of the framework as I learnt more.

Allowing time and scribbling.

I started to turn away from my screen and think a bit more. I also scribbled more: I created more scrap paper this year than the previous one, roughly mapping out where the code went. This worked. Thinking back to when I set up the QC application and database, I just got on and did it. That took a lot longer to achieve than it should have (I got a stern warning from my boss's boss about that!), whereas the pipeline was in theory tougher, but in the end took less than a quarter of the time to get to its initial release.

Both of these positives are described in the book as using R-mode to lead L-mode. Having seen some of this in action last year, I am going to use that to guide me forward next year. I really need to learn some C. If there is likely to be space, I would still like to learn Erlang. However, the key is to be able to set a reason for doing so. I will.

So, first off C. Erlang isn't required for my job at this time, so I'll leave it for now.

My target for C: finish going through the C book I have, then take a C script that my pluggable pipeline sets off, work out how it all works, and see if I can refactor it. In the process, I should also try to write a small program to sort a user-supplied list. This should probably be my first mini-target once I have completed the book.

There are lots of other things in this book to follow up; I think I probably need to reread it. I'll not write more here now (I have a 4-year-old desperate to assemble a Lego Star Wars toy with me), but I shall try to keep my blogging more up to date, and note when it is something from this book which has helped guide me.

Friday, 18 December 2009

Another year passes...

So, another work year has passed. With it including my 10th anniversary at The Sanger Institute, two years as officially a software developer, and a change of team leader, it's been a pretty full year.

So, what have I achieved:

Well, first off, I decided to split up a project. This has helped us a lot, as we have been able to be more agile with changes and bug fixes when dealing with parts of our pipelines and code.

Following the change of leader in our team, we started to spend some time looking at new technologies. I started to take a look at message queues. See Stomping on the Rabbit and Pt2.
This was my first time actually, properly, failing something. It felt good. I got lots of help from monadic and their team down in London relating to RabbitMQ, and we tried ActiveMQ as well. But the reason we failed it: after attempting a comprehensive test suite to cover cases where servers go down, we decided that (at this time) it wasn't stable enough to pass around the data as reliably as we hoped. I spent about three weeks looking at this, but we believe that was time well spent. (Another team started to employ it directly, but continually had problems with their queue poller.)

I also spent some time taking a look at Erlang, but since there didn't seem to be much opportunity to apply it at work, I stopped for the time being.

I then deployed the badminton ladder I had been working on for the Sports and Social Club. See Playing Badminton through Clearpress. This has gone down well, although I do need to make some improvements to the interface.

We had been deliberating about making the analysis pipeline pluggable, so I took this task on, also using it as a chance to take a good look at Moose as an O-O framework. I have pretty much spent the rest of the year doing this.

This has had me really excited; it is probably the thing which has kept me most interested in my job for the last six months, and also one thing which proved I didn't waste my time splitting up the project earlier in the year.

After much discussion of whether to relaunch scripts, or load everything to LSF in one go, we went for something that does as much as it can upfront, then submits jobs sequentially to LSF, with the code being as decoupled as possible. You can see the style we took in the PowerPoint presentation Pluggable Pipelines. It has the power of Moose behind it, utilising Roles, and I even developed my own CPAN extensions, MooseX::File_or_DB::Storage and MooseX::AttributeCloner.

With this pluggable pipeline setup, it is of course now much easier to add new stuff, particularly QC testing. So, we need a way to display the results. The new boss is keen on Catalyst, so I decided now was the time to look at this. Whilst this project has moved on to another team member, I gave it a bit of a start, initially to see how it would deal with getting stuff from a file server rather than a database ('Backending Catalyst'). A couple of book sources later and I had a working app up. In fact, the biggest challenge was trying to apply some CSS that I stole from two Rails apps.

I passed my 10-year anniversary without a hitch, it seems. It is funny to find that I have worked somewhere for 10 years, although, saying that, I have done a number of different roles here, from washing glassware to sequencing DNA, to instituting the widespread use of Transposon Insertion Libraries in finishing, to Quality Control, to Software Development.

And then the slope to Christmas. The pipeline system's aim to be flexible has now been pushed to the limit, with virtually unreported changes to the 3rd-party pipeline which it has had to keep up with - and whilst the code unavoidably has to change, we have spent most of the time discussing the planned changes to cope, and testing, rather than coding.

I've also had to come to the decision that Class::Std is well and truly dead. RIP. You helped me become a better O-O programmer, and you will be remembered with fondness, but Moose has just dealt you too many blows.

Overall, it has been quite a productive, agile year. I'm taking a look at the C language and trying to lobby to get a course run at work on Moose and Catalyst. Hopefully, these will be big things for me next year. I look forward to it.

Merry Christmas, and a Happy New Year to all.

Saturday, 12 December 2009

Speedy does it.

In order to extend the flexibility of the pipeline, I developed MooseX::AttributeCloner (see CPAN) to pass around variables set on the original command line.

However, this has led to the need to redevelop some further loading scripts that had originally been written with Class::Std and Class::Accessor. Switching them to Moose has had a two-fold effect that I have been rather happy with.

1) Less code. The Roles that I had written for the pipeline (but made sure I left as potentially 'common code') could simply be consumed, which ditched a substantial amount of code. Woohoo!

2) Faster. The loading of the data runs exceptionally fast now. Dropping the need to generate extra objects (since code was refactored into consumed roles) has increased the speed of data lookups. I haven't truly benchmarked the loading of the data, but something that was taking a couple of minutes to run through completely now takes mere seconds. In fact, the logs don't show any change in the timestamps printed by each of the print statements.

The longer of the two refactored scripts had a reduction of nearly half its code; the shorter, about 45%. The slow parts of the long script (conversion to XML and database loading) still need some refactoring of how it loads (dropping the XML mid-stage), but it seems that some real improvements have been achieved with the switch to Moose.

Another factor which has sped this up is that, if the pipeline has told the scripts lots of information (such as file system paths, filenames, etc.), then the code doesn't need to work it out again. A definite advantage of combining MooseX::Getopt and MooseX::AttributeCloner.
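As a minimal sketch of that combination (the class and attribute names here are invented for illustration, not the real pipeline code):

```perl
package Pipeline::Base;   # hypothetical base class
use Moose;
with qw{MooseX::Getopt MooseX::AttributeCloner};

# run_folder can now arrive straight from the command line (--run_folder /path)
has q{run_folder} => (isa => q{Str}, is => q{ro}, required => 1);

no Moose;
1;
```

Any loading script consuming the same roles can then be handed a ready-built run_folder via new_with_cloned_attributes, rather than working the path out again from scratch.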

Saturday, 5 December 2009

Always change the branch!

Again, this is one of those posts which shows some numptiness on my part.

Why did some things break when they were working yesterday?

We have a lot of files to pass around, and unfortunately a third-party analysis pipeline keeps changing file locations and filenames, such that we have to remain agile to cope with the changes.

So, we discover this happening and quickly fix our pipeline to cope with a filename change. Deploy and go; everything works as expected.

A new release of the pipeline occurs from the development branch into trunk, and is then deployed. But suddenly, the files aren't being found as expected. What has happened?

Quite simply, I didn't make a corresponding change in the development branch for the original bugfix. Four hours of developer time were spent tracking this down, because I broke my own rule:

Always change the branch / backport to trunk, to ensure that the bug is fixed everywhere.

Sometimes, trying to be too agile - sorry, read: trying to fix a bug fast - just causes the bug to get reimplemented further down the line. Take your time, Brown! A little caution and doing it right will make it better.

Thursday, 26 November 2009

Schoolboy error teaches a lesson.

This is not a slander of Perl::Critic, but more a need to ensure that I remember the following.

Things break when there are no tests!

Perl::Critic threw a paddy over

print STDERR q{Hello World!} or carp;

Because STDERR is not preceded by a *

So in my program I proceeded to be a good little programmer, and put a * in front of all my STDERRs. (Before people ask, this is in a help statement to be output to the user if they didn't provide a run id - not really 'Hello World!' - and it was actually a quickie script to fix an external bug initially.)

In PBP, it mentions the good practice of surrounding *STDERR with braces, but in my case, Perl::Critic didn't throw a violation for this.

Now, here is my problem (my fault): I didn't have a test which at least compiled this script.

Can you guess what happens next? It goes into production, is launched, and fails to compile, since

print *STDERR q{Hello World!} or carp;

isn't allowed. Here it is in a one-liner:

>perl -e 'print *STDERR qq{hello world};'
syntax error at -e line 1, near "*STDERR qq{hello world}"
Execution of -e aborted due to compilation errors.

AAAGGHHH! It is a basic schoolboy error, I'm sure, but it certainly teaches a lesson after around an hour of digging through logs to try to work out exactly why it worked last week, and not this week.
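For reference, this is a sketch of the form that both compiles and follows the PBP advice:

```perl
use Carp qw(carp);

# Braces around *STDERR make the filehandle unambiguous to the parser,
# so this compiles, and Perl::Critic is happy too:
print {*STDERR} q{Hello World!} or carp q{print to STDERR failed};
```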

All braced and ready to fly.

Wednesday, 25 November 2009

Cloning via the command line

I have updated MooseX::AttributeCloner on CPAN today - v0.17.

It has had a few revisions over the last few days: some due to the CPAN testers firing errors in my tests back at me, some due to development requirements.

The latest new thing about it is the method 'attributes_as_command_options'.

This method goes through all of your built, non-private attributes which have an init_arg, and builds a string of them for use on a command line.

The default is to render them as follows:

--Boolean --attr1 val1 --hash_attr key1=val1 --hash_attr key2=val2

but you can request that the values be quoted, that an = sign be placed between arg and val, and that only a single dash is used.

You can also exclude some attributes by supplying an arrayref of 'init_arg's to be excluded.
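As a sketch of how I expect it to be used (the downstream script name is invented for illustration; see the documentation for the exact option names and defaults):

```perl
# $analysis is an object whose class consumes MooseX::AttributeCloner
my $options = $analysis->attributes_as_command_options();
# $options is now something like:
# --Boolean --attr1 val1 --hash_attr key1=val1 --hash_attr key2=val2

# which can be dropped straight into the command line for a downstream job
my $command = qq{perl run_component.pl $options};
```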

For more info, see the CPAN page

or once installed, perldoc MooseX::AttributeCloner

Unlike new_with_cloned_attributes, you cannot supply additional key/value pairs to be added, but then, if you are generating a command line, you can always add these yourself.

I hope that this proves of use to some people. It is certainly helping to reduce code and increase the flexibility of the pluggable pipeline system I have been developing.

Tuesday, 17 November 2009

Refactoring into Roles

So, the pipeline moves along, and then the bossman says:

'I want to be able to run it outside of the normal pipeline locations - how do I do it?'

After a discussion as to why he wanted to do this (since the person normally responsible for testing out new Illumina pipelines often does it in the usual place, moving softlinks as necessary), it turned out that he wanted to run some stuff and dump the output into his home directory.

OK, if that is what you want.

'Oh yes, also, can we make it more generic? A lot of things we have done imply the setup we have here (most of which is generated by the Illumina pipeline), and I would like to move away from that.'

This is a major refactor then: separate business logic from generic logic, and allow everything to be user-defined, should the user want to.

Looking at it, it seems an obvious thing really, but at the time, I have to say, I panicked a bit. The pipeline had been written to be pluggable, so new 'components' can be switched in or out quickly, but I hadn't really planned for it to be used outside of the current setup. The principle is easy enough to apply elsewhere, and should not take long to set up, but the components were specialised to what they represent.

So, the first thing I did: go and get a drink. I don't drink tea or coffee, but popping off for a break seemed like the right thing to do. This break lasted a while, as I thought about strategies to take. Should I go for all-new objects, passing them around? How about pushing them from the command line? MooseX::Getopt seemed to give an option to put any attributes onto the command line, but what about subsequent component objects in the pipeline?

In the end, I wondered about using Moose Roles. The Moose::Manual seemed to suggest that this could be a good way forward: a 'class' that doesn't need to be instantiated or extended, but instead is consumed to become part of the class that you want. So, everything could have all the same attributes, in exactly the same way.

Now, I have to say, I could see an immediate danger here. If everything can be given to everything, then what makes a component any different from the pipeline 'flag waver'? So, I needed to be sensible: the pipelines need to match the components that are going to be launched from them, but not others.

First job: sort out the pipelines. This was quite simple, and needed doing anyway, since the flag waver had a component launcher for every possible component, but most components are only used for one of the three pipelines, so I separated the components out into three pipeline subclasses. This had the added advantage of naming the three pipeline flag wavers as well.

OK, so what's the next thing.

Go through all the components looking for common attributes and methods, and label them as generic or specific. Once I had done this, I created some roles which I refactored these into - sometimes leaving them in the class if they didn't need to be available in multiple classes, otherwise putting them in a role whose name described whether they were generic or business logic, and the type of feature they give.

After that, I just needed to apply roles to classes so that the flag waver had the same attributes as any component classes it would launch. This enables me to have the attribute value in the flag waver, and pass it to anything launched from it.

This left a headache, in that new instantiations would need me to loop through all attributes and pass them through. So, I put together another role to do this (see http://vampiresoftware.blogspot.com/2009/11/moosexattributecloner.html for details).

So, now I'm left with the final problem: how to allow the user to run a pipeline with the options they want. A quick solution was already being used by another member of my team: MooseX::Getopt. This is a great role to apply to a class, which enables any attribute to instantly become a command line option. Since the roles created above give attributes to a class, they become command line options. Hurrah, problem solved.
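An illustrative sketch of the shape this ends up with (all package and attribute names here are invented, not the real pipeline code):

```perl
package pipeline::role::run_info;
use Moose::Role;

# a generic attribute shared by the flag waver and its components
has q{id_run} => (isa => q{Int}, is => q{ro}, required => 1);

package pipeline::flag_waver;
use Moose;
with qw{MooseX::Getopt pipeline::role::run_info};

# pipeline::flag_waver now accepts --id_run on the command line, and any
# component class consuming run_info has the matching attribute to pass it to.
1;
```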

So, after initially thinking that Roles would not be worth much, considering how much subclassing I have normally done and am used to, I'm now convinced that this is a very useful technology.

The result of all this refactoring: over 900 lines of code lost. The value according to sloccount is $35,000 less. The code is now more maintainable, and more useful for users who want to define their own parameters.

Sorry Class::Std, you have just had another nail forcefully bashed into your coffin.

Sunday, 8 November 2009

MooseX::AttributeCloner
I have just released MooseX::AttributeCloner to CPAN. The purpose of this role is to pass all built attribute values from one class object to a newly instantiated object.

The reason for creating this role is our pluggable pipeline system, where we want to be able to pass any parameters from the command line to any of the objects which are responsible for submitting jobs.

The role inspects all the attributes the class has via meta, and checks that each attribute is defined. If it is, it passes the value into a hash, which it then uses in the instantiation of the new object.

my $object_with_attributes = $self->new_with_cloned_attributes('Package::Name');

This will take $self's built attributes, and use them in a hash_ref to create a new object instance of 'Package::Name'.

It checks the init_arg of each attribute in $self, and uses that as the key for the value. If you have a Ref in an attribute, then it will be the same Ref in the new object. (So, in this case, it isn't strictly cloned - please don't be pedantic.)

If you wish to pass through additional information, or use a different value for an attribute that you already have a value for in the current object, then a hash_ref can also be provided with key/value pairs, so that these can be used/added as well:

my $object_with_attributes = $self->new_with_cloned_attributes('Package::Name', {
    attr1 => q{val1}, ...
});

There are also two additional methods provided by this role, 'attributes_as_json' and 'attributes_as_escaped_json'. With these methods, you can get a JSON string with the values of all built attributes, which you could then use elsewhere, such as building a data hash or pushing onto a command line. More info can be found in the documentation, but it is worth noting that objects are not stringified anywhere in the data structure; in the case where one would appear in an array, a null will be found instead.
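A minimal sketch of consuming the role (the class and attribute here are hypothetical, purely for illustration):

```perl
package My::Job;   # hypothetical consuming class
use Moose;
with qw{MooseX::AttributeCloner};

has q{attr1} => (isa => q{Str}, is => q{ro});

package main;
my $job  = My::Job->new(attr1 => q{val1});
my $json = $job->attributes_as_json();  # a JSON string of the built attributes
```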

I hope that this will be found to be useful. Please give me any feedback for this that you wish.

Friday, 6 November 2009

Extending a Role attribute

Not so much a blog post, more a blog question.

When you are 'extend'ing a Moose base class, attributes can be extended:

package Super::Hero;
use Moose;

has q{super_ability} => (isa => q{Str}, is => q{rw});


package My::Hero;
use Moose;
extends qw{Super::Hero};

has q{+super_ability} => (required => 1);


However, in this case, I would like Super::Hero to be a Role to be consumed by some class.

package Super::Hero;
use Moose::Role;

has q{super_ability} => (isa => q{Str}, is => q{rw});


However, the following doesn't work:

package My::Hero;
use Moose;

with qw{Super::Hero};

has q{+super_ability} => (required => 1);


The required 'extension' is just ignored. I have to actually declare the whole attribute again:

package My::Hero;
use Moose;

with qw{Super::Hero};

has q{super_ability} => (isa => q{Str}, is => q{ro}, required => 1);


I can't find anything in the CPAN documentation to confirm or deny that this is the deliberate design. Does anyone have any suggestions as to a solution?

Any help appreciated.

Saturday, 31 October 2009

10 years at The Sanger - have I finally achieved my career goal?

Friday the 30th was my last working day of year 10 at the Sanger Institute. I have worked in four teams whilst there, plus I had been there for 2.5 months temping in the glasswash team before officially starting. Also, in January, I hit the momentous age of 35 (which, according to the biblical reference to threescore years and ten being the life of a man, makes me middle-aged).

So I wonder if I have finally reached what I wanted my career to really be. Over the next month, NaNoWriMo (http://www.nanowrimo.org/) is on, and I have decided that I might just spend it putting my life so far to paper. I don't claim it will be great (I'm fairly average in most things), but I think it will be good to recap and look back: coming from a middle-class background; wanting to be a bricklayer when I played with Lego aged 5, and a computer programmer aged 13 (but unable to find anything at school to help); to being a teacher, a scientist on the Human Genome Project, and a Perl developer.

Who knows, maybe it will interest people, maybe it won't. But a life recap can't be bad.
And maybe it will just make me think, life really isn't too bad, you know.

Sunday, 25 October 2009

Course on Moose and Catalyst

I work at the Wellcome Trust Sanger Institute. In my team, New Pipeline Development, we have identified a need for some training in Moose and Catalyst. If anyone has any course recommendations, I'd be interested in hearing about them to propose to my team and training co-ordinator.

The course can be aimed at people who are experienced Perl developers, but only have a small amount of experience with Moose and Catalyst (as anyone who reads my blog will probably identify).

Replies as comments, or emails, will be appreciated. Many thanks in advance.

Wednesday, 21 October 2009

Does anyone else do this?

I've been away on holiday, and going back over some comments left on my blog, I got this one:

"Is that a new fashion to use q{string} instead of 'string'?"

My response:

"Just a habit I have gotten into since using Perl::Critic and PBP.

As '' and ' ' are not allowed by 'the rulez', and you should use q{} and q{ } (and their double-quote equivalents), I have the coding habit of just using them straight off."

Anyone else doing this as a matter of course, or is it just me? I notice that, going through a number of recent books, ' and " are used in most of the code examples.

Monday, 12 October 2009

How to put your Perl Ironman pic on Blogspot

Go to your dashboard, and under the chosen blog, select Layout.

Under 'Add and arrange page elements', select one of the two 'Add a gadget' links, depending on whether you want it in your sidebar or at the bottom.

Select the HTML/JavaScript gadget.

Give it a title.

In the content box, type an HTML img tag containing the following


replacing setitesuk with your username or nickname (male can be switched to female).

Click save and, hey presto, it will appear the next time you load your blog.

Note: I am sure I am duplicating this post from somewhere. Partly this is for my own info, partly to help anyone else who can't find the details easily (apart from mst's blog post about the URL for the image).

Sunday, 11 October 2009

Is it easier to be specific?

I'm re-assessing some code, trying to refactor it into a re-usable role. Simple, you might think? But is it?

It is very easy to write your code to perform a task on a specific item, or in a specific way. From variable names which mean something tangible, to methods designed to act on a pathway which is unique to your production setup. But, how do you make it more usable?

1 - Variable/method names

I was taught to make my variable/method names mean something. This makes the code more readable.

$donut = q{jam donut};

instead of

$d = q{jam donut};

This is a trivial example, but the principle is there.

However, you (read I) can take it too far. One such point is in directories. For the analysis pipeline, we end up with a directory, after a step called GERALD, called GERALD-date.

In the code, we put this into $gerald_dir. Sounds reasonable. Everywhere I read $gerald_dir, I know exactly what it represents.

However, here is the problem: what happens when the step and directory are renamed Harold? Whilst the principle is the same, and the same files are there, suddenly the variable name is wrong, and just grepping the filesystem won't find anything like Gerald. At this point you are probably screaming at me: give it a semantic name, and document what it represents.

Exactly - but that is easy with internal local variable names, not so with publicly exposed method names. Suddenly I need the role to include a deprecation cycle for the replacement method names. Aaahhh!

2 - Application/Locally specific

Anyone would argue that locally specific logic should exist as far up as possible, leaving it out of the generic process-end logic as much as possible. No arguments there.

But how do you determine which is app-specific, and which is generic?

Obviously, naming conventions are app-specific. Or are they? Many things might need to know how to construct a filename in a particular way.

OK, then how about directory structure? Again, you may find many apps wanting to access the recalibrated data dir. They all need to know how to get there.

This is, as you can see, quite a grey area - one where, as I am finding, the refactor into a generic role still needs to be a little more specific than might at first be thought. You can't get to a directory without at least some knowledge of where it is likely to be. You can't open a file without at least some knowledge of how it is named.

Determining a row in a database needs some way of constructing a query to get it...

Paul Weller said "No one ever said it was gonna be easy", and I wouldn't want it any other way. But remember: if you want your code to be reusable, try to make names semantic, but not specific, and document what they represent.

Also, ensure you document why you process with particular assumptions. At least then, no-one can say you didn't tell them.

That is my plan at least (where I can) from now on.

Saturday, 10 October 2009

Just a bit of info

Whilst I don't have the code here to show this, here is an interesting thing that we found this week regarding a Moose attribute.

has q{attr} => (isa => q{Str}, is => q{rw});

This makes $class->attr() both reader and writer.

Now, it is mentioned in the manual that if you specify an attribute option, it will override the Moose default, but what I was interested in was the error message.

has q{attr} => (isa => q{Str}, is => q{rw}, writer => q{_set_attr});

The error message you get here, if you try to use

$class->attr(q{value});

is that you are trying to modify a read-only attribute. I was expecting something different to this, since 'is => q{rw}'.

Using 'is => q{ro}' still allows you to use your private writer to set the value.

Personally, I always use 'is => q{ro}' unless I want the attr name to be the writer, but it is mildly interesting that setting a writer effectively overrides the declaration of rw to make it ro. Does this make it more or less confusing to the user?

Looking at it from two different perspectives:

Public attribute setting:

Someone inspecting the code for this attribute would see rw, and this would tell them: hey, you can set this. Hopefully they would look for a writer option before diving in to use the attr name itself. I can see the use of declaring rw in order to give that hint.

Private attribute setting:

Someone inspecting the code would see rw, and may assume that this means they should be able to set this attribute. However, the writer should only be used internally to the class, the rw therefore giving a hint that it can, and perhaps should, be overwritten if needed. Of course, as with any situation, let the buyer beware: this functionality may change.
However, declaring ro, you are dropping the hint that, outside of the class, if you didn't set it on construction, you really shouldn't be thinking of doing so.

I'll leave it to someone else to determine a best practice; it was an interesting thing to discover. I shall stick with declaring 'is => q{ro}, writer => q{_set_attr}', since most of the time it is private to the class to set anyway.
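A sketch of that pattern in full (the class and attribute names are invented for illustration):

```perl
package My::Class;   # illustrative
use Moose;

# publicly read-only, with a private writer for internal state changes
has q{status} => (isa => q{Str}, is => q{ro}, writer => q{_set_status});

sub run {
  my ($self) = @_;
  $self->_set_status(q{running});  # fine: uses the private writer
  return;
}

1;
# From outside the class, $obj->status(q{done}) dies with the
# read-only accessor error - which is exactly the hint I want to give.
```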

Monday, 5 October 2009

The Definitive Guide to Catalyst

I have just finished reading The Definitive Guide to Catalyst by Diment and Trout (Apress). As a Catalyst newbie, all I can say is: buy and read this book.

It has lots of helpful advice and guidance, from creating your first application to extending Catalyst, and lots in between.

It also takes a good look at DBIx::Class, Moose and the Reaction framework, plus a good explanation of the MVC pattern.

I think this book is going to prove a valuable addition to our team's bookshelf, and will be referenced for some time to come.

Pluggable pipelines - the return

On Friday, we deployed v5.0 of the new pluggable-style analysis pipeline. This now includes a pluggable analysis pipeline, the archival file creation and QC pipeline, and the archival pipeline.

With this new setup, we can now add (or remove) steps easily in a matter of a few hours (usually actually minutes, but then we need those tests).

If you look back to my previous post, the style is there, and we have found it really works well. It keeps the individual application code separate from the pipeline code which runs it, and even loosely couples individual apps so they run more smoothly.

An additional bonus, which occurred without us even trying, is a significant improvement in run time. The reason is that, since we have queued everything as separate jobs instead of as batched makefiles, LSF runs them as it finds space. This means it doesn't need to find, or earmark, 8 processors with the maximum memory allowance, but just one or two with smaller memory, so they can be assigned more efficiently. This passes stuff through faster (assuming the dependency trees pan out correctly), and so users get the data quicker. Fantastic, and even better since we didn't plan for it.

No time to stop yet though; we have lots more features to add, but since we have managed to move successfully to this more agile structure, I don't see them taking long to deploy.

Here's to agile structures and development.

Saturday, 3 October 2009

Role reversal

So, in our further investigations into using Moose, I have started to look at using Roles to produce reusable code instead of base classes.

You can find out how to set them up in the Moose::Manual pages on CPAN, but what we initially found confusing was how much we should make a role do. So our initial investigations just led us to produce ordinary Moose objects to do the functions, but not import them into the consuming objects.

So, we move further on, and find ourselves repeating code, as we don't want to import some objects which mostly do other things. So what is the solution?

I'm now taking another look at Roles, to see if they might be the solution, but with another key factor: keep each role doing as specific a thing as possible.

An example of this is that we often need an object to have methods exposing a run id, a short run name and a run folder. We do have an object which can provide these, along with other functions, but it is not really reusable in some situations. So I have just written a very small role, run::short_info, which can be used to import just these methods.
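A sketch of the shape such a role might take (the attribute names and types here are my guesses at what was described, not the actual code):

```perl
package run::short_info;
use Moose::Role;

# just the run identity attributes, and nothing else
has q{id_run}         => (isa => q{Int}, is => q{ro});
has q{short_run_name} => (isa => q{Str}, is => q{ro}, lazy_build => 1);
has q{run_folder}     => (isa => q{Str}, is => q{ro}, lazy_build => 1);

# builder bodies omitted - placeholders only
sub _build_short_run_name { return q{}; }
sub _build_run_folder     { return q{}; }

1;
```

Any class that needs just these methods then consumes the role with `with qw{run::short_info};`.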

The result: a small amount of (what should be) very reusable code shared across most, if not all, of our frameworks.

The next one I am doing is run::path_info, which will hopefully give just the next level of features that many need, but not all.

This feels very much the right way to go now, and I hope that we will have a good amount of reusable code which will increase flexibility, and keep the code low maintenance. (Well, let's keep our fingers crossed).

Now just to hope that someone fixes the requires attribute/method issue in Moose::Role soon.

Tuesday, 29 September 2009

system($cmd) vs qx/$cmd/ in tests

I have just found an interesting thing in my test suite.

The other day, I had some silent failures since I was capturing the output, but not checking the error code, of some commands.

my $output = `/some/command/which/normally/works -params`;

Since the output was unimportant, but the command working was, I thought I'd switch as follows

my $rc = system(q{/some/command/which/normally/works -params});
if ($rc != 0) {
    croak q{Meaningful message};
}

All seems fine, until I run the test suite.
t/20-mytests.t ..
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
ok 7
ok 8
00000000ok 9
Failed 1/9 subtests

Test Summary Report
t/20-mytests.t (Wstat: 0 Tests: 8 Failed: 0)
Parse errors: Bad plan. You planned 9 tests but ran 8.

So what is wrong here?

I don't know the ins and outs, but after a bit of debugging it appears that calling system under the test framework causes TAP not to count the test, regardless of whether it passes: the child's output goes straight to STDOUT, right where the TAP parser is reading.

Solution to this:

Go back to

my $output = `/some/command/which/normally/works -params`;
if ($CHILD_ERROR != 0) {    # $CHILD_ERROR is $? (via use English)
    croak q{Meaningful message};
}

Everything is now fine again.
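Another option would have been a small helper combining the two: capture the output (keeping it off STDOUT and away from the TAP parser) and still check $CHILD_ERROR. A sketch, with a made-up helper name:

```perl
use strict;
use warnings;
use Carp;
use English qw(-no_match_vars);

sub run_or_croak {
    my ($cmd) = @_;
    my $output = qx{$cmd 2>&1};    # child output is captured, not printed
    if ($CHILD_ERROR != 0) {
        croak qq{'$cmd' failed (status $CHILD_ERROR): $output};
    }
    return $output;
}

my $out = run_or_croak(q{perl -e "print q{it works}"});
```

The test counts stay correct because nothing the child prints ever reaches the TAP stream.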

Saturday, 26 September 2009

Sometimes there just isn't enough

I'm starting to get confused. (OK, 'starting' is wrong.) What is the problem with more error/debug information rather than less?

I mean, sure, your average user is not going to have the foggiest idea what your error information is, and so a nice helpful message is good. But what about in a system (which has many, many tests) where the average production run takes 1000 times longer than any of your tests, and you get this from Joe User:


I received this error message:

An error occurred in processing your request. Please let us know at line 360 in something.pl

Could you do something about this. I had been running this process for 10 hours.



This is great, except, I know absolutely nothing further. Joe, understandably, reported the error he got (all of it!) and so thinks he has been helpful, but when you look at line 360, it's a croak of an eval block, where all your code is in multiple modules running inside it. AAAHHHH!

From looking around (and I may only have looked at a relatively small subset) people seem to think that letting the user see all of the error stack is a bad thing. Why? Speaking with some users, they don't understand the output, and would like an additional friendly message, but want to be helpful. They just don't know what to say.

My response to get info from Joe above would be:

On which node did you run this?
What parameters did you set/include?
At what time did you start and stop this process?
Did you have an error log?

To which Joe replies:

Parameters: a,b,c
I submitted it to the farm from work_cluster2
My bsub command: ....
Error log: /dev/null (because that is what he was instructed to use)

Great, no further output. I can't reproduce, I can't write a test, I can't track down the problem without running the whole thing myself.

So what is the potential solution:

1) Don't be afraid to show your user what looks like the internal workings of the script. Give them the whole stack output. With a friendly message too, of course.
2) Training to ensure that they write an error log (/dev/null is not a log, but users often don't know that).
3) Training to ensure that they email you, or bug report through RT, the whole stack. If they know it will help you solve their problem, they ought to be happy. (Certainly, users I spoke to said they would be happy to provide 200 lines of 'garbage output' if it meant I could solve a problem even 25% faster.)
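confess from Carp gives you that whole stack essentially for free; a quick sketch of what Joe would then be able to paste (the subroutine names are made up):

```perl
use strict;
use warnings;
use Carp qw(confess);
use English qw(-no_match_vars);

sub load_lane     { confess 'could not parse lane data' }    # deepest frame
sub process_batch { load_lane() }                            # intermediate frame

my $error = q{};
if (!eval { process_batch(); 1 }) {
    $error = $EVAL_ERROR;    # the message plus every calling frame
}
print $error;
```

Instead of one mystery line number, the report names every subroutine on the way down, which is usually enough to reproduce the problem.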

I don't know who suggested it, or who started the culture. It might not be great for commercially sold apps, but certainly where you have internal or free (in its many forms) software, surely you shouldn't be worried about 'what the user might find out', because, I would be 95% certain, they don't actually care!

So big it up for confess and cluck, or even occasional print 'debug info trace points' as they ought to be the way of the future, in dealing with those bugs in a timely fashion.

Friday, 18 September 2009

readdir in the wrong order

This week we found an interesting bug (which seems quite obvious really, but still threw a spanner in the works): using the readdir function doesn't equate to ls'ing a directory, i.e. you can't be sure that the files will come back in alphanumeric order.

So, as you can probably work out, I had a method using this function to obtain a list of files in a directory, and then was passing the list to another program.

This program, although we promise that the two (or three, if multiplexed) files generated for a lane will be in the correct order, doesn't actually do any internal ordering of the list of files passed to it (where was that documented?). So readdir alone, passing the list of files straight through, meant they weren't guaranteed to be in the order needed. AAHHHH!

So, a quick sort on the list, and everything is now fine, but that was a bit of a surprise. So was spending 2 days trying to reorder the files that had been created wrongly, although I now have a script that can do this should it happen again, which on the farm takes only about 10 minutes.

So, as I said, an interesting bug. I will just have to remember that if I do anything with a list from readdir in future, I should run a sort on it afterwards, just in case.
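A quick demonstration, with made-up file names in a throwaway directory, that the sort, not readdir, is what guarantees the order:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
for my $name (qw(s_1_0003.txt s_1_0001.txt s_1_0002.txt)) {
    open my $fh, q{>}, qq{$dir/$name} or die $!;
    close $fh or die $!;
}

opendir my $dh, $dir or die $!;
# readdir returns names in whatever order the filesystem supplies;
# the sort is what makes the order deterministic
my @files = sort grep { $_ !~ /\A[.]/xms } readdir $dh;
closedir $dh or die $!;
print join q{,}, @files;
```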

Saturday, 12 September 2009

My first CPAN module - update

I have been politely requested to rename my first module to MooseX::File_or_DB::Storage, so I have done so and it can now be found here:


It is exactly the same as v0.2, just with a different package name. MooseX::Storage::File_or_DB is scheduled for deletion from CPAN on Wed 16th, so I urge anyone using it to update now.



Monday, 7 September 2009

My first CPAN module

So I have pushed my first ever CPAN module, MooseX::Storage::File_or_DB


I blogged about this as I started the module before


and this weekend I was finally able to finish the first release.

At the moment, you need to use this by extending it:

package MyClass;
use Moose;
extends q{MooseX::Storage::File_or_DB};

But in a future release I hope you will just be able to:

use MooseX::Storage::File_or_DB;

This gives you the functionality to write the object out either to a file as a JSON string, or to a database, and to re-instantiate the object from either.

It makes heavy use of MooseX::Storage ( http://tinyurl.com/nujf4c ) - a big thanks to Tomas Doran for writing this - for inspecting the object, and providing the ability to write out to a file as a JSON string.

I hope that this will prove useful. Please do read the POD/CPAN page before use, and contact me about anything you feel relating to this - all constructive comments gratefully received.

Sunday, 6 September 2009

Backending Catalyst

I am starting to look at Catalyst as a method to display some quality control results that are coming out of our analysis pipeline.
ClearPress is a nice MVC webapp builder, but it is quite a lightweight framework, and uses Class::Accessor as its object base. We would like to move towards using Moose-based objects, and need a way to integrate these into Catalyst.

I am currently working my way through the latest Catalyst book (Diment & Trout), but before this arrived I found we had the following book on our Safari Subscription - Catalyst: Accelerating Perl Web Application Development: Design, develop, test and deploy applications with the open-source Catalyst MVC framework - Jonathan Rockway.

Now, note, I had been through the Tutorial on CPAN, but couldn't find anything there about using a filesystem as a source for the model (did I miss something?), but this book luckily had a section on doing so.

Firstly, why do we have QC data in a filesystem?

When we run the pipeline, this all happens on a staging area, which we write everything to, before copying all our data into long-term archival databases. The QC data is no exception, but we only want to archive the final agreed data. Bioinformaticians don't seem ever to be happy with a first pass that fails, if there is any chance it could be improved (a new test pipeline version, say; could rerunning just squeeze 2% more?). As such we want to view the data in exactly the same way from the filesystem as from the database, because we don't want it stored until the last possible moment.

What have we done for this?

My team have been producing Moose objects which are:

1) Producing the data
2) Storing in JSON files (MooseX::Storage)
3) Reading in JSON files (MooseX::Storage) to re-instantiate the object
4) Saving to a Database (Fey)
5) Re-instantiating from a Database

I've been working with iterations of the objects, using the files, but want the objects to just sort it themselves - I shouldn't know where the data has come from, and these objects should be used in (in fact are being written for) other applications.

Catalyst very much guides you to using a Database, and seems to prefer using DBIx::Class for this, so I need a way of guiding the Model to provide the correct objects, which are not generated directly from Catalyst helpers.

What did I do?

So in the above book I found the section 'Implementing a FileSystem model'. This shows how to create a Backend, which takes us out of the ordinary Model style: the call to the Model returns this Backend object instead. We then use the Backend object to contain the logic for obtaining the objects from somewhere outside the Catalyst application, de-coupling the data models from the app, and therefore increasing flexibility and maintainability. As I said, these objects are actually being written within another application project.
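Stripped of the Catalyst plumbing, the pattern is just that the model hands out a backend object, and only the backend knows where the data lives. A hypothetical pure-Perl sketch (the class and method names are mine, not the book's):

```perl
use strict;
use warnings;

package QC::Backend;    # knows where the data actually lives

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub results_for_run {
    my ($self, $id_run) = @_;
    # in real life: read JSON files from staging, or rows from the archive db
    return $self->{source} eq 'file'
        ? { id_run => $id_run, origin => 'staging filesystem' }
        : { id_run => $id_run, origin => 'archive database' };
}

package MyApp::Model::QC;    # the Catalyst model would simply hand this out

sub backend {
    my ($class) = @_;
    return QC::Backend->new(source => 'file');
}

package main;

my $qc = MyApp::Model::QC->backend->results_for_run(1234);
print $qc->{origin}, "\n";    # the view never knows which source it was
```

Swapping the backend's source from 'file' to the database touches nothing in the view logic.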

This has been an interesting venture, which has enabled me to write a web application which only concentrates on the logic for the view, and leave the data handling completely to someone else. We should be production ready with the application within the week, and displaying data for the users quickly and simply.

What's the betting someone asks if we can regenerate the data for all previous runs? I won't be betting against it, that's for sure.

Sunday, 30 August 2009

FileSystem/DataBase - or both?

I'm starting a new MooseX::Storage module called MooseX::Storage::File_or_DB.

The objective is that you can use the standard MooseX::Storage to serialize out the Moose
object as a JSON string to a file, and read it back again, but also save it out to
a Database so that it can be used in a usual database way (i.e. interrogate the db using
sql directly, so an attribute maps directly to a column).

There are a number of ORM or object-db modules on CPAN (DBIx::Class, Fey, KiokuDB), but with all of them it seems difficult to also save and retrieve the object from a filesystem, for more short-term storage prior to archival to the database.

This is clearly something we need. MooseX::Storage is just the job for storage and retrieval from a filesystem, but we need both linked in one.

So, I'm setting to work on something that will do both. The current work in progress is on GitHub here


I would ask people to take a look and see what they think. The POD is currently where I want to end up, but the tests are working, and I think it is going in the right direction. Looking forward to more work on this to make it ready to submit to CPAN.

Friday, 28 August 2009

Am I too good

For our projects that are released through the Sanger website as Open Source (most of our code is, though we don't have a specific release policy for putting it out there) we run David A. Wheeler's sloccount to get a count of the lines of code.

I have just done release-3.0 of the pluggable pipeline system, so I thought it might be fun to get the stats of this.

Total Physical Source Lines of Code (SLOC) = 3,884

That's good, I've been working on this for 7 weeks, with other projects.

However, sloccount gives you further info:

Development Effort Estimate, Person-Years (Person-Months) = 0.83 (9.98)
(Basic COCOMO model, Person-Months = 2.4 * (KSLOC**1.05))

About 10 months development - I'd have been shot if what I have produced had taken that long :)

Schedule Estimate, Years (Months) = 0.50 (5.99)
(Basic COCOMO model, Months = 2.5 * (person-months**0.38))

6 months scheduling.

Estimated Average Number of Developers (Effort/Schedule) = 1.66

There's only me, and I've only been working for 7 weeks on this project!

Total Estimated Cost to Develop = $ 112,301
(average salary = $56,286/year, overhead = 2.40).

I need to ask for a raise!

Obviously, there has been discussion with other people about where the project is heading, but I think the Basic COCOMO model clearly doesn't quite cut the mustard with Agile Development practices. Or maybe I'm just too good.

Still, it is fun to watch my boss run screaming, refusing to listen when I quote the estimated cost to develop. At least I think I am worth at least what I am paid :)

'Non-'Unique indexing

We wanted to make a column in our MySQL database table (using InnoDB) nullable, but use the column as part of a composite key.

unique key C1,C2,C3,C4

C4 can be null

C1 C2 C3 C4

enter the following:

x y z a

goes into the table ok

enter those again - error that we break unique constraint.

This is as expected.


x y z null

goes into table ok

enter those again - they also enter fine, and select * from table shows two separate row entries.

So, basically, you can have a nullable column in a composite unique index, but when the column is null, you lose the unique index checking.

I don't know about other DBs, but it is a shame that the null can't be part of the uniqueness of the index.

Wednesday, 12 August 2009

Musings on a Moose

come to sweden, see the majestic moose (paraphrased from Monty Python and the Holy Grail).

I am starting to look at Moose as an alternative to Class::Std and other modules for OO Perl. It has very good support from the community, and looks to be fast becoming the framework of choice.

I will start off by saying I like it. The setup of my::module is very easy, and reads very cleanly. I like the declaration of variable types on the accessors, and it is obvious when an attribute is needed on new (just set the 'required' flag). I especially like the ability to make an attribute ro.

However, that leads me to a small bugbear. I have mentioned encapsulation before: Class::Std objects are blessed scalars, and as such can't have keys. Your attributes are set up as keys on internal hashes, and so can only be exposed via a method.

The Moose object created is a blessed hash, and the attributes are stored in its keys. This means that a user can still override a ro attribute by just poking at the key. Encapsulation is broken.
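This isn't specific to Moose; any blessed hash exposes its keys. A plain-Perl demonstration of the problem:

```perl
package Counter;
use strict;
use warnings;

sub new   { return bless { count => 0 }, shift }
sub count { my ($self) = @_; return $self->{count} }    # intended as read-only

package main;

my $c = Counter->new;
$c->{count} = 999;        # nothing stops a caller reaching straight in
my $value = $c->count;    # the 'read-only' accessor now returns 999
```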

Someone in my office used a good analogy here, which I think he attributed to Larry Wall:

"Your neighbour stays out of your garden, because it is the right thing to do, not because you have a shotgun."

That might be the case, but I'd rather enforce the use of accessors from the start than run the risk of users using the key. However, I plan to try out MooseX::InsideOut to see if it can enforce this, once I'm more used to the basics.

Another thing I find very good are the additional modifiers for the accessors. We have found that making use of predicates and lazy_build really improves the code layout and speed of object creation.

I have now been building the pluggable pipeline project using Moose, and it has been very quick to build the objects and code. I would say that once up to speed, and once I have tested out MooseX::InsideOut, development time should come down by about 10% and code maintainability should go up by about 25%.

So a big thumbs up from myself, and my development team also. Whilst I think at the moment there would be no plans to convert our ClearPress based apps to Moose, I think I'm tempted to try to switch my Class::Std modules, and certainly all new projects will go that way.

Big Thanks to all the people who work on Moose. Is there a book in the works? I for one would get it, I'd even contribute if you'd like.

Friday, 31 July 2009

Pluggable Pipelines

My current main project is to replace the end of our analysis pipeline. It is currently all in one module, archive.pm. As with all things which are in one single file, this has been extended and changed around, and has become unmanageable.

We have also decided that we want to make it pluggable, so that we can easily add or remove parts depending on the project requirements.

For this I looked at creating a flag waver, whose job it is to take an array of function names and launch each in turn, capturing any return values and passing them on as requirements that the next function may need.

The functions have no knowledge of anything except what requirements may come in, what information the object they control requires and how to capture and return further requirements for any processes further down the line.

Most importantly, the functions know nothing about any other function.

Each function loads an object and gives it the parameters it needs. These objects handle doing any real work, which can be updating statuses, submitting jobs to LSF, manipulating files, or obtaining data from databases/web services. All they have to return to the function which called them is whatever the next process might need to know for its own work: in our case, an array of job ids from LSF submissions, to be used as job dependencies.

The structure is therefore a flagwaver, which calls the functions in a user specified order, which in turn call objects submitting jobs, returning dependencies for the next function.

What this means is that the flagwaver has very little responsibility itself. It relies on the user ensuring that any individual components will complete successfully (or at least error sensibly), and that the user has specified an order of components which will work (i.e. if 'B' depends upon an output of 'A', then they have put 'A' before 'B' in the array).

In our case, the flagwaver does have a little specific knowledge, in that we have coded a few functions which, if the specified order has them next to each other, can be run at the same time, so we can parallelise as much as possible; but that is in a specific subclass of the pluggable base module, and can be overridden.
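The dispatch loop itself can be sketched in a few lines (the component names and the shape of the job ids here are illustrative only):

```perl
use strict;
use warnings;

# each component receives the job ids it must wait on, and returns its own
my %component = (
    align   => sub { my @deps = @_; return ('job_1'); },
    archive => sub { my @deps = @_; return ('job_2', "after:@deps"); },
);

# the flagwaver just walks the user-specified order, threading dependencies
sub run_order {
    my (@order) = @_;
    my @required;
    for my $name (@order) {
        @required = $component{$name}->(@required);
    }
    return @required;
}

my @last = run_order(qw(align archive));
print "@last\n";    # prints: job_2 after:job_1
```

Note how the flagwaver knows nothing about what the components do: it only threads the returned requirements from one to the next.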

Friday, 26 June 2009

Playing Badminton through Clearpress

Yesterday I launched, finally, my badminton ladder web application for our sports and social club. I originally wrote something about 3 years ago, which used flat CSV files and a CGI script for each page, which dealt with generating the HTML, processing results, updating the ladder, and....

All not very practical, but at least showed what I could do at the time.

Earlier this year we had a change of hardware for our webservers, and we lost the functionality, due to the time lag in keeping all the files up to date as they changed across the multiple servers. A big problem.

So I decided it was time for a rewrite, which I did using Clearpress (http://clearpress.net/), trialling out git and github at the same time (I like this so much more than sourceforge! and svn).

I have blogged about Clearpress and its MVC framework before, so I won't spend time doing so, but with a little work, the setup used 5 tables in a database to produce a reliable system.

team - stores a team name, wins, losses and gives them a unique identifer
player - stores a player name, email and gives them a unique identifier
player_team - join table for a player to a team
ladder_type - our ladder has three sub ladders, to deal with new teams and those which haven't played for a long time
ladder - links team/position/ladder_type

The whole thing can be found on github


It is currently set up to deploy using a SQLite database, but in production use we are using a MySQL database, and the schema is there. You just need to modify the config.ini file to use a MySQL database, which is supported through ClearPress.

So, if you are after a web app badminton ladder, then take a look. It is all available as Open Source (GNU Public Licence).

Next, to create a Tennis Competition app.

Thursday, 11 June 2009

lc(x) - possible bad practice?

use strict; use warnings;

We all use the above, right? Well, certainly we should, and my team does.

Now this combination always causes a warning when testing an undefined value (the exception being when the test is for it to be undef).

my $arrayref = $self->method_returning_array_ref() || [];

Now, assuming that the method will always return an arrayref or undef, I will have an arrayref.

Now, it is common enough (returning stuff from an XML dom for example) that there is actually only 1 thing in the array, so I test against it

if ($arrayref->[0] eq 'yes') {
    # do something...
}

Now, if $arrayref->[0] is undef, this always throws an uninitialised variable warning. This normally leads me to change to

if ($arrayref->[0] && $arrayref->[0] eq 'yes') {
    # do something...
}

so that I don't spam up the logs with warnings.

However, we discovered today that lowercasing the $arrayref->[0] variable turns an undef into an empty string (for the conditional), therefore dispensing with any warnings.

if (lc $arrayref->[0] eq 'yes') {
    # do something...
}

Is this good or bad coding practice?

Reasons for it to be good
- You do not need an extra conditional just to dispense with the warning
- code less

Reasons for it to be bad
- It doesn't seem right
- The conditional is no longer testing against the pure result
- Are we getting rid of the warning for the wrong reason?
- Does it read correctly?

At this time, it is not something that the code police (perlcritic) seem to think is a bad practice, and certainly will make less code for us. It just seems like we are breaking an unwritten coding rule.
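For completeness, on perl 5.10+ there is a third option: the defined-or operator, which states the intent explicitly without either the extra conditional or the lc trick:

```perl
use strict;
use warnings;

my $arrayref = [];    # element 0 is undef, as in the XML-DOM case

# // supplies an empty string only when the value is undefined,
# so there is no uninitialised-value warning and no case-folding
my $empty_matches = ( ($arrayref->[0] // q{}) eq 'yes' ) ? 1 : 0;

my $yes_ref = ['yes'];
my $yes_matches = ( ($yes_ref->[0] // q{}) eq 'yes' ) ? 1 : 0;
```

Unlike lc, this also leaves the comparison testing the pure result, which addresses two of the 'reasons for it to be bad' above.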

Wednesday, 20 May 2009

Stomping on the Rabbit, Pt2

So now I have this all installed, I have to start using it. Net::Stomp (see cpan.org) is the interface I would like to use, but the setup seems very much geared to ActiveMQ (some of the headers are 'activemq.xxxxx').

This means that by default, I only seem to be able to send and receive messages to a queue that exists only whilst someone is listening.

The example scripts that come with the RabbitMQ-Stomp distribution only show this much for Perl; the Ruby examples, however, show much more (persistent queues, topics, etc.).

Thankfully, I found the following post


which explains the headers for RabbitMQ to produce persistent queues and operating topics. For Net::Stomp, the key is to add these headers to the key-value pairs in your connection/send hash, and happily they go through. Here are some examples:

Persistent Queues:

Receiver -

my $stomp = Net::Stomp->new({hostname=>'localhost', port=>'61613'});
$stomp->connect({login=>'guest', passcode=>'guest'}) or croak $EVAL_ERROR;

$SIG{INT} = sub {
    $stomp->unsubscribe({ destination => q(/queue/father/ted) });
    $stomp->disconnect;
    exit;
};

$stomp->subscribe({
    destination => q(/queue/super/ted),
    q{auto-delete} => q{false}, # setting these flags will make your queue remain whilst
    q{durable} => q{true},      # the subscriber(s) go away, and pick up messages afterwards
    ack => q{client},
});

while (1) {
    my $frame = $stomp->receive_frame;
    print $frame;
    print $frame->body . "\n";
    last if $frame->body eq 'QUIT';
}


Sender -

my $stomp = Net::Stomp->new({hostname=>'localhost', port=>'61613'});
$stomp->connect({login=>'guest', passcode=>'guest'}) or croak $EVAL_ERROR;
$stomp->send({destination => '/queue/super/ted',
body=>($ARGV[0] or "test\0message")});


Topics are handled differently in RabbitMQ to ActiveMQ. In ActiveMQ, they sit under a namespace /topic/xxx/yyy and a combination of the client-id, destination, exchange and routing key all make up the subscriber to the topic

In the case of RabbitMQ, it seems a bit more generic than that, but has some differences which make it (as far as I can see) non-persistent.

Receiver -

my $stomp = Net::Stomp->new({hostname=>'localhost', port=>'61613'});
$stomp->connect({login=>'guest', passcode=>'guest'}) or croak $EVAL_ERROR;

$SIG{INT} = sub {
    $stomp->disconnect;    # tidy up on interrupt
    exit;
};

$stomp->subscribe({
    destination => q{bananaman}, # needs to be a unique client-id
    exchange => q{amq.topic},
    routing_key => q{bananas},
});

while (1) {
    my $frame = $stomp->receive_frame;
    print $frame;
    print $frame->body . "\n";
    last if $frame->body eq 'QUIT';
}


This sets up a receiver, which if you use rabbitmqctl list_queues shows a queue bananaman which will receive messages (whilst he is listening) to the topic bananas

using the following gives an anonymous looking receiver queue

$stomp->subscribe({
    id => q{banaman}, # needs to be a unique client-id
    destination => q{},
    exchange => q{amq.topic},
    routing_key => q{bananas},
});

which you can unsubscribe from explicitly

$stomp->unsubscribe({ id => q{banaman} });

However, it would appear that the unsubscription occurs anyway, so unfortunately, any posts to the topic bananas would be lost to bananaman whilst he is away.

Sender for topics -

my $stomp = Net::Stomp->new({hostname=>'localhost', port=>'61613'});
$stomp->connect({login=>'guest', passcode=>'guest'}) or croak $EVAL_ERROR;
$stomp->send({destination => q{bananas},
exchange => q{amq.topic},
body=>($ARGV[0] or "test\0message")});

You will notice here that the destination for this message is the routing_key, and that the exchange flag is set.

The problem here is that each receiver on a particular topic (routing_key) essentially forms a temporary queue, to which the message is added and from which it is then sent. But the queue is just that, temporary: it goes away when the receiver does, so the receiver will never get messages that were sent while it was not present.

I am very keen to hear from anyone who has found out how to work around this.

The other main problem we have seen is that MQ agnosticism doesn't exist with Net::Stomp: it is written with ActiveMQ in mind and just feeds headers through. There is no conversion of the headers between ActiveMQ-STOMP, RabbitMQ-STOMP or any other message-queue STOMP implementation out there, so there is no way to hot-swap message queues without a code rewrite. Again, I'd be happy to hear from anyone who has worked out how to get around this.

As for now, it looks as though it is going to be ActiveMQ, as this has been installed centrally for us, and appears to manage the persistence a bit better, along with headers which are documented in Net::Stomp.

Friday, 15 May 2009

Stomping on the Rabbit

We are trying to make a move to using message queues in my group, to deal with pipelines and talking to other apps.

There should be some great advantages for us, and I am quite excited by this.

First thing, set up a STOMP message queue.

Now, I am currently writing a simple message_queue based on ClearPress, but this is not yet ready, so I have just downloaded and installed RabbitMQ.

RabbitMQ is written in Erlang, (which I have just set myself the challenge of learning from the Pragprog book Programming Erlang by Joe Armstrong). The first challenge was setting it up.

First: you need Erlang installed. I had already done this, so that wasn't a problem. Just make sure that erlc is in your path.

Now, I had downloaded the latest release tarball and expanded it, setting up in my sandbox area. However, this threw me a serious curveball when I tried to install STOMP support.

Thankfully, Google is my friend, and someone had hit the same problem: the STOMP adapter needs installing against the correct version number. So I hereby give the definitive installation guide to getting RabbitMQ up and running in a sandbox on Mac OS X.

(I accept no responsibility for this not working on your machine!)

1) Install Mercurial (yet another distributed version control system, but the one which RabbitMQ is on).

2) (Thanks to everyone on this page http://tinyurl.com/pk7xfd)
hg clone http://hg.rabbitmq.com/rabbitmq-server
hg clone http://hg.rabbitmq.com/rabbitmq-codegen
hg clone http://hg.rabbitmq.com/rabbitmq-stomp
(cd rabbitmq-server; hg up rabbitmq_v1_5_4)
(cd rabbitmq-codegen; hg up rabbitmq_v1_5_4)
(cd rabbitmq-stomp; hg up rabbitmq_v1_5_3)

3) (At this point, you need to check your version of python and simplejson, which needs installing)

4) In the various Makefiles, alter the source roots to point to where you want the various db and log files to go, where your RabbitMQ source root is, etc.


I set up in my sandbox a folder 'rabbitmq', in which I put a logs dir, mnesia (db files) dir and rabbit-mnesia dir (into which I put the rabbitmq.conf file)

In 'rabbitmq-server/Makefile'


5) make -C rabbitmq-server
make -C rabbitmq-stomp run

Hey presto, you now have a rabbitmq-stomp server up and running.

Now you have done this, you can try either the Perl or Ruby tests.

Hope this guide is of use to someone.



Friday, 27 March 2009

Splitting a Project

As part of the Group I am in, we have had a project sanger-pipeline. Now, early on in the life of the group, it was quite small, and was essentially just a few wrappers around the Illumina Analysis pipeline, so that it hooked into our tracking system, and scripts to manage the movement of images off the individual Genome Analysers to our Farm.

However, as with most things, it got bigger...

and bigger...

and bigger until it was managing not only the wrappers (which were being maintained by someone who was in another group), and the movement of images, but processing other files, loading compressed versions of the images into other databases, deciding when it had everything it expected, what and when exactly to delete outdated files...and managing an apache webservice!

So, over the last 2 weeks, I have been splitting the project down. This was not a trivial thing to do, as many modules were used in multiple places, but split it I did, into the following:

sanger-pipeline - this now only has wrapper scripts and modules relating to hooking the Illumina Analysis pipeline in to our system.

instrument_handling - this deals with scripts and modules relating to the smooth running and mirroring of data off the GAs, plus our controlcentre script, whose primary use is to run these as a special user (although it does do some other things).

data_handling - all these scripts and modules are responsible for managing data on the farm (once it has been mirrored), including checks to ensure that data needing long-term storage is where it should be.

sflogin_web_apps - the management of the apache server and the scripts that we run on the farm (instead of the usual webblades) to have access to files and directories located there.

At this time, the 4 projects are still linked, inasmuch as the top-level package namespace for the modules is srpipe:: (which means we need to be careful that further down the line we don't accidentally create two things which conflict) and many modules use modules from some of the other projects (keeping the code as DRY as possible).

However, these slight drawbacks are minor compared to the ability now to apply patches and new code to one project without accidentally deploying a broken development version of something else (which was the tricky thing when the wrapper scripts were managed by someone separate from the group responsible for, for example, the mirroring of data).

So hopefully, now we have a much more stable codebase, and will be able to keep things running much more smoothly (or at least, develop much more fluidly).

The test coverage needs improvement, but this is something we can now work on. A fair amount of code had no tests at all beyond one which checked it compiled, so a revamp of the test suite was certainly in order; with it, we can code in checks to ensure that functionality doesn't change where code is reused in different projects.

Exciting times or screams ahead, who knows, but at least with hopefully more manageable project sizes, we should be able to aim for exciting...