Friday 31 July 2009

Pluggable Pipelines

My current main project is to replace the end of our analysis pipeline. It is currently all in one module, archive.pm. As with many things that live in a single file, it has been extended and changed around over time, and has become unmanageable.

We have also decided that we want to make it pluggable, so that we can easily add or remove parts depending on the project requirements.

For this I looked at creating a flagwaver, whose job is to take an array of function names and launch each in turn, capturing any return values and passing them on as requirements that may need to be fulfilled before the next function can do its work.
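As a rough illustration of that loop, here is a minimal sketch in Python (the real module is Perl, and none of this is the actual code). The names run_pipeline, registry, align and archive are made up for the example, and it assumes each function receives only what the previous one returned, which may not match the real behaviour.

```python
def run_pipeline(function_names, registry):
    """Call each named function in order, handing on what the previous one returned."""
    requirements = []  # nothing is required before the first function runs
    for name in function_names:
        function = registry[name]           # look the function up by its name
        returned = function(requirements)   # the function sees only incoming requirements
        requirements = list(returned or [])  # assumption: only the latest return value is passed on
    return requirements


# Two hypothetical functions, neither of which knows about the other.
def align(requirements):
    return [101, 102]  # pretend these are job ids


def archive(requirements):
    return [job_id + 1000 for job_id in requirements]


registry = {"align": align, "archive": archive}
final_requirements = run_pipeline(["align", "archive"], registry)
```

Swapping the order of the names, or dropping one, needs no change to either function, which is the point of making the pipeline pluggable.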

The functions have no knowledge of anything except what requirements may come in, what information the object they control requires and how to capture and return further requirements for any processes further down the line.

Most importantly, the functions know nothing about any other function.

Each function loads an object, and gives it the parameters it needs. These objects do the real work, which can be updating statuses, submitting jobs to LSF, manipulating files, or obtaining data from databases or web services. All they have to return to the function which called them is whatever the next process might need to know in order to do its own work. In our case that is an array of job ids from LSF submissions, to be used as job dependencies.
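To make the split between function and object concrete, here is a small Python sketch (again, the real code is Perl). QcJobSubmitter, qc and run_id are hypothetical names, and submit() only invents job ids rather than talking to LSF. The dependency string follows the done(job_id) form that LSF's bsub accepts with its -w option, but how the real objects build their dependencies is an assumption here.

```python
class QcJobSubmitter:
    """Hypothetical worker object: it does the real work, the function only drives it."""

    def __init__(self, run_id, depends_on):
        self.run_id = run_id
        self.depends_on = list(depends_on)

    def dependency_expression(self):
        # Build an expression such as "done(101) && done(102)" from incoming job ids.
        return " && ".join(f"done({job_id})" for job_id in self.depends_on)

    def submit(self):
        # Stand-in for a real submission: return made-up job ids instead of calling LSF.
        return [2001, 2002]


def qc(requirements, run_id="run_42"):
    """Pipeline function: load the object, give it its parameters, return job ids."""
    worker = QcJobSubmitter(run_id=run_id, depends_on=requirements)
    return worker.submit()
```

The function knows which parameters its object wants and what to hand back, and nothing else.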

The structure is therefore a flagwaver which calls the functions in a user-specified order; each function in turn calls an object which submits jobs and returns dependencies for the next function.

What this means is that the flagwaver has very little responsibility itself. It relies on the user ensuring that any individual components will complete successfully (or at least error sensibly), and that the user has specified an order of components which will work (i.e. if 'B' depends upon an output of 'A', then they have put 'A' before 'B' in the array).

In our case, the flagwaver does have a little specific knowledge: we have coded a few functions which, if the specified order puts them next to each other, can be run at the same time, so that we parallelise as much as possible. That knowledge lives in a specific subclass of the pluggable base module, and can be overridden.
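One way this could look, sketched in Python rather than the Perl of the real module: a base class that runs every function on its own, and a subclass that groups neighbouring functions it knows are independent. The class names, group_steps and can_run_together are all invented for the example. Running "at the same time" is modelled here as giving every function in a group the same incoming requirements, which is only one reading of the idea.

```python
class Pipeline:
    """Base flagwaver: every function is its own group and runs strictly in order."""

    def __init__(self, registry):
        self.registry = registry

    def group_steps(self, names):
        return [[name] for name in names]

    def run(self, names):
        requirements = []
        for group in self.group_steps(names):
            returned = []
            for name in group:
                # Every function in a group gets the same incoming requirements.
                returned.extend(self.registry[name](requirements))
            requirements = returned
        return requirements


class ParallelAwarePipeline(Pipeline):
    """Subclass holding the specific knowledge of what may run side by side."""

    can_run_together = {"qc", "archive"}  # hypothetical function names

    def group_steps(self, names):
        groups = []
        for name in names:
            previous = groups[-1] if groups else None
            if (previous and name in self.can_run_together
                    and previous[-1] in self.can_run_together):
                previous.append(name)  # neighbour is compatible: join its group
            else:
                groups.append([name])
        return groups
```

A project that does not want this behaviour can use the base class, or override group_steps with its own rules.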