Wednesday, 19 March 2008

Composite Primary Keys - Yuuurrghhh!

Well, nearly.

You have seen me write how I think that Composite Primary Keys are above all else a BAD THING(tm).

Well, I think they still are, as Primary Keys.

However, there is no reason not to have a unique combination of keys in a table. This I have never had an issue with.

So why?

Well, look at our tagging example. We don't want to assign a tag to an entity_type twice in our frequency table, at this point we just want to increment the frequency count. Here we hit a snag.
When we send info to our model, we don't know the id_tag_frequency, only the id_tag and the id_entity_type. This isn't a unique primary key.

However, not to worry. We just need a constraint to ensure that the combination here is unique (tags will be assigned to multiple entity_types, and entity_types will have multiple tags).

If we do this, then you can do a query lookup to return just one row.

So why not make this the Primary Key?

Well, if you do this, then you could lose out on simple coding like this

if (unique a + b) and not primary key { fetch primary key from database or create new entry }

but, more importantly, if later on you need to reference the table row in another table (which is the benefit of relational databases, isn't it?) you only need the single primary key put in the new table, instead of all the parts which make up the composite key. This maintains a DRY principle within the database, because, should you need to change part of the composite uniqueness in table 1, you don't need to do it in table 2 as well.

So, never worry about needing a unique combination of fields, that is great, but always have a unique single field primary key on the row. This should help future proof your database when you need new tables, and will make it much easier to create new programs, as you have now given people the option of searching via the composite unique key, or the single primary key.

An important aside to this. If you use an iscurrent boolean for your rows. It can be difficult to make this part of the unique composite key (because only 1 will be current), but do ensure that your code makes only 1 current for the entity it relates to. This will ensure that anyone querying the database will always be able to find the current one. I have had this problem working with a legacy database before, due to the fact that someone had put the constraint:

entity + status + date = unique composite primary key

because date (especially if stored as datetime) will generally give you a guaranteed unique combination, new rows were getting stored as iscurrent = 1, but all other rows for that entity and status were not getting set to iscurrent = 0. Whilst you could always use the following sql

select entity, status, date from table order by date

and select the one with the most recent date, you lose the benefit of just asking

select entity, status, date from table where iscurrent = 1

if that is all you want. And again, this future proofs development of new applications, as if the table is very large (and in bioinformatics data storage, they often get very exceptionally large) then developers will need to rely on the field names to determine how to use the table.

So, if you have an iscurrent, keep it up to date, otherwise, DON'T USE IT.

Rant over. Normal service resumes whenever we can determine what is normal anyway.

Tuesday, 18 March 2008

Define Existence

Not an existential question this one, but an interesting 'bug'.

I have never used if (exists $var->{key}) {}

I don't know why, but I just have never needed to. In fact, I can only just barely remember being told about it on a course.

I have always used if (defined $var->{key}) {}

What is the difference.

To quote the Camel

A variable can only be defined if it exists, however, the reverse is not necessarily true.

So, if during an initialisation step, you do

$model->{primary_key} = [result of some sql query for primary key]

but there is no primary key result as the sql statement returned no results, then

(exists $model->{primary_key}) == true


(defined $model->{primary_key}) == false

as $model->{primary_key} == NULL

Just something I found whilst trying to bug-fix why my save kept trying to go to update (and then croaking because of, you guessed it, no primary key) when it should have been going to create.

I suppose I should have know this, but when I have always, in the past, used defined to tell me if it both exists and is defined, then it's an easy thing to miss.


Monday, 17 March 2008

Are Clouds Real?

This is a bit of a summary of what we decided in the production of our database tables for tags.

Very simply, we wanted to be able to tag different entities with tags. We also want to be able
to track the tags used for an entity type (frequency, use for that entity type), so that we
could lookup that, and offer suggestions based on the usage already of that tag - i.e. a cloud
of tags for that entity_type (see scrumptious tagging for some application of that).

So, the first is simply to have a table of tags

id_tag | tag

then we have a table of runs

id_run | other info about run

and then a join table

id_tag_run | id_run | id_tag

OK, so that is quite easy. But, we don't have any opportunity to look at any information on this,
particularly our 'clouds' of tags.

We looked at the possibility of a cloud table

id_cloud | ids_of_tags_in_cloud

But how do we know the entity (remember, we want to tag many different things)

In this case, we have an entity_type table

id_entity_type | description | iscurrent

From our models, we can get them to work out their entity_type, as our models are named
xxx::model:: where entity type is the name of the table. This is just a simple
method. So, we make sure the description matches. This gives us an id_entity_type.

id_cloud | ids_of_tags_in_cloud | id_entity_type

However, now we nothing of the frequencies the tags are used for an entity. We also need to
keep appending the tag id's to a field, which could make it difficult to represent each
individual tag in the model.

At this point, our discussion came about to what clouds actually are. (at this point the
science teacher in me said a collection of water vapour). We wondered about abstracting the
cloud, and try to collect the frequencies of a tag against the entity type

So, out with the cloud table (for now?) and in with the following

id_tag_frequency | id_entity_type | id_tag | frequency

This is looking good. We can store the frequencies that a tag has been used on an entity,
and we also get an easy lookup for all tags that have been saved against an entity, which means
that we can abstract our cloud with an sql statement like this

cloud = select e.description as entity_type, t.tag as tag, frequency
from entity_type e, tag_frequency tf, tag t
where e.description = ?
and e.id_entity_type = tf.id_entity_type
and tf.id_tag = t.id_tag;

So, do we need any more information. It is useful to know who first saved a tag for an entity, and a date.
We can add this to our tag_ join table

id_tag_run | id_run | id_tag | id_user | date

So what we have

tag table:
id_tag | tag

's table:
id_ | info about entity

tag_ tables: (one for each entity)
id_tag_ | id_ | id_tag | id_user | date

entity_type dictionary table:
id_entity_type | description | iscurrent(optional)

id_tag_frequency | id_entity_type | id_tag | frequency

and clouds are purely abstract entities created from the data in the tables. We just need to use to code
to (sql or within your program if another language) to calculate the frequency.

So, in the Clearpress MVC framework (or in Rails) we create a model for each of tag* tables with the accessors
for the table and hey presto, a setup for tagging any entities in our database. You could even tag a tag :)

Wednesday, 12 March 2008

How many tests does it take to change a lightbulb?

I gave this talk at lunchtime today during our Wednesday FooMongers meeting to probably the largest number of people who have attended. It was well received.

It focused on why we should test, and how to go about testing, mostly test-driven development wise.

Rob is putting a video up on YouTube tonight of the talk, so I'll post the link tomorrow.


From: setitesuk, 8 minutes ago

Slides for a presentation on testing given to foomongers at the EBI/Sanger Insts 2008/03/12

SlideShare Link

Tuesday, 11 March 2008

Scrumptious Tagging

So, we needed to look at, as well as writing full annotations for runs, the ability to tag runs (and other entities) with keywords. Being as we are writing a web-based application, this is very web2.0.

For examples of tagging, look at, facebook or flickr.

We wanted a style very similar to However, with the range of autocomplete/autofill javascript out there in the open source community, I couldn't find anything that mimics the way works.

In a nutshell:

Display tags already given to a run (I'll use this, but replace with entity).
Click to Add tags
Change display to an entry field which has the tags in it, a cloud of all tags which have already been assigned to runs, which you can click to add or remove from the entry field, and they highlight if already in the entry field,
and the clever bit - if the user starts typing into the field, it comes up with suggestions in another revealed field of tags from the cloud, sorted by the frequency the tags have been used, showing the ten most common for the current suggestions, changing as the user types more letters.

(You may have noticed here that we are trying to get the users to essentially try to use the same tags again and again, rather than stick in a hyphen, etc)

My boss already had a two function javascript to toggle the tags in and out from the cloud, and highlight them (with some css) if they are already in the text field, but I needed to do the rest. This I managed to some up in 4 more functions.

With my bosses permission, I have included his 2 functions (crediting him of course) and I have put the javascript on sourceforge ( You can just checkout the current svn trunk to obtain the code.