Let’s say we download a bunch of public domain data that we’re considering for use in a service we’re building. Like so much of the data grudgingly made available to the public by government organizations, the data comes with no ICD, no comments, no accompanying documents at all. The only thing we know are the general domain (let’s say – species of flowers) and the field names. We don’t know what the primary key is, or even if any single field can act as the primary key. In this post we’ll use a locally installed sqlite database, perfect for small and quick experimentation. For larger apps postgresql is recommended).
This is what flower_species looks like (completely fanciful – like how often the flower has been mentioned in poetry – these are not really from any source):
First things first – let’s see what uniquely identifies each record.
Let’s start out by seeing how many records there are:
sqlite> select count(*) from flower_species; 58691
Okay, so there are 58, 691 records all in all.
There are also a lot of fields for each record. They currently don’t mean much to us, but what we want to know right now is just which fields, when evaluated together, can uniquely identify a record. One way of doing this is by using the
sqlite> select count() from (select distinct petal_cd from flower_species); 2540 sqlite> select count() from (select distinct stem_cd from flower_species); 218
Now let’s try stems and petals together:
sqlite> select count() from (select distinct petal_cd, stem_cd from flower_species); 28356
Much better! We’ll throw in a couple more fields:
sqlite> select count() from (select distinct type_cd, petal_cd, stem_cd from flower_species); 30927 sqlite> select count(*) from (select distinct color_cd, type_cd, petal_cd, stem_cd from flower_species); 58691
Great! We’ve found that these four fields – color_cd, type_cd, petal_cd, stem_cd – are enough to identify a record. That type_cd looks a bit superfluous, though. Let’s try without it:
sqlite> select count(*) from (select distinct color_cd, petal_cd, stem_cd from flower_species); 58691
It turns out we only need three fields – color_cd, petal_cd, stem_cd – for this task.
distinct, we can also use
sqlite> select count(*) from (select color_cd, petal_cd, stem_cd from flower_species group by color_cd, petal_cd, stem_cd); 58691
To conclude: When referencing this table from another, it’s enough to refer to these three fields together in order to retrieve all the relevant information from the flower_species table.
How is this helpful? Let’s say there’s another table, of individual flower people. There is one unique identifier, which may be a natural or a surrogate key, with the three fields we saw in flower_species.
sqlite> select count(*) from flower_person; 3406469
3,406,469 persons of a flowery persuasion.
Now if we wish to retrieve additional data about a flower, we can make a join between the two tables:
select fp.flower_person_id, fp.color_cd, fp.petal_cd, fp.stem_cd, fs.leaf_cd, fs.root_cd, fs.env_cd, fs.temp_cd, fs.wtr_cd, fs.tree_cd, fs.ant_cd, fs.poem_cd from flower_person fp inner join flower_species fs on fp.color_cd = fs.color_cd and fp.petal_cd = fs.petal_cd and fp.stem_cd = fs.stem_cd and fp.flower_person_id = '111-99-0987';
This will result in a detailed row containing data from both tables for that particular flower person.
There are some other ways of making sense of an undocumented database, we’ll go over some anecdotal experiences in another post.