Let’s say we download a bunch of public domain data that we’re considering for use in a service we’re building. Like so much of the data grudgingly made available to the public by government organizations, the data comes with no ICD, no comments, no accompanying documents at all. The only thing we know are the general domain (let’s say – species of flowers) and the field names. We don’t know what the primary key is, or even if any single field can act as the primary key. In this post we’ll use a locally installed sqlite database, perfect for small and quick experimentation. For larger apps postgresql is recommended).
This is what flower_species looks like (completely fanciful – like how often the flower has been mentioned in poetry – these are not really from any source):
color _cd | petal _cd | stem _cd | leaf _cd | root _cd | env _cd | temp _cd | wtr _cd | tree _cd | ant _cd | poem _cd |
701 | 6222 | 77002 | 34 | 774 | 17 | 30 | 453 | 22 | 87 | 5 |
232 | 1276 | 50505 | 32 | 754 | 12 | 15 | 34 | 71 | 65 | 0 |
701 | 5111 | 90109 | 54 | 123 | 65 | 18 | 66 | 63 | 300 | 17 |
… | … | … | … | … | … | … | … | … | … | … |
First things first – let’s see what uniquely identifies each record.
Let’s start out by seeing how many records there are:
sqlite> select count(*) from flower_species;
58691
Okay, so there are 58, 691 records all in all.
There are also a lot of fields for each record. They currently don’t mean much to us, but what we want to know right now is just which fields, when evaluated together, can uniquely identify a record. One way of doing this is by using the distinct
keyword.
sqlite> select count() from (select distinct petal_cd from flower_species);
2540
sqlite> select count() from (select distinct stem_cd from flower_species);
218
Now let’s try stems and petals together:
sqlite> select count() from (select distinct petal_cd, stem_cd from flower_species);
28356
Much better! We’ll throw in a couple more fields:
sqlite> select count() from (select distinct type_cd, petal_cd, stem_cd from flower_species);
30927
sqlite> select count(*) from (select distinct color_cd, type_cd, petal_cd, stem_cd from flower_species);
58691
Great! We’ve found that these four fields – color_cd, type_cd, petal_cd, stem_cd – are enough to identify a record. That type_cd looks a bit superfluous, though. Let’s try without it:
sqlite> select count(*) from (select distinct color_cd, petal_cd, stem_cd from flower_species);
58691
It turns out we only need three fields – color_cd, petal_cd, stem_cd – for this task.
Instead of distinct
, we can also use group by
:
sqlite> select count(*) from (select color_cd, petal_cd, stem_cd from flower_species group by color_cd, petal_cd, stem_cd);
58691
To conclude: When referencing this table from another, it’s enough to refer to these three fields together in order to retrieve all the relevant information from the flower_species table.
How is this helpful? Let’s say there’s another table, of individual flower people. There is one unique identifier, which may be a natural or a surrogate key, with the three fields we saw in flower_species.
flower_person_id | color_cd | petal_cd | stem_cd |
111-99-0987 | 701 | 5111 | 90109 |
212-98-9876 | 265 | 4567 | 80208 |
313-97-8765 | 986 | 5111 | 70307 |
414-96-7654 | 845 | 6547 | 60406 |
515-95-6543 | 232 | 1276 | 50505 |
616-94-5432 | 967 | 4576 | 40604 |
717-93-4321 | 701 | 0978 | 30703 |
818-92-3210 | 265 | 3586 | 20802 |
919-91-2109 | 701 | 6547 | 10901 |
… | … | … | … |
sqlite> select count(*) from flower_person;
3406469
3,406,469 persons of a flowery persuasion.
Now if we wish to retrieve additional data about a flower, we can make a join between the two tables:
select
fp.flower_person_id, fp.color_cd, fp.petal_cd, fp.stem_cd, fs.leaf_cd, fs.root_cd, fs.env_cd, fs.temp_cd, fs.wtr_cd, fs.tree_cd, fs.ant_cd, fs.poem_cd
from
flower_person fp inner join flower_species fs on
fp.color_cd = fs.color_cd and fp.petal_cd = fs.petal_cd and fp.stem_cd = fs.stem_cd and fp.flower_person_id = '111-99-0987';
This will result in a detailed row containing data from both tables for that particular flower person.
There are some other ways of making sense of an undocumented database, we’ll go over some anecdotal experiences in another post.