Making sense of a new table of open data in your database

Let’s say we download a bunch of public domain data that we’re considering for use in a service we’re building. Like so much of the data grudgingly made available to the public by government organizations, the data comes with no ICD, no comments, no accompanying documents at all. The only thing we know are the general domain (let’s say – species of flowers) and the field names. We don’t know what the primary key is, or even if any single field can act as the primary key. In this post we’ll use a locally installed sqlite database, perfect for small and quick experimentation. For larger apps postgresql is recommended).

This is what flower_species looks like (completely fanciful – like how often the flower has been mentioned in poetry – these are not really from any source):

color
_cd
petal
_cd
stem
_cd
leaf
_cd
root
_cd
env
_cd
temp
_cd
wtr
_cd
tree
_cd
ant
_cd
poem
_cd
70162227700234774173045322875
2321276505053275412153471650
701511190109541236518666330017
flower_species. Contains details for each species

First things first – let’s see what uniquely identifies each record.

Let’s start out by seeing how many records there are:

sqlite> select count(*) from flower_species;
58691

Okay, so there are 58, 691 records all in all.

There are also a lot of fields for each record. They currently don’t mean much to us, but what we want to know right now is just which fields, when evaluated together, can uniquely identify a record. One way of doing this is by using the distinct keyword.

sqlite> select count() from (select distinct petal_cd from flower_species);
2540 
sqlite> select count() from (select distinct stem_cd from flower_species);
218

Now let’s try stems and petals together:

sqlite> select count() from (select distinct petal_cd, stem_cd from flower_species);
28356 

Much better! We’ll throw in a couple more fields:

sqlite> select count() from (select distinct type_cd, petal_cd, stem_cd from flower_species);
30927
sqlite> select count(*) from (select distinct color_cd, type_cd, petal_cd, stem_cd from flower_species);
58691

Great! We’ve found that these four fields – color_cd, type_cd, petal_cd, stem_cd – are enough to identify a record. That type_cd looks a bit superfluous, though. Let’s try without it:

sqlite> select count(*) from (select distinct color_cd, petal_cd, stem_cd from flower_species);
58691

It turns out we only need three fields – color_cd, petal_cd, stem_cd – for this task.

Instead of distinct, we can also use group by:

sqlite> select count(*) from (select color_cd, petal_cd, stem_cd from flower_species group by color_cd, petal_cd, stem_cd);
58691

To conclude: When referencing this table from another, it’s enough to refer to these three fields together in order to retrieve all the relevant information from the flower_species table.

How is this helpful? Let’s say there’s another table, of individual flower people. There is one unique identifier, which may be a natural or a surrogate key, with the three fields we saw in flower_species.

flower_person_idcolor_cdpetal_cdstem_cd
111-99-0987701511190109
212-98-9876265456780208
313-97-8765986511170307
414-96-7654845654760406
515-95-6543232127650505
616-94-5432967457640604
717-93-4321701097830703
818-92-3210265358620802
919-91-2109701654710901
flower_person. Contains information for individual flowers
sqlite> select count(*) from flower_person;
3406469

3,406,469 persons of a flowery persuasion.

Now if we wish to retrieve additional data about a flower, we can make a join between the two tables:

select 
fp.flower_person_id, fp.color_cd, fp.petal_cd, fp.stem_cd, fs.leaf_cd, fs.root_cd, fs.env_cd, fs.temp_cd, fs.wtr_cd, fs.tree_cd, fs.ant_cd, fs.poem_cd
from
flower_person fp inner join flower_species fs on
fp.color_cd = fs.color_cd and fp.petal_cd = fs.petal_cd and fp.stem_cd = fs.stem_cd and fp.flower_person_id = '111-99-0987';

This will result in a detailed row containing data from both tables for that particular flower person.

There are some other ways of making sense of an undocumented database, we’ll go over some anecdotal experiences in another post.

Leave a Reply

Your email address will not be published. Required fields are marked *