Making sense of a new table of open data in your database

Let’s say we download a bunch of public domain data that we’re considering for use in a service we’re building. Like so much of the data grudgingly made available to the public by government organizations, the data comes with no ICD, no comments, no accompanying documents at all. The only thing we know are the general domain (let’s say – species of flowers) and the field names. We don’t know what the primary key is, or even if any single field can act as the primary key. In this post we’ll use a locally installed sqlite database, perfect for small and quick experimentation. For larger apps postgresql is recommended).

This is what flower_species looks like (completely fanciful – like how often the flower has been mentioned in poetry – these are not really from any source):

color _cd	petal _cd	stem _cd	leaf _cd	root _cd	env _cd	temp _cd	wtr _cd	tree _cd	ant _cd	poem _cd
701	6222	77002	34	774	17	30	453	22	87	5
232	1276	50505	32	754	12	15	34	71	65	0
701	5111	90109	54	123	65	18	66	63	300	17
…	…	…	…	…	…	…	…	…	…	…

flower_species. Contains details for each species

First things first – let’s see what uniquely identifies each record.

Let’s start out by seeing how many records there are:

sqlite> select count(*) from flower_species;
58691

Okay, so there are 58, 691 records all in all.

There are also a lot of fields for each record. They currently don’t mean much to us, but what we want to know right now is just which fields, when evaluated together, can uniquely identify a record. One way of doing this is by using the distinct keyword.

sqlite> select count() from (select distinct petal_cd from flower_species);
2540 
sqlite> select count() from (select distinct stem_cd from flower_species);
218

Now let’s try stems and petals together:

sqlite> select count() from (select distinct petal_cd, stem_cd from flower_species);
28356

Much better! We’ll throw in a couple more fields:

sqlite> select count() from (select distinct type_cd, petal_cd, stem_cd from flower_species);
30927
sqlite> select count(*) from (select distinct color_cd, type_cd, petal_cd, stem_cd from flower_species);
58691

Great! We’ve found that these four fields – color_cd, type_cd, petal_cd, stem_cd – are enough to identify a record. That type_cd looks a bit superfluous, though. Let’s try without it:

sqlite> select count(*) from (select distinct color_cd, petal_cd, stem_cd from flower_species);
58691

It turns out we only need three fields – color_cd, petal_cd, stem_cd – for this task.

Instead of distinct, we can also use group by:

sqlite> select count(*) from (select color_cd, petal_cd, stem_cd from flower_species group by color_cd, petal_cd, stem_cd);
58691

To conclude: When referencing this table from another, it’s enough to refer to these three fields together in order to retrieve all the relevant information from the flower_species table.

How is this helpful? Let’s say there’s another table, of individual flower people. There is one unique identifier, which may be a natural or a surrogate key, with the three fields we saw in flower_species.

flower_person_id	color_cd	petal_cd	stem_cd
111-99-0987	701	5111	90109
212-98-9876	265	4567	80208
313-97-8765	986	5111	70307
414-96-7654	845	6547	60406
515-95-6543	232	1276	50505
616-94-5432	967	4576	40604
717-93-4321	701	0978	30703
818-92-3210	265	3586	20802
919-91-2109	701	6547	10901
…	…	…	…

flower_person. Contains information for individual flowers

sqlite> select count(*) from flower_person;
3406469

3,406,469 persons of a flowery persuasion.

Now if we wish to retrieve additional data about a flower, we can make a join between the two tables:

select 
fp.flower_person_id, fp.color_cd, fp.petal_cd, fp.stem_cd, fs.leaf_cd, fs.root_cd, fs.env_cd, fs.temp_cd, fs.wtr_cd, fs.tree_cd, fs.ant_cd, fs.poem_cd
from
flower_person fp inner join flower_species fs on
fp.color_cd = fs.color_cd and fp.petal_cd = fs.petal_cd and fp.stem_cd = fs.stem_cd and fp.flower_person_id = '111-99-0987';

This will result in a detailed row containing data from both tables for that particular flower person.

There are some other ways of making sense of an undocumented database, we’ll go over some anecdotal experiences in another post.

Danielsq

Making sense of a new table of open data in your database

Leave a Reply Cancel reply