Wednesday, January 29, 2014

Bad SQL Writing Put to Good Use

There's a certain style of writing SQL that I really don't like. Here's an example that pulls names and addresses for people in Wisconsin and Illinois:

select p.name, a.address
from person p, person_address pa, address a
where p.person_id = pa.person_id
and pa.address_id = a.address_id
and a.state in ('WI', 'IL')
;


Basically, the author has stacked all the tables into the FROM clause and specified how they join in the WHERE clause. This obscures how the tables are meant to be joined, and it mixes the actual filter criteria in with the join conditions. It does work, though: the code produces inner joins between all the tables, and inner joins are what this query needs.

My preferred style is like so:

select p.name, a.address
from person p
left join person_address pa
on p.person_id = pa.person_id
inner join address a
on pa.address_id = a.address_id
where a.state in ('WI', 'IL')
;


It's a bit more verbose, but that helps the reader. This style gives each table its own JOIN clause, so the join types and join fields are clearly indicated, and the intention behind each join is clear to anyone reading the code. The filter criteria sit alone in the WHERE clause, with no join conditions to confuse them with. There are still cases where the result may not be what you expect given the filter criteria (here, the inner join to address and the filter on a.state drop anyone without a matching address, despite the left join), but it's easier to debug.

Overall, the first example is a confusing style to use, and it can cause trouble if the joins were intended to be outer joins, because the style has no standard way to specify them; at best there is vendor-specific notation, such as Oracle's (+). (LEFT JOIN is short for LEFT OUTER JOIN, which means, basically: return every record from the first table, along with any matching data from the next, without dropping the records from the first that have no match.)
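As a rough sketch of what an outer join looks like in the explicit style, the query below reuses the person, person_address, and address tables from the examples above and keeps people who have no Wisconsin or Illinois address. Moving the state filter from WHERE into the ON clause is one way to get that behavior; it is an illustration, not something the original queries do:

```sql
-- Every person appears at least once. address is NULL when the person
-- has no linked address, or when a linked address is outside WI/IL.
select p.name, a.address
from person p
left join person_address pa
  on p.person_id = pa.person_id
left join address a
  on pa.address_id = a.address_id
 and a.state in ('WI', 'IL')   -- filter in ON, not WHERE, so unmatched rows survive
;
```

Putting the same condition back into WHERE would discard the NULL rows and quietly turn the result into an inner join again.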

The other day, though, I encountered a case where this potentially error-prone style is actually very useful.

I wanted to generate a master list of all potential ZIP codes in the US, and then filter out ones that are not in use or are otherwise invalid. I started by creating a small table of 10 rows with a single column holding the numbers 0-9. Since plain SQL has no loop statement, I populated the table like so:

create temporary table num (n int);
insert into num values (0);
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
delete from num where n > 9;


That last statement is just in case I ran it too many times. I was even too lazy to write out 1-9, instead just repeating the max+1 statement (ten times here, which is one more than needed, so the delete has a stray 10 to clean up). It's a bit over-the-top, but it works.
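For databases that support recursive common table expressions (PostgreSQL, SQLite, and MySQL 8 do; whether the system used here does is an assumption), the same num table can be filled more compactly. A minimal sketch, meant to be run in place of the statements above rather than after them:

```sql
create temporary table num (n int);

-- The recursive part keeps adding 1 until it reaches 9,
-- so the table ends up with exactly the digits 0 through 9.
insert into num
with recursive digits (n) as (
    select 0
    union all
    select n + 1 from digits where n < 9
)
select n from digits
;
```

Because the stopping condition is built into the query, no cleanup delete is needed afterward.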

To get my "master" list of ZIP codes, I joined five copies of the table together, one for each character in the ZIP code (because they can start with 0s, they should be treated as characters, not numbers!). Here's how:

create table master_zip as
select n1.n||n2.n||n3.n||n4.n||n5.n as zip
from num n1, num n2, num n3, num n4, num n5
order by 1
;


Simple, isn't it? Essentially, this just makes a big Cartesian product of five copies of the table. There's no WHERE clause, because there's no need to join on anything. (If a system required it, I would just write "where 1=1".) This generates 10^5 = 100,000 records. That's about 58,000 too many, according to the US Postal Service, so we need to delete those that are not in use. But that process is for another time.
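A quick sanity check on the result might look like the sketch below. It assumes master_zip was built as above from a num table holding exactly the digits 0-9, and that || produced text values in your database:

```sql
select count(*) as zip_count,   -- 100000 if num held exactly ten digits
       min(zip) as first_zip,   -- '00000' if the leading zeros were kept
       max(zip) as last_zip     -- '99999'
from master_zip
;
```

If first_zip comes back as 0 rather than 00000, the values were stored as numbers somewhere along the way and the leading zeros are gone.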

Using a style that I usually warn against was interesting and useful, but it requires knowing a bit more about how these things work. Had I run this last bit of code against larger tables, the row count would have multiplied just the same (five copies of a 1,000-row table would mean 10^15 rows), and it would have caused a lot of problems, like running out of disk space or RAM. So use this style sparingly and carefully!
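If you want the Cartesian product but prefer the explicit join style, CROSS JOIN states the intent outright. A sketch of the same ZIP query in that form, assuming the num table from above and that || is string concatenation, as in the original query; the casts to char(1) are an addition for databases that will not concatenate integers directly:

```sql
-- Same 100,000 combinations as the comma-separated version,
-- but the Cartesian product is spelled out rather than implied.
select cast(n1.n as char(1))
    || cast(n2.n as char(1))
    || cast(n3.n as char(1))
    || cast(n4.n as char(1))
    || cast(n5.n as char(1)) as zip
from num n1
cross join num n2
cross join num n3
cross join num n4
cross join num n5
order by 1
;
```

A reader who sees CROSS JOIN knows the missing join condition is deliberate, which the comma style cannot signal.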