Friday, January 8, 2016

Gazetteer Database for Geographic Analysis

A couple of years ago, I had a tricky problem to solve. I inherited a tool that a group of analysts were using to allocate website search activity for clients based on ZIP code and location name (most commonly a city), matched against each client's own locations. The tool combined the output of a predictive model of website search activity with inputs from the client, including addresses, to configure the search locations that would be allocated to the client.

In addition to setting up relevant geographies based on the client's locations, the tool attempted to collect additional nearby locations that were likely relevant to the client (a "market"). The problem was that it did not find good matches for the cities, towns, and other locations people were actually using on the website. As a result, the analysts were doing quite a bit of work to correct the output, removing and adding locations by hand. It was very time-consuming, and I had to do something about it.

EDIT: I updated the following paragraph after I remembered how the algorithm was originally working. Initially I wrote that it calculated distances between locations, but it did not.

I reviewed the process and the data used to obtain location names. The algorithm did a simple lookup from ZIP code to location name, usually a city or town; it did not attempt to look up nearby location names. The data did include latitude and longitude for the locations, so I thought I'd try adding code to look up nearby locations using those coordinates. I asked around in the software development area and found that they were using a fuzzy distance calculation based on a globe (an approximate great-circle distance). When I tried it out on the existing location data, I found several problems. Some of the latitude/longitude coordinates were in the wrong state or in the middle of nowhere. The data was also missing quite a few relevant locations, like alternative names for cities and towns, as well as neighborhood names, parks, and a variety of other place names people use in web searches. I discovered it was several years out of date, and there was no chance it would be updated. So I decided the data was simply junk. I had to find a new source.
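For reference, a great-circle distance of the kind the developers described can be sketched with the haversine formula. This is my reconstruction, not the team's actual code, and the coordinates below are just a familiar city pair:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# New York City to Philadelphia: roughly 80 miles as the crow flies
print(haversine_miles(40.7128, -74.0060, 39.9526, -75.1652))
```

Treating the Earth as a sphere introduces a small error (it is really an oblate spheroid), but for "is this place nearby?" filtering, that error is negligible.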

I began searching online for government sources of location information. After all, the US government establishes ZIP codes and city and town designations, and conducts the census every ten years. The US government also has to release this data publicly, according to law. (This doesn't mean it's free, or easy to obtain.) So there must be publicly available data regarding locations. Luckily, I ended up finding a free online source: the US Gazetteer Files (see the "Places" and "ZIP Code Tabulation Areas" sections).

What's a "gazetteer"? A gazetteer is a list of information about the locations on a map. In this case, the US Gazetteer data includes latitude and longitude, useful for geographic analysis.

As I used the data, I found a few gaps, so I searched again and found the US Board on Geographic Names (see "Populated Places" under "Topical Gazetteers"). By integrating these two data sets, I had a rather comprehensive listing of all sorts of places around the US.

Next, I had to get the new location data working with the search configuration tool. The tool was written with a web front-end for the inputs, SQL to collect the data and apply the inputs, and Excel as the output format. So I had to do a bit of ETL (actually ELT, loading before transforming) to get the new location data working with the tool. I ended up designing the model pictured here:
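The load step amounts to parsing tab-delimited Gazetteer files. A minimal sketch in Python follows; the column names match the published Gazetteer "Places" layout, but the sample rows and their values are purely illustrative:

```python
import csv
import io

# A tiny inline sample in the tab-delimited layout of the Census
# Gazetteer "Places" file. Real files have more columns; these rows
# are illustrative, not real data.
sample = (
    "USPS\tGEOID\tNAME\tINTPTLAT\tINTPTLONG\n"
    "PA\t4260000\tPhiladelphia city\t40.009376\t-75.133346\n"
    "NJ\t3474000\tTrenton city\t40.223748\t-74.764012\n"
)

def load_places(handle):
    """Parse a tab-delimited Gazetteer file into simple dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        rows.append({
            "state": row["USPS"].strip(),
            "geoid": row["GEOID"].strip(),
            "name": row["NAME"].strip(),
            "lat": float(row["INTPTLAT"]),
            "lon": float(row["INTPTLONG"]),
        })
    return rows

places = load_places(io.StringIO(sample))
print(places[0]["name"])  # Philadelphia city
```

In a real load you would stream the downloaded file into staging tables first and do the cleanup in SQL, which is what the "LT" in ELT amounted to.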

The main data is in gz_place and gz_zip, storing locations and ZIP code data, respectively. On the right of gz_place are some lookup tables, including a table with alternative names (gz_name_xwalk - "xwalk" meaning crosswalk). The ZIP table references a master list of potential ZIP codes (see the prior post about creating that table), a list of invalid ZIP codes that showed up in the prior location data, and a list of ZIP codes I determined were "inside" other ZIP codes (the algorithm for that is another discussion entirely).
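A minimal sketch of the core tables in SQLite shows how the name crosswalk resolves an alternative name back to its canonical place. The table names come from the model above, but every column name and the sample data are my assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gz_place (          -- canonical places
    place_id   INTEGER PRIMARY KEY,
    state      TEXT NOT NULL,
    name       TEXT NOT NULL,
    latitude   REAL,
    longitude  REAL
);
CREATE TABLE gz_name_xwalk (     -- alternative names for a place
    place_id   INTEGER REFERENCES gz_place(place_id),
    alt_name   TEXT NOT NULL
);
CREATE TABLE gz_zip (            -- ZIP codes with their own coordinates
    zip        TEXT PRIMARY KEY,
    latitude   REAL,
    longitude  REAL
);
""")
con.execute("INSERT INTO gz_place VALUES (1, 'PA', 'Philadelphia', 40.0094, -75.1333)")
con.execute("INSERT INTO gz_name_xwalk VALUES (1, 'Philly')")

# Resolve an alternative name to the canonical place name.
row = con.execute("""
    SELECT p.name FROM gz_place p
    JOIN gz_name_xwalk x ON x.place_id = p.place_id
    WHERE x.alt_name = 'Philly'
""").fetchone()
print(row[0])  # Philadelphia
```

The crosswalk is what lets a web search for "Philly" land on the same place record as "Philadelphia" without duplicating the place row.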

The data on the left is a bit more interesting. There are some metadata tables not really connected to the rest (gz_metadata, gz_source), documenting quick facts about the data and where I found the data. Two reference tables also float off on their own, with a list of raw location names (gz_name) and a list of states (gz_state_51 - 51 to include DC), each including associated information.

Now, I didn't want the tool to calculate distances between every pair of points each time an analyst ran it, so I decided to precompute the distances and store only those within a certain proximity. There were three types of distances required: ZIP to ZIP, location to location, and location to ZIP (which also covers ZIP to location, since distance is symmetric). To limit processing, I used a mapping of states and their neighboring states to restrict the initial set of ZIPs and locations to compare. This helped to decrease the run time. I then calculated the distances between each pair of latitude/longitude coordinates and retained only those within a certain number of miles. The final, filtered results are stored in gz_distance, with a lookup table describing the distance types (gz_distance_type).
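The precomputation can be sketched as follows. The places, coordinates, neighbor-state map, and 50-mile threshold are all made-up stand-ins for the real data, and the haversine function is my reconstruction of the distance calculation:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in miles."""
    r = 3958.8
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical inputs: places keyed by id with (state, lat, lon),
# and a neighbor-state map (including the state itself) used to
# prune the candidate pairs before any distance math runs.
places = {
    1: ("PA", 40.0094, -75.1333),   # Philadelphia
    2: ("NJ", 39.9259, -75.1196),   # Camden
    3: ("CA", 34.0522, -118.2437),  # Los Angeles
}
neighbors = {
    "PA": {"PA", "NJ", "NY", "OH", "WV", "MD", "DE"},
    "NJ": {"NJ", "PA", "NY", "DE"},
    "CA": {"CA", "OR", "NV", "AZ"},
}
MAX_MILES = 50  # retention threshold

gz_distance = []
ids = sorted(places)
for i, a_id in enumerate(ids):
    sa, la1, lo1 = places[a_id]
    for b_id in ids[i + 1:]:
        sb, la2, lo2 = places[b_id]
        if sb not in neighbors[sa]:   # skip pairs in non-adjacent states
            continue
        d = haversine_miles(la1, lo1, la2, lo2)
        if d <= MAX_MILES:            # keep only nearby pairs
            gz_distance.append((a_id, b_id, round(d, 1)))

print(gz_distance)  # only the Philadelphia-Camden pair survives
```

The neighbor-state pruning is the important part: it turns an all-pairs problem over the whole country into many small regional problems, since no pair kept by the distance filter can span two non-adjacent states.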

Finally, I could get the better location data into the tool. I replaced the original code with new code that uses the new location data, doing a simple lookup of the locations specified by the client (ZIP codes) and filtering for an appropriate distance. I created a few new inputs to help the analyst tweak the distance that the tool would use to filter the crosswalk, with the idea that clients in rural areas may find a larger area more relevant, and clients in dense urban areas may find a smaller area more relevant.
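With the crosswalk precomputed, the new lookup reduces to a simple distance-filtered query. A sketch with hypothetical data follows; the places, mileages, and the 10- and 30-mile radii are illustrative only:

```python
import sqlite3

# A toy version of the precomputed ZIP-to-location crosswalk.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gz_distance (zip TEXT, place TEXT, miles REAL)")
con.executemany("INSERT INTO gz_distance VALUES (?, ?, ?)", [
    ("19106", "Philadelphia", 1.2),
    ("19106", "Camden", 4.8),
    ("19106", "Cherry Hill", 9.5),
    ("19106", "Trenton", 29.0),
])

def nearby_places(zip_code, max_miles):
    """Locations within the analyst-chosen radius of a client ZIP."""
    rows = con.execute(
        "SELECT place FROM gz_distance WHERE zip = ? AND miles <= ? "
        "ORDER BY miles",
        (zip_code, max_miles))
    return [r[0] for r in rows]

print(nearby_places("19106", 10))   # dense urban client: tighter radius
print(nearby_places("19106", 30))   # rural client: wider radius picks up more
```

The radius is the analyst-facing knob: the query itself never changes, only the mileage threshold passed in.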

The results were excellent. The analysts praised the new process for being more accurate, less time consuming, and easy to use. There were some manual aspects to the process, for example, correcting spelling errors entered by users on the website, but these would become less of an issue as time went by. (Especially the spelling errors. The website administrators were switching from one vendor data set to another, which had better location suggestions/requirements based on the user's input.) Overall, it was almost completely automated and only required updates once in a while when new locations were added.

This was one of those projects where I really enjoyed the autonomy I was given. I was simply given a task (make this tool work better) and free rein over how to do it. I worked with many people to get their feedback and help, especially the database maintainer and a few users who tested the new inputs on the tool. (One interesting thing I did with the database was to partition the gz_distance table based on distance type. I got help from the database maintainer on the best way to do that.) And best of all, I really enjoyed the project.