Friday, January 8, 2016

Gazetteer Database for Geographic Analysis

A couple of years ago, I had a tricky problem to solve. I inherited a tool that a group of analysts were using to allocate website search activity to clients based on ZIP codes and location names (most commonly cities) near the clients' own locations. The tool used the output of a predictive model of website search activity, along with inputs from the client (including addresses), to configure the search locations that would be allocated to the client.

In addition to setting up relevant geographies based on the client's locations, the tool attempted to collect additional nearby locations likely to be relevant to the client (a "market"). The problem was that it did not find good matches for the cities, towns, and other locations people were using on the website. As a result, the analysts were doing quite a bit of work to correct the output, removing and adding locations by hand. It was very time-consuming, and I had to do something about it.

EDIT: I updated the following paragraph after I remembered how the algorithm originally worked. Initially I wrote that it calculated distances between locations, but it did not.

I reviewed the process and the data used to obtain location names. The algorithm used a simple lookup from ZIP code to location name, usually a city or town; it did not attempt to find nearby location names. The data did include latitude and longitude for the locations, so I thought I'd try adding code to look up nearby locations using those coordinates. I asked around in the software development area and found that they were using an approximate distance calculation based on a spherical model of the Earth. When I tried it out on the existing location data, I found several problems. Some of the latitude/longitude coordinates were in the wrong state or in the middle of nowhere. The data was also missing quite a few relevant locations, such as alternative names for cities and towns, as well as neighborhood names, parks, and a variety of other place names people use in web searches. I discovered it was several years out of date, with no chance it would ever be updated. So I decided the data was simply junk. I had to find a new source.
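(As an illustration of that kind of calculation: a haversine great-circle distance can be written directly in SQL. This is a minimal sketch, assuming a hypothetical locations table with latitude/longitude in degrees; it is not the development team's actual code.)

    -- Approximate great-circle (haversine) distance in miles between pairs
    -- of points, treating the Earth as a sphere of radius ~3959 miles.
    -- RADIANS, SIN, COS, ASIN, SQRT, and POWER exist in most SQL dialects.
    SELECT a.location_id AS from_id,
           b.location_id AS to_id,
           2 * 3959 * ASIN(SQRT(
               POWER(SIN(RADIANS(b.latitude - a.latitude) / 2), 2)
               + COS(RADIANS(a.latitude)) * COS(RADIANS(b.latitude))
               * POWER(SIN(RADIANS(b.longitude - a.longitude) / 2), 2)
           )) AS distance_miles
    FROM locations a
    JOIN locations b ON b.location_id <> a.location_id;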

I began searching online for government sources of location information. After all, the US government establishes ZIP codes and city and town designations, and conducts the census every ten years. The US government is also required by law to release this data publicly. (That doesn't mean it's free, or easy to obtain.) So there had to be publicly available data on locations. Luckily, I ended up finding a free online source: the US Gazetteer Files (see the "Places" and "ZIP Code Tabulation Areas" sections).
What's a "gazetteer"? A gazetteer is a geographic index: a list of information about the locations on a map. In this case, the US Gazetteer data includes latitude and longitude, which is useful for geographic analysis.
As I used the data, I found a few gaps, so I searched again and found the US Board on Geographic Names (see "Populated Places" under "Topical Gazetteers"). By integrating these two data sets, I had a rather comprehensive listing of all sorts of places around the US.

Next, I had to get the new location data working with the search configuration tool. The tool was written with a web front-end for the inputs, SQL to collect the data and apply the inputs, and Excel for the output. So I had to do a bit of ETL (actually ELT, loading before transforming) to get the new location data into the tool. I ended up designing the model pictured here:

The main data is in gz_place and gz_zip, storing location and ZIP code data, respectively. To the right of gz_place are some lookup tables, including a table of alternative names (gz_name_xwalk - "xwalk" meaning crosswalk). The ZIP table references a master list of potential ZIP codes (see the prior post about creating that table), a list of invalid ZIP codes that showed up in the prior location data, and a list of ZIP codes I determined were "inside" other ZIP codes (the algorithm for that is another discussion entirely).

The data on the left is a bit more interesting. There are some metadata tables not really connected to the rest (gz_metadata, gz_source), documenting quick facts about the data and where I found it. Two reference tables also float off on their own: a list of raw location names (gz_name) and a list of states (gz_state_51 - 51 to include DC), each with associated information.
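(To make the model concrete, here's a rough SQL sketch of the core tables. The column names and types are my own illustration for this post, not the exact schema.)

    -- Illustrative sketch only; column names and types are hypothetical.
    CREATE TABLE gz_place (
        place_id   INTEGER PRIMARY KEY,
        state_code CHAR(2)      NOT NULL,   -- references gz_state_51
        place_name VARCHAR(100) NOT NULL,
        latitude   DECIMAL(9,6) NOT NULL,
        longitude  DECIMAL(9,6) NOT NULL
    );

    CREATE TABLE gz_zip (
        zip_code   CHAR(5) PRIMARY KEY,     -- references the master ZIP list
        state_code CHAR(2)      NOT NULL,
        latitude   DECIMAL(9,6) NOT NULL,
        longitude  DECIMAL(9,6) NOT NULL
    );

    -- Alternative names for places ("xwalk" = crosswalk)
    CREATE TABLE gz_name_xwalk (
        place_id INTEGER      NOT NULL REFERENCES gz_place (place_id),
        alt_name VARCHAR(100) NOT NULL
    );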

Now, I didn't want the tool to calculate distances between everything and everything else each time an analyst ran it, so I decided to precompute the distances and store only those within a certain proximity. There were three types of distances required: ZIP to ZIP, location to location, and location to ZIP (which also covers ZIP to location). To limit processing, I used a mapping of states to their neighboring states to restrict which pairs of ZIPs and locations to consider, which helped decrease the run time. I then calculated the distance between each pair of latitude/longitude points and retained only the pairs within a certain number of miles. The final, filtered results are stored in gz_distance, with a lookup table describing the distance types (gz_distance_type).
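(A sketch of the ZIP-to-ZIP precomputation, reusing the haversine formula from above. The state_neighbor table, the distance-type code, and the 100-mile cutoff are all hypothetical stand-ins.)

    -- Precompute ZIP-to-ZIP distances, restricted to pairs in the same or
    -- neighboring states, keeping only pairs within 100 miles.
    -- Assumes state_neighbor pairs each state with itself and its neighbors.
    INSERT INTO gz_distance (from_id, to_id, distance_type_id, distance_miles)
    SELECT from_id, to_id, 1, distance_miles   -- 1 = ZIP-to-ZIP (hypothetical)
    FROM (
        SELECT a.zip_code AS from_id,
               b.zip_code AS to_id,
               2 * 3959 * ASIN(SQRT(
                   POWER(SIN(RADIANS(b.latitude - a.latitude) / 2), 2)
                   + COS(RADIANS(a.latitude)) * COS(RADIANS(b.latitude))
                   * POWER(SIN(RADIANS(b.longitude - a.longitude) / 2), 2)
               )) AS distance_miles
        FROM gz_zip a
        JOIN gz_zip b ON b.zip_code <> a.zip_code
        JOIN state_neighbor n
          ON n.state_code = a.state_code
         AND n.neighbor_code = b.state_code
    ) pairs
    WHERE distance_miles <= 100;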

Finally, I could get the better location data into the tool. I replaced the original code with new code that uses the new location data, doing a simple lookup of the locations specified by the client (ZIP codes) and filtering by an appropriate distance. I also created a few new inputs to let the analyst tweak the distance used to filter the crosswalk, the idea being that clients in rural areas may find a larger area more relevant, while clients in dense urban areas may find a smaller area more relevant.
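(The replacement lookup amounts to something like this, with a hypothetical distance-type name and example values for the client ZIP and the radius input.)

    -- Find place names near a client-specified ZIP code, filtered by the
    -- analyst-adjustable radius.
    SELECT p.place_name,
           d.distance_miles
    FROM gz_distance d
    JOIN gz_distance_type t ON t.distance_type_id = d.distance_type_id
    JOIN gz_place p         ON p.place_id = d.to_id
    WHERE t.type_name = 'zip_to_place'   -- hypothetical type name
      AND d.from_id = '62704'            -- example client ZIP code
      AND d.distance_miles <= 25         -- radius: larger for rural clients
    ORDER BY d.distance_miles;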

The results were excellent. The analysts praised the new process for being more accurate, less time-consuming, and easy to use. There were still some manual aspects, for example, correcting spelling errors entered by users on the website, but these would become less of an issue as time went by. (Especially the spelling errors: the website administrators were switching from one vendor data set to another that had better location suggestions and requirements based on the user's input.) Overall, the process was almost completely automated and only required occasional updates when new locations were added.

This was one of those projects where I really enjoyed the autonomy I was given. I was simply handed a task (make this tool work better) and given free rein over how to do it. I worked with many people to get their feedback and help, especially the database maintainer and a few users who tested the new inputs on the tool. (One interesting thing I did with the database was to partition the gz_distance table based on distance type; I got help from the database maintainer on the best way to do that.) And best of all, I really enjoyed the work.

Friday, June 26, 2015

Evolving Desk

I previously wrote about my slightly unusual computer desk setup. I still use the same keyboard/mouse setup: a trackball mouse on the right, a regular mouse on the left (with a sticky note covering the laser so it doesn't move - it's used for scrolling and clicking), and the keyboard in the middle. I don't use the extra Bluetooth mouse as much, since I've gotten used to being precise with the trackball. (All the mice are still the same Logitech mice.) The USB extension cord is still there, too.

I have, however, upgraded quite a few other aspects of the desk. I upgraded the keyboard to a "tenkeyless tactile touch" keyboard from EliteKeyboards. It's missing the numeric keypad (thus "tenkeyless"), and it has tactile key switches, which give the user more feedback when a key is struck. It feels so much different from the cheap keyboards people usually use.

The advantages of this keyboard are its smaller size, which means less reaching for the mice, and a better typing experience. The only disadvantages are that it was quite expensive (about $100) and that I still need a numeric keypad - just not on the keyboard itself. So I also purchased a standalone keypad that sits on the right side of the monitor table, within reach when it's needed.

I own the same monitor as before, but my new employer provided a monitor with a wider screen, so I use that one. It's nice having the wide screen for videos, but most of the time I don't use that much screen real estate. In fact, I've gotten used to keeping applications non-maximized so I can see other applications at the same time or hiding in the background.

I built my own standing desk out of the old corner computer desk, which worked great for a while, until I got the new job working from home. At that point, I needed to re-evaluate the space requirements, since I needed to be able to sit and write on occasion. I did some research and found an article suggesting a very cheap Ikea standing desk. I no longer had a desk, though (I had ripped the old one apart to make the first standing desk), so I decided to buy a table to place it all on. I ended up getting a table with adjustable A-frame legs; I figured the A-frame would provide greater stability, especially given that I planned on getting a treadmill.

A few final adjustments to the desk: I used a dowel to lift the front of the keyboard shelf for a more natural resting place for the hands, I put some old textbooks under the monitor to raise it to the correct height, and I recovered the old keyboard tray for the sitting position. Now the desk is good for sitting and standing at my work computer. The sitting position only has one mouse, mostly because I didn't want to spring for another $100 keyboard. If you want the nice keyboard, you have to work for it by standing or walking.

You may have noticed the odd device just below the monitor, with a string coming down from it: It's the control for a treadmill by LifeSpan, along with the safety cord that stops it when pulled. I decided I wasn't working out enough, and I thought it would be great if I could use a treadmill while working. That arrived about 3 months ago, and I had to raise up the desk to accommodate its height. Here's the complete setup:

The work laptop is over on the left side of the table, and when it's flipped open, it can be used while sitting. I usually do this when I'm tired or have a meeting to attend (standing or walking at the treadmill is too distracting for all parties while on a call). My home computer (the tower behind the laptop) is only connected to the standing-desk monitor, so I have no choice but to stand at that one. I suppose I could use my old monitor to rig up a sitting arrangement, but I haven't really needed to sit at my home computer. I have a KVM switch to toggle between the computers, and the switch is just under the monitor.

I'm still getting used to standing and/or walking while working, mostly the impact on my body. I have not found it difficult to do most tasks while standing or walking. The only exception is, as I mentioned above, phone calls or meetings: usually I want to take notes, and that's easier while sitting. I'm not used to standing or walking for hours on end, so when I get tired, I sit.

One of my colleagues asked how I was able to walk and work at the same time. It's not too difficult. As proof, here's a short and incredibly dull video of me walking while working:


In the video, I did the following tasks, not exactly in this order:
  • Wrote some code
  • Ran a command on the Linux command line
  • Reviewed the output from the above process
  • Reviewed a file of health-care-related records
  • Read some code
  • Thought about it for a bit
  • Figured out why a file had duplicates
  • Wrote an email
  • Drank some water
I was going 2 miles per hour (the treadmill ranges from 0.1 to 4 miles per hour, in 0.1 mph increments). I find it nice to back up a step while reading or thinking, in order to walk more naturally. I would also recommend using a sports bottle: I made the mistake of using an open-topped cup, which could easily be spilled on any of the computer components or the treadmill.

I would highly recommend this arrangement and all of these products: the mice from Logitech, the keyboard from EliteKeyboards, the assorted desk pieces from Ikea, and the treadmill from LifeSpan. It makes for a good way to work and get a bit of exercise.

Wednesday, May 20, 2015

KeePass2 and Gmail

It's been a while since I posted, but with a good cause: We just had our second child a few weeks ago!

The other day, Google decided to change Gmail's login screen to be in three parts (two for those without two-step authentication): 1. username, 2. password, 3. two-step authentication. This is annoying because I use KeePass2 with auto-type, and the new pages interfere with the auto-typing mechanism.

Today I decided to solve that issue, and I'm posting it here to share with anyone else who may also have the same problem. Here are the steps:

  1. Open up the Gmail or Google Apps email account you want to change ("Edit/View Entry" on the context menu, or hit Enter when the entry is selected).
  2. Go to the Auto-Type tab.
  3. Select the "Override default sequence" option.
  4. The field should initially contain (without quotes):

         {USERNAME}{TAB}{PASSWORD}{ENTER}

     Replace that with (without quotes):

         {USERNAME}{ENTER}{DELAY 2000}{PASSWORD}{ENTER}

  5. Test in your favorite browser. Adjust the "{DELAY mmmm}" parameter by replacing "mmmm" with a different numeric value. This is in milliseconds, so a value of 1000 is 1 second.

The "{DELAY 2000}" parameter is the key to fixing the issue. Since there is a new page in between the username and password, we need a delay to let the page load before typing the password.

For more information on the auto-type feature and the parameters that can be used, see the KeePass documentation here:

Don't forget to tag your Gmail account as an OpenSSL account (add "OpenSSL" to the description) while you're at it!

Tuesday, November 25, 2014

Strange SAS Error Message

I spent far too long trying to debug a strange error message in SAS. The solution ended up being aggravatingly simple, but arriving at the solution was not. So I wanted to share the problem and solution. (I discovered this issue using SAS 9.1.3.)

MPRINT(TRANSPOSE): proc datasets nolist;
SYMBOLGEN: Macro variable OUT resolves to lib.table_name
NOTE: Line generated by the invoked macro "TRANSPOSE".
161 modify &OUT; label &NEWVAR = "&NEWLBL"; quit;

MPRINT(TRANSPOSE): modify lib.table_name;
NOTE: Line generated by the macro variable "OUT".
161 lib.table_name
NOTE: Enter RUN; to continue or QUIT; to end the procedure.

SYMBOLGEN: Macro variable NEWVAR resolves to count
SYMBOLGEN: Macro variable NEWLBL resolves to count: prevalence
MPRINT(TRANSPOSE): label count = "count: prevalence";

ERROR 200-322: The symbol is not recognized and will be ignored.
ERROR 22-322: Expecting a name.

NOTE: Statements not processed because of errors noted above.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: SAS set option OBS=0 and will continue to check statements. This may cause NOTE: No observations in data set.
NOTE: PROCEDURE DATASETS used (Total process time):
      real time 0.00 seconds
      cpu time 0.00 seconds

If you start reading where the error messages appear and look downward, you'll miss the true source of the error, which is further up the log.

For those not familiar with PROC DATASETS, one of the options on the PROC DATASETS statement is LIBRARY=, and it defaults to WORK. So this PROC DATASETS step is looking only in WORK for the data set. Since MODIFY expects just a member name, with no libref, "lib.table_name" doesn't parse correctly, resulting in an error.

The error seems to trip the SAS parser into reading the rest of the code incorrectly, which makes it look as though the following step is where the error occurs. The real culprit, however, is the earlier statement ("modify &OUT", or as resolved, "modify lib.table_name").

The lesson is twofold: 1) don't use a two-level name (e.g., "lib.table_name") with MODIFY when the LIBRARY= option on PROC DATASETS is not set, and 2) don't write a macro that accepts an output data set (e.g., OUT=) and uses PROC DATASETS without checking whether the user specified an output library. Alternatively, don't use PROC DATASETS at all. (There are some advantages to PROC DATASETS, though, so don't throw the baby out with the bathwater!)
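To make that concrete, here's roughly what the failing step and a corrected version look like, using the data set and label from the log above (a sketch, not the macro's actual code):

    /* Fails: PROC DATASETS defaults to LIBRARY=WORK, and MODIFY expects
       a one-level member name, so "lib.table_name" will not parse */
    proc datasets nolist;
        modify lib.table_name;
        label count = "count: prevalence";
    quit;

    /* Works: pass the libref via LIBRARY= and use the bare member name */
    proc datasets library=lib nolist;
        modify table_name;
        label count = "count: prevalence";
    quit;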

I wasted a lot of time on this because I thought the macro could handle any type of input/output specification, but apparently that is not true. The solution for me was to remove the "lib." from the OUT= argument to the macro ("&OUT"). Since PROC DATASETS was looking in WORK, it found the data set after I made this "correction".

Thursday, May 22, 2014

A Great Quote and Idea

I was browsing through Quora today and found this quote, on the topic of "What are the top 10 things that we should be informed about in life?":

"Bear in mind... that your opponent in any debate is not the other person, but ignorance."
– Justin Freeman, Source

What a great quote. It embodies many lessons all in one. Don't focus on the person you're arguing with. In fact, don't argue. Persuade them away from ignorance. Brilliant quote.