Wednesday, September 13, 2017

Tough Year

It's been a tough year. I covered the beginning of the year in another post about the Calculus course I took in preparation for applying to a master's program in applied statistics. I had planned on studying for the GRE this summer, taking it this fall, applying to the master's program, and then working through the program over the next 3-5 years. My employer, HDMS (a subsidiary of Aetna), was going to help pay for the degree. However, those plans were thrown off course.

I had just finished submitting my coursework for reimbursement when I was sent an out-of-place meeting request. At the meeting, I found out HDMS was letting me go. At first it appeared to be just me, but as I found out later in the day, they were laying off about 10 other employees and closing about 10 open positions. For two hours, I tried to figure out what I had done wrong - but it had nothing to do with my performance. I was the most recent hire on the team, and there were others being laid off too.

I hit the ground running. The severance package included a career consulting service - I looked it up and scheduled time to review my resume with a consultant. The paper / PDF version has undergone quite a few revisions over the past few weeks. I also started contacting my network, browsing through online job boards, and all the usual job-hunting tasks. I did find quite a few roles through my network, and a couple of them resulted in offers.

I quickly found a role as a consultant data analyst with Great Wolf Resorts. I stayed there for a few weeks, working on a single project integrating credit card transactions with reservations. Essentially, if a guest uses a credit card for something on site (e.g., a restaurant) and does not charge it back to the room, the transaction is not connected to the reservation. To connect the two, I had to merge transactions with reservations based on guest name and, if available, the last four digits of the credit card number. It was messy, and I was able to get about 57% of the transactions matched. I believe the best possible rate was somewhere around 65%, but reaching it would have required a lot of exception handling, manual matching, and/or time-intensive matching processes (e.g., matching text within another text field). The company analyst and I decided the additional matches weren't worth the expense.
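
For the curious, the core of the matching logic looked something like the sketch below. This is a minimal reconstruction, not the production code, and the field names (guest_name, card_last4) are hypothetical:

    import re

    def normalize_name(name):
        # Uppercase and split into tokens so "Smith, John" matches "JOHN SMITH".
        return frozenset(re.findall(r"[A-Z]+", name.upper()))

    def is_match(transaction, reservation):
        # Names must agree after normalization.
        if normalize_name(transaction["guest_name"]) != \
           normalize_name(reservation["guest_name"]):
            return False
        # If both records carry the last four card digits, those must agree too.
        t4 = transaction.get("card_last4")
        r4 = reservation.get("card_last4")
        if t4 and r4:
            return t4 == r4
        return True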

The position was a good fit for my skills, and I enjoyed working with the people there, but because it was a consulting role, the benefits were very expensive and of course it could have ended at any time. So, I kept looking for permanent roles while I was there. I need something more permanent right now, but I can definitely see myself as a successful consultant. In my short time there, I think I demonstrated a lot of value with my skills and with the process and analysis I left behind.

Recently, I found a new role as a Senior Healthcare Analyst at SSM Health, a non-profit healthcare organization with hospitals from Wisconsin to Missouri. They also own Dean Health Plan, where I worked a few years ago. I still know a few people there, so it will be good to reconnect with them. I'll be analyzing healthcare data for a particular region of the system, starting in a few days. I feel very good about the team and the leader, so I'm looking forward to getting started. Luckily, I'll be working from home again, so I'll get to use my treadmill again.

Here's hoping quarter four is quite a bit less turbulent!

Tuesday, August 1, 2017

Calculus III

In the last year or so, I decided to apply for a master's program in applied statistics, but I was missing one of the prerequisite mathematics courses: Calculus III. I had taken calculus courses in high school and college, but those courses were more focused on applications. Furthermore, I hadn't covered any of Calculus II in those courses.

Instead of taking the entire series, which would have taken quite a bit of time and money, I decided to do something rather daunting: I took Calculus III online and used Khan Academy and other sources to catch up on Calculus I and II. Reviews warned that the first few weeks were tough even for students who had taken Calculus I and II immediately before III. Undeterred, I started the course earlier this year, and the first few weeks were indeed tough.

The online program I used - NetMath - used an online math tool for running code and submitting homework. Each student is assigned to a mentor who grades assignments, answers questions, and ensures each student is on schedule. Students receive feedback on their homework and are able to re-submit corrections a couple of times. The two midterms and final must be taken in person with a proctor.

My first mentor was not very responsive. In week 2, a critical week in the program, my mentor did not respond to emails or grade my assignments in a timely fashion (within 3 days, as specified in the program handbook). I notified the program administrators, and they assigned me a new mentor. She had quite a bit of catching up to do, but she did her best, eventually grading the outstanding assignments and answering my questions. Honestly, she was amazing, and I'd write her a letter of recommendation if she asked.

Lesson 2 was quite difficult. It was really the first lesson on the actual topic of the course; lesson 1 was a review of parametric equations and other prerequisite concepts, included in case anyone had missed or forgotten them. Between the difficult content and the slow responses from my mentor, it took me 2 weeks to finish lesson 2. In addition, I got sick for a couple of days and there was a death in the family, which put me behind another 2 weeks or so. Fortunately, the program offers a two-month extension, and I planned on using it if needed.

However, there were additional, serious problems with the course. One of the most grievous was incomplete or incorrect content. Ideas were often presented without their standard names, leaving students unable to communicate the concepts effectively or look them up elsewhere. For example, vector projection was just called "vector push on another vector". It took me quite some time to find the right term so I could research the concept online.
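
For reference, the concept the course was describing is the standard vector projection: the projection of a onto b is (a . b / b . b) b. A quick sketch in Python with NumPy:

    import numpy as np

    def project(a, b):
        # Vector projection of a onto b: (a . b / b . b) * b
        return (np.dot(a, b) / np.dot(b, b)) * b

    a = np.array([3.0, 4.0])
    b = np.array([1.0, 0.0])
    print(project(a, b))  # [3. 0.] - the component of a along b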

The course also neglected saddle points, claiming that whenever a gradient was {0, 0} (or more zeros, depending on the number of dimensions), the point was a minimum or maximum of the function. This is blatantly not true when a saddle point is present, and it would be terrible for students to internalize this falsity, since the distinction is profoundly important when fitting predictive models with machine learning, specifically neural networks. You can't assume you've optimized a function when the gradient is {0, 0} without examining the neighborhood to see whether you've found a saddle point.
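
The classic counterexample is f(x, y) = x^2 - y^2: its gradient is zero at the origin, yet the origin is neither a minimum nor a maximum. A quick numerical check (the Hessian below is the exact one at the origin):

    import numpy as np

    def f(x, y):
        return x**2 - y**2

    # Central-difference gradient at the origin: both components are ~0.
    h = 1e-6
    gx = (f(h, 0) - f(-h, 0)) / (2 * h)
    gy = (f(0, h) - f(0, -h)) / (2 * h)
    print(gx, gy)  # ~0.0 ~0.0

    # The Hessian at the origin is [[2, 0], [0, -2]]. Its eigenvalues have
    # mixed signs, so the second-derivative test says "saddle", not optimum.
    hessian = np.array([[2.0, 0.0], [0.0, -2.0]])
    print(np.linalg.eigvals(hessian))  # [ 2. -2.]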

All told, I was quite unhappy with this course. Not only was I spending a lot of time catching up, but I was also spending time learning the material through other sources, since the course material was incomplete or inaccurate. Nearing the end, I had caught up to the point where it looked like I could finish if I just had another week or two. I emailed the administrators for a course extension, noting the reasons for my delay and the issues I had with the course, and they offered a shorter extension so I could finish without rushing and without taking the full two months. (Finishing on the original schedule would also have been very demanding for my mentor, who would have had to complete all the grading in time.)

Despite all the delays and issues, I finished all the material within the original time frame, and I just needed to take the final. I studied for a few extra days and took the final about 2 weeks after the course originally ended. Since the final was comprehensive and included the last three lessons, I was very nervous about it. I had aced the midterms, but there was just so much to remember (the formula for the curl of a 3D field, for example). To my great surprise and delight, I not only aced the final, I got an A+ in the course. I was relieved!

Now I just have to re-take the GRE and apply for the master's program.

Friday, January 8, 2016

Gazetteer Database for Geographic Analysis

A couple of years ago, I had a tricky problem to solve. I inherited a tool a group of analysts were using to allocate website search activity to clients based on ZIP code and location name (most commonly a city), according to each client's own locations. The tool used the output of a predictive model of website search activity, along with inputs from the client (including addresses), to configure the search locations that would be allocated to the client.

In addition to setting up relevant geographies based on the client's locations, the tool attempted to collect additional nearby locations that were likely relevant to the client (a "market"). The problem was that it did not find good matches for the cities, towns, and other locations people were using on the website. As a result, the analysts were doing quite a bit of work to correct the output, removing and adding locations by hand. It was very time consuming, and I had to do something about it.

EDIT: I updated the following paragraph after I remembered how the algorithm was originally working. Initially I wrote that it calculated distances between locations, but it did not.

I reviewed the process and the data used to obtain location names. The algorithm used a simple lookup from ZIP code to location name, usually a city or town; it did not attempt to look up nearby location names. The data did include latitude and longitude for the locations, so I thought I'd try adding code to look up nearby locations using this data. I asked around among the software developers and found that they were using a fuzzy distance calculation based on a globe. When I tried it out on the existing location data, I found several problems. Some of the latitude/longitude coordinates were in the wrong state or in the middle of nowhere. Additionally, the data was missing quite a few relevant locations, like alternative names for cities and towns, as well as neighborhood names, parks, and a variety of other place names people use in web searches. I discovered it was several years out of date, and there was no chance it would be updated. So I decided the data was simply junk. I had to find a new source.
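
I don't recall the exact formula the developers were using, but a "fuzzy distance calculation based on a globe" is usually something like the haversine formula, which gives the great-circle distance between two latitude/longitude points. A minimal sketch:

    import math

    EARTH_RADIUS_MILES = 3959.0  # mean Earth radius

    def haversine_miles(lat1, lon1, lat2, lon2):
        # Great-circle distance between two lat/lon points, in miles.
        phi1, phi2 = math.radians(lat1), math.radians(lat2)
        dphi = math.radians(lat2 - lat1)
        dlam = math.radians(lon2 - lon1)
        a = (math.sin(dphi / 2) ** 2
             + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
        return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

    # Chicago to Milwaukee: roughly 80 miles
    print(haversine_miles(41.88, -87.63, 43.04, -87.91))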

I began searching online for government sources of location information. After all, the US government establishes ZIP codes and city and town designations, and conducts the census every so often. The US government also has to release this data publicly, according to law. (This doesn't mean it's free, or easy to obtain.) So there must be publicly-available data regarding locations. Luckily, I ended up finding a free online source: the US Gazetteer Files (see "Places" and "ZIP Code Tabulation Areas" sections).

What's a "gazetteer"? A gazetteer is a list of information about the locations on a map. In this case, the US Gazetteer data includes latitude and longitude, useful for geographic analysis.

As I used the data, I found a few gaps, so I searched again and found the US Board on Geographic Names (see "Populated Places" under "Topical Gazetteers"). By integrating these two data sets, I had a rather comprehensive listing of all sorts of places around the US.

Next, I had to get the new location data working with the search configuration tool. The tool was written with a web front-end for the inputs, SQL to collect the data and apply the inputs, and Excel as the output data. So I had to do a bit of ETL (actually, I did ELT, loading before transforming) to get the new location data working with the tool. I ended up designing the model pictured here:



The main data is in gz_place and gz_zip, storing location and ZIP code data, respectively. On the right of gz_place are some lookup tables, including a table of alternative names (gz_name_xwalk - "xwalk" meaning crosswalk). The ZIP table references a master list of potential ZIP codes (see the prior post about creating that table), a list of invalid ZIP codes that showed up in the prior location data, and a list of ZIP codes I determined were "inside" other ZIP codes (the algorithm for that is another discussion entirely).

The data on the left is a bit more interesting. There are some metadata tables not really connected to the rest (gz_metadata, gz_source), documenting quick facts about the data and where I found the data. Two reference tables also float off on their own, with a list of raw location names (gz_name) and a list of states (gz_state_51 - 51 to include DC), each including associated information.

Now, I didn't want the tool to calculate distances between everything and everything else each time an analyst ran it, so I decided to precompute the distances and store only those within a certain proximity. I decided there were 3 types of distances required: ZIP to ZIP, location to location, and location to ZIP (which also covers ZIP to location). To limit processing, I used a mapping of states to their neighboring states to restrict the initial set of ZIPs and locations to compare, which helped decrease the run time. I calculated the distance between each remaining pair of latitude/longitude points and retained only those within a certain number of miles. The final, filtered results are stored in gz_distance, with a lookup table describing the distance types (gz_distance_type).
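
A rough sketch of that precomputation, reusing the haversine_miles function from the earlier sketch. The structures here are hypothetical: places maps each state to a list of (id, lat, lon) rows, neighbors maps each state to itself plus its adjacent states, and the cutoff is purely illustrative:

    MAX_MILES = 50.0  # illustrative cutoff; the real threshold was tuned

    def nearby_pairs(places, neighbors):
        # Compute pairwise distances, limited to same-or-neighboring states,
        # keeping only pairs within MAX_MILES (destined for gz_distance).
        pairs = []
        for state, rows in places.items():
            candidates = [p for s in neighbors[state] for p in places.get(s, [])]
            for id1, lat1, lon1 in rows:
                for id2, lat2, lon2 in candidates:
                    if id1 >= id2:
                        continue  # skip self-pairs and mirrored duplicates
                    d = haversine_miles(lat1, lon1, lat2, lon2)
                    if d <= MAX_MILES:
                        pairs.append((id1, id2, d))
        return pairs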

Finally, I could get the better location data into the tool. I replaced the original code with new code that uses the new location data, doing a simple lookup of the locations specified by the client (ZIP codes) and filtering by distance. I added a few new inputs to let the analyst tweak the distance used to filter the crosswalk, the idea being that clients in rural areas may find a larger area more relevant, while clients in dense urban areas may find a smaller area more relevant.
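
The lookup itself reduces to something like the following, again with hypothetical names (gz_distance rows represented as (zip_code, place_id, miles) tuples, and radius_miles coming from the new analyst inputs):

    def market_places(client_zips, gz_distance_rows, radius_miles):
        # All places within the chosen radius of any of the client's ZIPs.
        zips = set(client_zips)
        return {
            place_id
            for zip_code, place_id, miles in gz_distance_rows
            if zip_code in zips and miles <= radius_miles
        }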

The results were excellent. The analysts praised the new process for being more accurate, less time consuming, and easy to use. There were some manual aspects to the process, for example, correcting spelling errors entered by users on the website, but these would become less of an issue as time went by. (Especially the spelling errors. The website administrators were switching from one vendor data set to another, which had better location suggestions/requirements based on the user's input.) Overall, it was almost completely automated and only required updates once in a while when new locations were added.

This was one of those projects where I really enjoyed the autonomy I was given. I was simply given a task (make this tool work better) and free rein over how to do it. I worked with many people to get their feedback and help, especially the database maintainer and a few users who tested the new inputs on the tool. (One interesting thing I did with the database was to partition the gz_distance table based on distance type. I got help from the database maintainer on the best way to do that.) And best of all, I really enjoyed the project.

Friday, June 26, 2015

Evolving Desk

I previously wrote about my slightly unusual computer desk setup. I still use the same keyboard/mouse setup: a trackball mouse on the right, a regular mouse on the left (with a sticky note covering the laser so it doesn't move - it's used for scrolling and clicking), and the keyboard in the middle. I don't use the extra bluetooth mouse as much, since I've gotten used to being precise with the trackball mouse. (All the mice are still the same Logitech mice.) The USB extension cord is still there, too.



I have, however, upgraded quite a few other aspects of the desk. I upgraded the keyboard to a "tenkeyless tactile touch" keyboard from EliteKeyboards. It is missing the numeric pad (thus "tenkeyless"), and it has special switches - called tactile because they provide more feedback when a key is struck. It really feels much different from the cheap keyboards people usually use.

The advantages of this keyboard include a smaller size, so I'm reaching less for the mice, and a better typing experience. The only disadvantages are that the keyboard was quite expensive (about $100) and I still need a numeric keypad - just not on the keyboard itself. So I also purchased a separate keypad that sits to the right of the monitor table, within reaching distance when needed.

I own the same monitor as before, but my new employer provided a monitor with a wider screen, so I use that one. It's nice having the wide screen for videos, but most of the time I don't use that much screen real estate. In fact, I have gotten used to keeping applications non-maximized so I can see several at the same time or leave some in the background.

I built my own standing desk out of the old corner computer desk, which worked great for a while, until I got the new job working from home. At that point, I needed to re-evaluate the space requirements: I needed to be able to sit and write on occasion. I did some research and found an article suggesting a very cheap Ikea standing desk. I no longer had a desk, though (since I had ripped the old one apart to make the first standing desk), so I decided to buy a table to place it all on. I ended up getting a table with adjustable A-frame legs. I figured the A-frame would provide greater stability, especially given that I planned on getting a treadmill.

A few final adjustments to the desk: I used a dowel to lift the front of the keyboard shelf for a more natural resting place for the hands, I put some old textbooks under the monitor to raise it to the correct height, and I brought back the old keyboard tray for the sitting position. Now the desk is good for sitting and standing at my work computer. The sitting option only has one mouse, mostly because I didn't want to spring for another $100 keyboard. If you want the nice keyboard, you have to work for it by standing or walking.

You may have noticed the odd device just below the monitor, with a string coming down from it: it's the control for a treadmill by LifeSpan, along with the safety cord that stops the treadmill when pulled. I decided I wasn't working out enough, and I thought it would be great if I could use a treadmill while working. It arrived about 3 months ago, and I had to raise the desk to accommodate its height. Here's the complete setup:



The work laptop is over on the left side of the table, and when it is flipped open, it can be used while sitting. I usually do this when I'm tired or I have a meeting to attend (standing or walking at the treadmill is too distracting for all parties while on a call). My home computer (the tower behind the laptop) is only connected to the standing desk monitor, so I have no choice but to stand at that one. I suppose I could use my old monitor to figure out a sitting arrangement, but I haven't really needed to sit at my home computer. I have a KVM switch to toggle between the computers, and the switch is just under the monitor.

I'm still getting used to standing and/or walking while working, mostly the impact on my body. I have not found it difficult to do most tasks while standing or walking. The only exception is, as I mentioned above, phone calls or meetings. Usually I want to take notes, and that's easier while sitting. I'm not used to standing or walking for hours on end, so when I get tired, I sit.

One of my colleagues asked how I was able to walk and work at the same time. It's not too difficult. As proof, here's a short and incredibly dull video of me walking while working:

[Video: From Chris's Album]

In the video, I did the following tasks, not exactly in this order:
  • Wrote some code
  • Ran a command on the Linux command line
  • Reviewed the output from the above process
  • Reviewed a file of health-care-related records
  • Read some code
  • Thought about it for a bit
  • Figured out why a file had duplicates
  • Wrote an email
  • Drank some water
I was going 2 miles per hour (the treadmill ranges from 0.1 to 4 miles per hour, in 0.1 mph increments). I find it nice to step back a pace while reading or thinking, in order to walk more naturally. I would also recommend using a sports bottle: I made the mistake of using an open-topped cup, which could easily be spilled on the computer components or the treadmill.

I would highly recommend this arrangement and all of these products: the mice from Logitech, the keyboard from EliteKeyboards, the assorted desk stuff from Ikea, and the treadmill from LifeSpan. It makes for a good way to work and get a bit of exercise.

Wednesday, May 20, 2015

KeePass2 and Gmail

It's been a while since I posted, but with a good cause: We just had our second child a few weeks ago!

The other day, Google decided to change Gmail's login screen to be in three parts (two for those without two-step authentication): 1. Username 2. Password 3. Two-Step Authentication. This is annoying because I use KeePass2 with auto-type, and the new pages interfere with the auto-typing mechanism.

Today I decided to solve that issue, and I'm posting it here to share with anyone else who may also have the same problem. Here are the steps:

  1. Open the entry for the Gmail or Google Apps email account you want to change ("Edit/View Entry" on the context menu, or hit Enter when the entry is selected).
  2. Go to the Auto-Type tab.
  3. Select the "Override default sequence" option.
  4. The field should initially contain (without quotes):

    "{USERNAME}{TAB}{PASSWORD}{ENTER}"

    Replace that with (without quotes):

    "{USERNAME}{TAB}{ENTER}{DELAY 2000}{PASSWORD}{TAB}{ENTER}"

  5. Test in your favorite browser. Adjust the "{DELAY mmmm}" parameter by replacing "mmmm" with a different numeric value. This is in milliseconds, so a value of 1000 is 1 second.

The "{DELAY 2000}" parameter is the key to fixing the issue. Since there is a new page in between the username and password, we need a delay to let the page load before typing the password.

For more information on the auto-type feature and the parameters that can be used, see the KeePass documentation here:

http://keepass.info/help/base/autotype.html

Don't forget to tag your Gmail account as an OpenSSL account (add "OpenSSL" to the description) while you're at it!