Thoughts from Both Sides of the Brain

A COVID-19 Modeling Story

2023-06-20T19:25:00.003-05:00

This is a retrospective post that I wrote some time ago, but never finalized.

Early in 2020, I was working with a healthcare organization that owned hospitals, clinics, and a health insurance company. I was asked by a colleague whether I thought this new disease coming out of China, called COVID-19, was a concern. At the time, I said that the flu season appeared to be more of a concern since it had been rather bad. Supporting this point of view and providing a source of personal bias, I had a very bad upper respiratory infection in Dec 2019 that lasted for weeks. Like many people, especially in the United States, I was very wrong about COVID-19.

Once it became apparent that the spread of the virus SARS-CoV-2 was out of control in the US, I was pulled into a small team dedicated to providing the organization with predictive modeling of the COVID-19 hospital admissions, ICU beds, and ventilator usage. The team consisted of a statistician (PhD), an epidemiologist (PhD), an operations researcher / industrial engineer (PhD and masters, respectively), and me (bachelor's degree in psychology). The statistician was the only individual in the entire organization with the title "Data Scientist". My title at the time was "Senior Healthcare Analyst", and I barely had any qualifications to be included on this team. If I guessed as to why I was selected, I would say: 1) I'm a creative problem solver and pretty good at automation; 2) I am familiar with a variety of technologies, data sources, and analytical methods (including a bit of statistics and machine learning); and 3) there wasn't anyone else who could help.

We quickly started evaluating different methods of estimating COVID-19 hospital admissions. The leaders brainstormed with other organizations in the region and discovered a few tools to evaluate. The first was a SIR model, which breaks up populations into three groups: Susceptible (S), Infected (I), and Removed (R). (Often R is "Recovered", but death is also a possibility; thus, "Removed" is more appropriate, although people could get reinfected.) This model shifts people between the three categories and assumes an upward trend, a single peak, then a downward trend. Graphs based on these models resemble bell curves.

In order to tell this story, I have masked the data and simply labeled the volumes according to their subjective magnitude. The models all project hospital "census", meaning the number of beds that are in use each day, for COVID-19 patients. Here is the projection from Model 1.

Model 1. SIR Model with Admissions, ICUs, and Ventilators

Model 1 foretold absolute disaster. The magnitude was many times the available capacity. In essence, it did not matter what the y-axis displayed: it was so large, no one would be able to handle the volume of hospital admissions. Model 1 assumed no interventions from anyone and was based on initial measurements of the reproductive number at about 3.2. The reproductive number can be interpreted as the number of people who become infected for every 1 person who already is infected. The first value of this number is the "basic reproductive number", R-naught, or R0, whereas subsequent numbers are called the "effective reproductive number", Re, or Rt.
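
As a rough illustration of the mechanics (not the tool we evaluated), a basic SIR model can be sketched in a few lines of Python. The population size, the 10-day infectious period, and the one-day time step are assumptions chosen only for the example; the reproductive number of 3.2 is the figure mentioned above.

```python
def simulate_sir(population, initial_infected, r0, infectious_days, days):
    """Simulate a basic discrete-time SIR model with one-day steps.

    Returns a list of (susceptible, infected, removed) tuples, one per day.
    """
    gamma = 1.0 / infectious_days   # daily removal rate
    beta = r0 * gamma               # daily transmission rate implied by R0
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = min(beta * s * i / population, s)
        new_removals = gamma * i
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


# Illustrative values only: 1,000,000 people, 10 initial cases, R0 = 3.2,
# and an assumed 10-day infectious period with no interventions.
history = simulate_sir(1_000_000, 10, 3.2, 10, 365)
infected = [i for _, i, _ in history]
peak_day = infected.index(max(infected))
print(f"Peak on day {peak_day} with {max(infected):,.0f} people infected")
```

The infected count rises, peaks once, and then falls, which is the single-peak shape described above.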

Quickly, interventions began at many levels: national, state, and county. As a result, the initial model was no longer accurate. At this time, we still did not have actual admission data to compare with the models to assess accuracy, so we were still "flying blind".

Model 2, based on a logistic regression model fit to the Italian data, was much less dire. However, the model stops making sense beyond a short horizon, because this family of functions, called sigmoid functions, is monotonic: they never decrease (or never increase, if you flip them around). So the model was only good for the next couple of weeks. Still, it gave us a more reasonable number to work with. We still didn't have actual data, but we did have bed capacity information.
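
To see why a sigmoid cannot describe a decline, here is a small sketch that assumes the curve has the standard logistic form. The plateau, growth rate, and midpoint are hypothetical values, not the parameters fit to the Italian data.

```python
import math


def logistic(t, capacity, growth_rate, midpoint):
    """Logistic (sigmoid) curve: rises toward `capacity` and never decreases."""
    return capacity / (1.0 + math.exp(-growth_rate * (t - midpoint)))


# Hypothetical parameters: plateau of 500 occupied beds, midpoint at day 30.
census = [logistic(day, 500, 0.2, 30) for day in range(0, 91)]
is_monotonic = all(a <= b for a, b in zip(census, census[1:]))
print(is_monotonic)        # True: the curve flattens but never turns down
print(round(census[30]))   # 250: half of the plateau at the midpoint
```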

Model 2. Logistic Model with Bed Capacity

Model 3, attempting to solve the problems with Model 2, was based on new studies from New York and other areas of the world, and was fit with a polynomial regression. Again, not a great model, but it did provide a more palatable near- and far-future state. The peaks of both Model 2 and Model 3 were very similar. We continued to compare to bed capacity.

Model 3. Polynomial Model with Bed Capacity

Model 4 was developed at about the same time as receiving actual hospital admissions data. Within 24 hours, I worked closely with an electronic medical record subject matter expert ("EMR SME", I suppose) and collected admissions, ICU stays, and ventilator usage for COVID-19 patients. Finally, we had actual data to compare with our models.

Model 4. New SIR Model with Actual Hospitalizations

This model was another SIR model, based on a publicly available algorithm; however, unlike other SIR models, it incorporated changes in interventions by adjusting the reproductive number over time. Instead of assuming the number of cases would always increase at a flat rate, it spread the cases out based on the observed spread. Just like the prior SIR model, it suffered from modeling only one peak and monotonically descending from that peak. That is, it never increased again.
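
A minimal sketch of the idea, not the publicly available algorithm itself: the transmission rate is recomputed each day from a schedule of reproductive numbers. The schedule, population, and 10-day infectious period are all hypothetical.

```python
def simulate_sir_varying(population, initial_infected, r_by_day, infectious_days):
    """SIR model whose reproductive number changes from day to day.

    `r_by_day` holds one reproductive number per simulated day. Returns the
    number of infected people at the start and after each day.
    """
    gamma = 1.0 / infectious_days
    s = float(population - initial_infected)
    i = float(initial_infected)
    infected = [i]
    for r_today in r_by_day:
        beta = r_today * gamma          # transmission rate for this day only
        new_infections = min(beta * s * i / population, s)
        new_removals = gamma * i
        s -= new_infections
        i += new_infections - new_removals
        infected.append(i)
    return infected


# Hypothetical schedule: 3.2 for 30 days, then interventions reduce it to 0.9.
schedule = [3.2] * 30 + [0.9] * 60
infected = simulate_sir_varying(1_000_000, 10, schedule, 10)
print(infected.index(max(infected)))  # 30: cases fall once the number drops below 1
```

With a schedule that only ever steps down, the curve still has one peak, which matches the limitation noted above.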

This worked very well for a few months.

As the actual data and other public data sources demonstrated, cases can increase and decrease chaotically over time, based on the severity of interventions and whether people adhere to those interventions. It was possible that SARS-CoV-2 mutated, maybe a few times, and became more infectious as a result. In some regions we observed a "third peak", sometimes higher than the first, and at the time it wasn't done increasing, either.

Model 5 attempted to overcome the issues with SIR models, allowing for chaotic increases and decreases while still incorporating changes in the reproductive number. So far in this discussion, I have not explicitly named the sources on which the models were based, for a variety of reasons. This model, though, is based on the LEMMA tool (Local Epidemic Modeling for Management & Action). Initially, this tool was built in R.

Model 5. LEMMA (R)

The LEMMA tool was "designed to provide regional... projections of the SARS-CoV-2 (COVID-19) epidemic under various scenarios" (from the tool's website). The model provided inputs for dates, population, observed cases (admissions, ICUs, and deaths), various parameters for PUI (patients under investigation) estimates, and more. It was incredibly flexible and, at the time, well-received.

At a certain point, the LEMMA developers decided to switch the foundation of the algorithm from the R-based models to Stan, a probabilistic programming platform written in C++ that is used for Bayesian statistics. As noted on the LEMMA website, "[The LEMMA] Stan implementation is based on the 'Santa Cruz County COVID-19 Model' (https://github.com/jpmattern/seir-covid19) by Jann Paul Mattern (UC Santa Cruz) and Mikala Caton (Santa Cruz County Health Services Agency)".

This model was a significant improvement, providing more intervention dates and input parameters. Model 6 was the last model, based on the LEMMA Stan package:

Model 6. LEMMA (Stan)

In this variant of the model, we were able to extract the percentiles for the simulations and treat them as upper and lower estimates of the main prediction, specifically using the 95th and 5th percentiles, respectively.
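
The percentile step can be sketched as follows. The simulated trajectories here are synthetic stand-ins generated with random noise; they are not LEMMA output, and the structure of the real output is not shown in this post.

```python
import random
import statistics

random.seed(0)

# Stand-in for model output: 200 simulated census trajectories over 14 days.
# These numbers are synthetic and unrelated to LEMMA or any real data.
simulations = [
    [max(0.0, 100 + 5 * day + random.gauss(0, 15)) for day in range(14)]
    for _ in range(200)
]

bands = []
for day in range(14):
    values = [sim[day] for sim in simulations]
    # With n=20, the cut points fall at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20)
    bands.append({
        "day": day,
        "lower": cuts[0],                     # 5th percentile
        "median": statistics.median(values),
        "upper": cuts[-1],                    # 95th percentile
    })

for band in bands[:3]:
    print(band["day"], round(band["lower"]), round(band["median"]), round(band["upper"]))
```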

Note in the graphs, when comparing Model 5 and 6, that the two "bumps" in the early pandemic stages are more accurately modeled by Model 6. The first LEMMA model almost appeared as if there were two SIR models hiding underneath, with the maximum predicted values taken from each. Model 5 monotonically descended from the first peak until a specific point where it changed direction. Model 6, on the other hand, had many changes of direction, leading to a more trusted result.

We ran this model once a week. Interventions, social behaviors, and the reproductive numbers changed rapidly within each of the geographic regions where we projected COVID-19 admissions, and as a result, we needed to continuously recalculate the model based on the latest information.

Initially in Model 6, we guessed at the impact of various interventions in each region. At the same time, we were calculating the reproductive number in a separate process and reporting on it internally. My idea was to merge these two analyses in order to calculate the LEMMA model "intervention" percentage based on the actual observed reproductive number. I calculated the percent change from week to week and used that figure instead of guessing at the impact of various changes.
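
The week-to-week calculation is simple arithmetic. A sketch with hypothetical weekly estimates of the reproductive number (how the resulting percentages were passed to LEMMA is not shown here):

```python
def weekly_percent_change(rt_by_week):
    """Percent change in the reproductive number from each week to the next."""
    return [
        (current - previous) / previous * 100.0
        for previous, current in zip(rt_by_week, rt_by_week[1:])
    ]


# Hypothetical weekly estimates of the effective reproductive number.
rt_by_week = [2.0, 1.5, 1.2, 1.32]
changes = weekly_percent_change(rt_by_week)
print([round(c, 1) for c in changes])  # [-25.0, -20.0, 10.0]
```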

It wasn't perfect, but it worked rather well. I left the organization, and I heard from my former colleagues that the model continued to be used, and continued to take longer and longer to run. I suggested a few times that they simply cut off the model at a certain date and restart it, then merge the two results. Fortunately, by the time of this publication, it is no longer being used at all.

The models served their purpose - to provide foresight for planning and maybe a bit of fear, to help monitor, and to see the impact of interventions. Fortunately, we're out of the pandemic now (in my opinion). I post this now, finally, after having written it nearly two years ago, as a sort of professional horror story, a warning to others in the future, a reminder of what we need to do better.

Swenson's Law

2021-03-13T09:29:00.003-06:00

The other day we were going through agile training, and one of my colleagues was struggling with the concept of assigning unit-less numbers to work effort / difficulty instead of hours or a 1-10 rating.

I tried to convey that when we estimate tasks in hours or dollars it is often very wrong anyway, and gave a couple of examples.

Let's say you estimate $10,000 for a project. On day one, you find out that the software you expected to be available is not, and you have to pay $2,000, immediately. So now your estimate is off by $2,000, immediately.

Instead of estimating the project with a specific dollar amount and being immediately wrong, the agile approach encourages learning what work effort means through experience: you assign somewhat arbitrary values to projects, reflect on those estimates, and, ideally, improve your ability to estimate tasks in your own terms.

This led me back to an earlier realization I had while doing home projects:

Swenson's Law: It's never just "one thing".

Let's say you want to wash the exterior of your spouse's car. You go to pull it out of the garage / parking space and realize that it's filthy inside. It really needs to be cleaned inside, too. So you go to get the vacuum cleaner, but you find out that has a broken part. So you go to order the replacement part online, but you get an email from your tax accountant noting a missing document. So you go to scan the document and send it.

If this were a joke:

Spouse: I thought you were going to wash my car!

Me, sitting at a computer: I am, dammit!!!

I originally wrote a note about this back in 2015, before my daughter was born, but forgot to publish it:

I went into the kitchen to throw away a tissue. I noticed the trash was full. I thought: "Oh, I'll just take out the garbage." However, I then noticed that the top of the trash can was rather dirty, so I went to get a wet paper towel to clean it off. I turned to find that the paper towels were out.

So then I went downstairs to get some more paper towels. I remembered that we had laundry to put away, and that later today, when we usually do laundry, we would be out to the hospital for a tour. We had to do laundry earlier, likely about right now.

I got the paper towels, came back upstairs, cleaned the trash can, and while I was taking out the trash, noticed the recycling bin was also full. So that had to be taken care of, too.

This situation, along with many others (e.g., fixing anything in the house; replacing a light switch ended up taking half a day), resulted in a conclusion: It's never just "one thing".

Home Projects

2020-03-21T10:25:00.002-05:00

Stuck at home? Bored? Here are a few ideas:

1. Learn a new skill.

I highly encourage everyone to start learning a new skill online. Find a good service that is cheap or free, with a good online learning system. Be sure to schedule time on it every day, even if it's just 15 minutes. Learn to use Excel, program in SQL or Python, or learn a new language. These skills will always be useful.

Don't be limited by online learning. Has that piano, guitar, or flute been collecting dust? Clean it up first, show it some care that you don't usually have time for, and give it a go. If you're already good at it, consider teaching someone else.

That said, see if someone around you (physically or digitally) is interested in the same skill, and try to learn it together. If you learn a new skill together, you can help each other, encourage each other, and ensure you're committing to it every day.

On the flip side, if you're good at teaching, consider tutoring online or creating content for one of those online learning systems.

2. Check your living space for expired or out of date things.

Check your light bulbs for any incandescent bulbs. Replace any you find, if you have extra bulbs. Consider ordering LEDs, if it doesn't interfere with other deliveries. Initially I bought daylight LEDs, but we found them too obnoxiously bright in the evening and at night, so I recommend soft yellow lights.

Similarly, check your fire extinguishers. You may not be able to get them officially checked, but you can make a list of things to do post-quarantine. Expired fire extinguishers should be replaced if they have plastic handles; otherwise, they can be recharged by a certified specialist.

There are lots of other things in your house to check. Furnace filter, water filter, fridge filter, humidifier filters. Ok, so lots of filters.

3. Organize your storage items.

The best idea that I've had regarding storage is to NOT label boxes with words. Instead, use a number and keep a list with these columns: Box Number, Contents, and Location. If you're living in a small space, you may not need a Location; but even in a small space, boxes are easier to locate when the location is noted. Using this method of numbering boxes helps to avoid covering up old labels or leaving misleading labels.

To identify the location of boxes when they're on shelves, I use a location name with a letter / number coordinate just like spreadsheet software (Spreadsheet Cell Reference). Letters for columns (groups of boxes going up and down) and numbers for rows (groups of boxes going left and right). For example:

Number: 5, Contents: Pictures, Location: Storage Shelf B2
     A      B
1 |______|______|
2 |______|_Box5_|
3 |______|______|
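
If you prefer to keep the list in a script instead of a spreadsheet, the same three columns can be sketched like this. Box 4 and its contents are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    number: int
    contents: str
    shelf: str
    cell: str  # spreadsheet-style coordinate, e.g. "B2"


def find_boxes(inventory, keyword):
    """Return every box whose contents mention the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [box for box in inventory if keyword in box.contents.lower()]


inventory = [
    Box(4, "Holiday decorations", "Storage Shelf", "A1"),  # hypothetical box
    Box(5, "Pictures", "Storage Shelf", "B2"),
]
for box in find_boxes(inventory, "pictures"):
    print(f"Box {box.number}: {box.shelf} {box.cell}")  # Box 5: Storage Shelf B2
```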

Don't forget your digital storage! Organize your photos and videos, too. I use year/month/day format for pictures and videos, and I group pictures and videos by year/month/day folders. Sometimes if there aren't enough pictures to justify an entire folder, I lump some together.

For example, if there are only 5 pictures in Feb 2019, I put them all in a folder named 20190201 with picture names like 2019-02-01_0950.png or 2019-02-25_1735.png. Using the year/month/day format at the beginning, the pictures will sort correctly even if you edit them at a later date. For major events, like birthdays with lots of photos, I add an event name after the folder year/month/day, like so: 20190525_Anniversary_Party.
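
A short sketch of generating these names in Python. The function names are hypothetical helpers; only the naming patterns come from the examples above.

```python
from datetime import datetime


def picture_name(taken_at, extension="png"):
    """Build a name like 2019-02-25_1735.png from the time the picture was taken."""
    return f"{taken_at:%Y-%m-%d_%H%M}.{extension}"


def folder_name(day, event=None):
    """Build a folder name like 20190525 or 20190525_Anniversary_Party."""
    base = f"{day:%Y%m%d}"
    return f"{base}_{event.replace(' ', '_')}" if event else base


print(picture_name(datetime(2019, 2, 25, 17, 35)))              # 2019-02-25_1735.png
print(folder_name(datetime(2019, 5, 25), "Anniversary Party"))  # 20190525_Anniversary_Party

# Plain alphabetical sorting matches chronological order with this format.
names = ["2019-02-25_1735.png", "2019-02-01_0950.png"]
print(sorted(names))  # ['2019-02-01_0950.png', '2019-02-25_1735.png']
```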

4. Clean frequently touched items or replace them with automated items.

The CDC recommends cleaning frequently touched things every day. What I find quite obnoxious about this advice is that they don't tell you how to do it. I use a mix of water and bleach for light switches, doorknobs, and keyboards, but what about your phone? What are you supposed to use when it shouldn't get wet? I have a product that says it cleans phones, but there's little evidence that it does. Perhaps getting a slightly damp cloth and wiping down your phone and then immediately drying it is the best you can do.

Consider replacing frequently used light switches with automatic switches so you don't have to touch them. I bought one automatic switch for the main floor bathroom. This is likely not feasible for everyone, especially if you only have 1 bathroom. I suppose a voice-command light would be better so it doesn't turn on at night, but likely much more expensive than a simple motion detection switch. While you're at it, if you have any outlets that are loose, figure out the breaker, turn it off, and tighten them.

Personally, I don't like smart voice-command items like Alexa, but right now they seem like a smart choice since you can do so much without touching anything! No need to touch the speaker to play music, use your keyboard to search and order something, or get a reminder on your phone that you have to swipe to view. I might have just convinced myself to buy one...

5. Write about your experience.

One thing that has tripped me up over the years regarding my health is that I'll do something or something will happen, and months later I will have forgotten the solution or details of what occurred. If I had written a health journal, it would have helped me remember. This may be a great idea during this time of health crisis, especially if you aren't able to communicate your past condition.

It may also be cathartic to write about your experience, to write about how you feel, or to translate your experience and feelings into a work of fiction. My wife writes fiction as sort of a therapeutic practice. She doesn't let anyone read it (so far), but it really helps her work out emotions and life events in a different way.

One thing that I've forgotten to do over the years is to send an "update" email to people I care about, personally and professionally, updating them about my life. I usually do this by email, but I treat the writing of it like a long-form letter. I write it as if I won't get a response, like a one-way communication method. I usually get quite a few responses, and I think people appreciate hearing about changes in my life. I take the time to thank people, too, for helping me get where I am. And I always ask, "what's new with you?" It's been a great way to keep in touch.

Whatever you do in large-group emails, DO NOT put everyone's email in the TO or CC field! Put everyone in the BCC field and put your own address in the TO field. Your contacts may not want their email shared with 10s or 100s of others, and if any of those addresses are compromised by a hacker, you didn't share any email address with that hacker except your own.

6. Listen to music and stories the old-fashioned way.

How often do we sit back on the couch and just listen to music? No phones or tablets or books. People used to do that! Just listen. Share your music interests with those around you. Nowadays music preferences are so private, who knows what you like? Does anyone know you like to listen to death metal during your workout? Or J-POP on the way home from work? Who knows, someone you share with may really like it too.

I heard about a deal on a certain website that sells audio books. Maybe it would be fun to gather the family around the "radio" and listen to a chapter from a good book! Make some popcorn! Make it a weekly event.

I recently pulled out some old tapes I made as a kid, and my son and I listened to the goofy tapes for a good 30 minutes.

7. Share your ideas about what to do while stuck at home!

Of course we should all exercise more, and we need to be rather creative about it stuck at home. How do you manage? Any other ideas about what to do at home?

Tough Year

2017-09-13T17:03:00.003-05:00

It's been a tough year. I covered the beginning of the year in another post regarding the Calculus course I took in preparation for applying to a master's program in applied statistics. I had planned on studying for the GRE this summer and taking it this fall as well as applying for the master's program, then working through the master's program for the next 3-5 years. My employer, HDMS (a subsidiary of Aetna), was going to help pay for the degree. However, those plans were going to get thrown off course.

I had just finished submitting my coursework for reimbursement when I was sent an out-of-place meeting request. At the meeting, I found out HDMS was letting me go. At first it appeared it was just me, but as I found out later in the day, they were laying off about 10 other employees and closing about 10 other open positions. For two hours, I was trying to figure out what I did wrong - but it had nothing to do with my performance. I was the most recent hire on the team, and there were others getting laid off too.

I hit the ground running. The severance package included a career consulting service - I looked it up and scheduled time to review my resume with a consultant. The paper / PDF version has undergone quite a few revisions over the past few weeks. I also started contacting my network, browsing through online job boards, and all the usual job-hunting tasks. I did find quite a few roles through my network, and a couple of them resulted in offers.

I quickly found a role as a consultant data analyst with Great Wolf Resorts. I stayed there for a few weeks, working on a single project integrating credit card transactions with the reservations. Essentially, if a guest uses a credit card for something on site (e.g., restaurant) and does not charge it back to the room, the transaction is not connected to the reservation. In order to connect the two, I had to merge transactions with reservations based on guest name and, if available, the last four digits of the credit card number. It was messy, and I was able to get about 57% of the transactions matched. I believe the best possible rate was somewhere around 65%, but it would have required a lot of exception handling, manual matching, and/or time-intensive matching processes (e.g., matching text within another text field). The company analyst and I decided the additional matches weren't worth the expense.
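
A simplified sketch of that kind of matching, not the process I actually built. The field names and sample records are invented, and the rule of accepting only unambiguous matches is an assumption made for the example.

```python
from collections import defaultdict


def normalize_name(name):
    """Lowercase and collapse whitespace so 'Pat  Smith' matches 'pat smith'."""
    return " ".join(name.lower().split())


def match_transactions(transactions, reservations):
    """Match on guest name, narrowing by last four card digits when available."""
    by_name = defaultdict(list)
    for res in reservations:
        by_name[normalize_name(res["guest"])].append(res)

    matches = {}
    for txn in transactions:
        candidates = by_name.get(normalize_name(txn["guest"]), [])
        last4 = txn.get("last4")
        if last4:
            same_card = [r for r in candidates if r.get("last4") == last4]
            if same_card:
                candidates = same_card
        if len(candidates) == 1:  # accept only unambiguous matches
            matches[txn["id"]] = candidates[0]["id"]
    return matches


# Invented sample data.
reservations = [
    {"id": "R1", "guest": "Pat Smith", "last4": "1234"},
    {"id": "R2", "guest": "Pat Smith", "last4": "9876"},
    {"id": "R3", "guest": "Lee Wong", "last4": None},
]
transactions = [
    {"id": "T1", "guest": "pat  smith", "last4": "9876"},  # name + card match
    {"id": "T2", "guest": "Lee Wong", "last4": "5555"},    # name-only match
    {"id": "T3", "guest": "Pat Smith", "last4": None},     # ambiguous, skipped
    {"id": "T4", "guest": "Sam Jones", "last4": "1111"},   # no reservation
]
matches = match_transactions(transactions, reservations)
print(matches)                                        # {'T1': 'R2', 'T2': 'R3'}
print(f"{len(matches) / len(transactions):.0%}")      # 50%
```

The unmatched cases (ambiguous names, misspellings, missing cards) are the ones that would need exception handling or manual review.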

The position was a good fit for my skills, and I enjoyed working with the people there, but as a consultant role, the benefits were very expensive and of course it could have ended at any time. So, I kept looking for permanent roles while I was there. I need something more permanent right now, but I can definitely see myself as a successful consultant. In my short time there, I think I demonstrated a lot of value with my skills and the process and analysis I left behind.

Recently, I found a new role as a Senior Healthcare Analyst at SSM Health, a non-profit healthcare organization with hospitals from Wisconsin to Missouri. They also own Dean Health Plan, where I worked a few years ago. I still know a few people there, so it will be good to reconnect with them. I'll be analyzing healthcare data for a particular region of the system, starting in a few days. I feel very good about the team and the leader, so I'm looking forward to getting started. Luckily, I'll be working from home again, so I'll get to use my treadmill again.

Here's hoping quarter four is quite a bit less turbulent!

Calculus III

2017-08-01T12:12:00.001-05:00

In the last year or so, I decided to apply for a master's program in applied statistics, but I was missing one of the prerequisite mathematics courses: Calculus III. I had taken calculus courses in high school and college, but those courses were more focused on applications. Furthermore, I hadn't covered any of Calculus II in those courses.

Instead of taking the entire series, which would have taken quite a bit of time and money, I decided to do something rather daunting: I took Calculus III online and used Khan Academy and other sources to catch up on Calculus I and II. I read reviews that the first few weeks were tough even if you had taken Calculus I and II just before III. Undeterred, I started the course earlier this year, and the first few weeks were indeed tough.

The online program I used - NetMath - used an online math tool for running code and submitting homework. Each student is assigned to a mentor who grades assignments, answers questions, and ensures each student is on schedule. Students receive feedback on their homework and are able to re-submit corrections a couple of times. The two midterms and final must be taken in person with a proctor.

My first mentor was not very responsive. In week 2, a critical week in the program, my mentor did not respond to emails or grade my assignments in a timely fashion (within 3 days, as noted in the program handbook). I notified the program administrators and they assigned me a new mentor. She had quite a bit of catching up to do, but she did her best and eventually graded the outstanding assignments and responded to my questions. Honestly, she was amazing, and I'd write her a letter of recommendation if she asked.

Lesson 2 is quite difficult. It's really the first lesson on the topic of the course, where lesson 1 was review of parametric equations and other necessary concepts, and it's there in case anyone missed or forgot these topics. With the combination of difficult content and slow responses from my mentor, it took me 2 weeks to finish lesson 2. In addition, I got sick for a couple of days and there was a death in the family, which put me behind another 2 weeks or so. Fortunately, the program offers a two-month extension, and I planned on using it if needed.

However, there were additional, serious problems with the course. One of the most grievous was incomplete or incorrect content. There were often no terms given to ideas, preventing students who have taken this course from communicating the concepts effectively. For example, vector projection was just called "vector push on another vector". It took me quite some time to find the right term to be able to research this concept online.
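
For reference, the projection of a vector u onto a vector v is (u·v / v·v) v. A tiny sketch in plain Python, with example vectors chosen for illustration:

```python
def project(u, v):
    """Project vector u onto vector v: (u.v / v.v) * v."""
    scale = sum(a * b for a, b in zip(u, v)) / sum(b * b for b in v)
    return [scale * b for b in v]


# The part of (3, 4) that points along the x-axis.
print(project([3.0, 4.0], [1.0, 0.0]))  # [3.0, 0.0]
```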

The course also neglected saddle points and claimed that whenever a gradient was {0, 0} (or more zeros depending on the number of dimensions), that the point was a minimum or maximum of the function. This is blatantly not true when a saddle point is present, and it would be terrible for students to internalize this falsity since it is profoundly meaningful in calculating predictive models with machine learning, specifically neural networks. You can't assume you've optimized a function when the gradient is {0, 0} without looking around it to see if you've found a saddle point.
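
The standard way to "look around" a critical point of a two-variable function is the second derivative test. A sketch using two textbook functions that are not taken from the course:

```python
def classify_critical_point(fxx, fyy, fxy):
    """Second derivative test at a point where the gradient is (0, 0)."""
    d = fxx * fyy - fxy ** 2  # determinant of the Hessian
    if d > 0:
        return "minimum" if fxx > 0 else "maximum"
    if d < 0:
        return "saddle point"
    return "inconclusive"


def f(x, y):
    return x ** 2 - y ** 2


# f(x, y) = x**2 - y**2 has gradient (2x, -2y), which is (0, 0) at the origin,
# with second derivatives fxx = 2, fyy = -2, fxy = 0.
print(classify_critical_point(2, -2, 0))   # saddle point
print(f(0.1, 0) > f(0, 0) > f(0, 0.1))     # True: up in x, down in y

# g(x, y) = x**2 + y**2 also has gradient (0, 0) at the origin,
# with second derivatives gxx = 2, gyy = 2, gxy = 0.
print(classify_critical_point(2, 2, 0))    # minimum
```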

All told, I was quite unhappy with this course. Not only was I spending a lot of time catching up, but I was spending time trying to learn the material through other sources since the course material was incomplete or inaccurate. Nearing the end of the course, I was able to catch up to the point where it looked like I could finish the course if I just had another week or two. I emailed the administrators for a course extension, noting the reasons I had for the delay and the issues I had with the course, and they offered a shorter extension so I could finish it without rushing and without taking the full extension. (My schedule to finish without an extension would have been very demanding for my mentor to complete all the grading in time.)

Despite all the delays and issues, I finished all the material within the original time frame, and I just needed to take the final. I studied for a few extra days and took the final about 2 weeks after the course originally ended. Since the final was comprehensive and included the last three lessons, I was very nervous about it. I had aced the midterms, but there was just so much to remember (e.g., the curl of a 3D field is difficult to remember). To my great surprise and delight, I not only aced the final, I got an A+ in the course. I was relieved!

Now I just have to re-take the GRE and apply for the master's program.

Gazetteer Database for Geographic Analysis

2016-01-08T11:25:00.002-06:00

A couple of years ago, I had a tricky problem to solve. I inherited a tool a group of analysts were using to allocate website search based on ZIP code and location name (e.g., city, most commonly) for clients based on their own locations. The tool used the output of a predictive model for website search activity and inputs from the client, including addresses, for configuring the search locations that would be allocated for the client.

In addition to setting up relevant geographies based on the client's locations, the tool attempted to collect additional nearby locations that were likely relevant to the client (a "market"). The problem was that it did not find good matches for cities, towns, and other locations people were using on the website. As a result, the analysts were doing quite a bit of work to correct the output by removing and adding locations by hand. It was very time consuming, and I had to do something about it.

EDIT: I updated the following paragraph after I remembered how the algorithm was originally working. Initially I wrote that it calculated distances between locations, but it did not.

I reviewed the process and the data used to obtain location names. The algorithm used a simple lookup from ZIP code to location name, usually city or town. It did not attempt to look up nearby location names. The data did include latitude and longitude for the locations, so I thought I'd try adding code to look up nearby locations with this data. I asked around in the software development area and found that they were using a fuzzy distance calculation based on a globe. When I tried it out using the existing location data, I found several problems. Some of the latitude/longitude coordinates were in the wrong state or in the middle of nowhere. Additionally, the data was missing quite a few relevant locations, like alternative names for cities and towns, as well as neighborhood names, parks, and a variety of other place names people use in web searches. I discovered it was several years out of date, and there was no chance it would be updated. So I decided the data was simply junk. I had to find a new source.

I began searching online for government sources of location information. After all, the US government establishes ZIP codes, city and town designations, and executes the census every once in a while. The US government also has to release this data publicly, according to law. (This doesn't mean it's free, or easy to obtain.) So there must be publicly-available data regarding locations. Luckily, I ended up finding a free online source: the US Gazetteer Files (see "Places" and "ZIP Code Tabulation Areas" sections).

What's a "gazetteer"? A gazetteer is a list of information about the locations on a map. In this case, the US Gazetteer data includes latitude and longitude, useful for geographic analysis.

As I used the data, I found a few gaps, so I searched again and found the US Board on Geographic Names (see "Populated Places" under "Topical Gazetteers"). By integrating these two data sets, I had a rather comprehensive listing of all sorts of places around the US.

Next, I had to get the new location data working with the search configuration tool. The tool was written with a web front-end for the inputs, SQL to collect the data and apply the inputs, and Excel as the output data. So I had to do a bit of ETL (actually, I did ELT, loading before transforming) to get the new location data working with the tool. I ended up designing the model pictured here:

The main data is in gz_place and gz_zip, storing locations and ZIP code data, respectively. On the right of gz_place are some lookup tables, including a table with alternative names (gz_name_xwalk - "xwalk" meaning crosswalk). The ZIP table references a master list of potential ZIP codes (see the prior post about creating that table), a list of invalid ZIP codes that showed up in the prior location data, and a list of ZIP codes I determined were "inside" other ZIP codes (the algorithm for that is another discussion entirely).

The data on the left is a bit more interesting. There are some metadata tables not really connected to the rest (gz_metadata, gz_source), documenting quick facts about the data and where I found the data. Two reference tables also float off on their own, with a list of raw location names (gz_name) and a list of states (gz_state_51 - 51 to include DC), each including associated information.

Now I didn't want the tool to calculate distances between everything and everything else each time an analyst ran the tool, so I decided to precompute the distances and store only those within a certain proximity. I decided there were 3 types of distances required: ZIP to ZIP, location to location, and location to ZIP (and it could be used vice versa). To limit processing, I used a mapping of states and their neighbor states to connect the initial set of ZIPs and locations to use. This helped to decrease the run time. At the same time, I calculated the distances between each set of latitudes and longitudes, and retained only those within a certain number of miles. The final, filtered results are stored in gz_distance, with a lookup table describing the distance types (gz_distance_type).
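
As a rough illustration of the precompute-and-filter idea, here is a minimal Python sketch (the tool itself used SQL). It assumes the haversine great-circle formula as the distance calculation; the sample places, their coordinates, the neighbor map, and the function names are all hypothetical and not taken from the gz_ tables.

```python
import math
from itertools import combinations

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical sample rows: (name, state, latitude, longitude)
PLACES = [
    ("Madison", "WI", 43.0731, -89.4012),
    ("Milwaukee", "WI", 43.0389, -87.9065),
    ("Chicago", "IL", 41.8781, -87.6298),
    ("Denver", "CO", 39.7392, -104.9903),
]

# Hypothetical neighbor map; each state also counts as its own neighbor.
NEIGHBORS = {"WI": {"WI", "IL"}, "IL": {"IL", "WI"}, "CO": {"CO"}}


def nearby_pairs(places, neighbors, max_miles):
    """Return (name_a, name_b, miles) for pairs within max_miles.

    Pairs whose states are not neighbors are skipped before any
    distance is computed, which is what keeps the run time down.
    """
    rows = []
    for a, b in combinations(places, 2):
        if b[1] not in neighbors[a[1]]:
            continue
        miles = haversine_miles(a[2], a[3], b[2], b[3])
        if miles <= max_miles:
            rows.append((a[0], b[0], round(miles, 1)))
    return rows


pairs = nearby_pairs(PLACES, NEIGHBORS, max_miles=100)
```

With a 100-mile cutoff, Denver never gets a distance computed at all, and the Madison to Chicago pair is computed but dropped for being too far apart. Only the surviving rows would be stored.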

Finally, I could get the better location data into the tool. I replaced the original code with new code that uses the new location data, doing a simple lookup of the locations specified by the client (ZIP codes) and filtering for an appropriate distance. I created a few new inputs to help the analyst tweak the distance that the tool would use to filter the crosswalk, with the idea that clients in rural areas may find a larger area more relevant, and clients in dense urban areas may find a smaller area more relevant.

The results were excellent. The analysts praised the new process for being more accurate, less time consuming, and easy to use. There were some manual aspects to the process, for example, correcting spelling errors entered by users on the website, but these would become less of an issue as time went by. (Especially the spelling errors. The website administrators were switching from one vendor data set to another, which had better location suggestions/requirements based on the user's input.) Overall, it was almost completely automated and only required updates once in a while when new locations were added.

This was one of those projects where I really enjoyed the autonomy I was given. I was simply given a task (make this tool work better), and given free rein over how to do that. I worked with many people to get their feedback and help, especially from the database maintainer and a few users for testing the new inputs on the tool. (One interesting thing I did with the database was to partition the gz_distance table based on distance type. I got help from the database maintainer on the best way to do that.) And best of all, I really enjoyed the project.

Evolving Desk

2015-06-26T17:32:00.001-05:00

I previously wrote about my slightly unusual computer desk setup. I still use the same keyboard/mouse setup: a trackball mouse on the right, a regular mouse on the left (with a sticky note covering the laser so it doesn't move - it's used for scrolling and clicking), and the keyboard in the middle. I don't use the extra bluetooth mouse as much, since I've gotten used to being precise with the trackball mouse. (All the mice are still the same Logitech mice.) The USB extension cord is still there, too.

I have, however, upgraded quite a few other aspects of the desk. I upgraded the keyboard to a "tenkeyless tactile touch" keyboard from EliteKeyboards. It is missing the numeric pad (thus "tenkeyless") and it has special keys -- called tactile keys because they provide more feedback to the user when struck. It really feels so much different than the cheap keyboards people usually use.

The advantages of this keyboard include a smaller size, so I'm reaching less for the mice, and a better typing experience. The only disadvantages are that the keyboard was quite expensive (about $100) and I still need a numeric keypad - just not right on the same keyboard. So I also purchased a keypad that sits on the right side of the monitor table within reaching distance when it's needed.

I own the same monitor as before, but my new employer provided a monitor with a wider screen, so I use that one. It's nice having the wide screen for videos, but most of the time I don't use that much screen real estate. In fact, I have gotten used to keeping applications non-maximized so I can see other applications at the same time or hiding in the background.

I built my own standing desk out of the old corner computer desk, which worked great for a while, until I got the new job working from home. At that point, I needed to re-evaluate the space requirements. I needed to be able to sit and write on occasion. I did some research and found an article suggesting a very cheap Ikea standing desk. I didn't have a desk, though (since I ripped it up to make the first standing desk), so I decided to buy a table to place it all on. I ended up getting a table with adjustable A-frame legs. I figured the A-frame would provide greater stability, especially given that I planned on getting a treadmill.

A few final adjustments to the desk: I used a dowel to lift up the front of the keyboard shelf to allow for a more natural resting place for the hands, I put some old textbooks under the monitor to raise it up to the correct height, and I recovered the old keyboard tray for the sitting position. Now the desk is good for sitting and standing at my work computer. The sitting option only has one mouse, mostly because I didn't want to spring for another $100 keyboard. You want the nice keyboard, you have to work for it by standing or walking.

You may have noticed the odd device just below the monitor, with a string coming down from it: It's the control for a treadmill by LifeSpan, along with the safety cord that stops it when pulled. I decided I wasn't working out enough, and I thought it would be great if I could use a treadmill while working. That arrived about 3 months ago, and I had to raise up the desk to accommodate its height. Here's the complete setup:

The work laptop is over on the left side of the table, and when it is flipped open, it can be used while sitting. I usually do this when I'm tired or I have a meeting to attend (standing or walking at the treadmill is too distracting for all parties while on a call). My home computer (the tower behind the laptop) is only connected to the standing desk monitor, so I don't have a choice but to stand at that one. I suppose I could use my old monitor to figure out a sitting situation, but I haven't really needed to sit at my home computer. I have a KVM switch to toggle the computers, and the switch is just under the monitor.

I'm still getting used to standing and/or walking while working, mostly the impact on my body. I have not found it difficult to do many tasks while standing or walking. The only exception is, as I mentioned above, phone calls or meetings. Usually I want to take notes, so that's easier while sitting. I'm not used to standing or walking for hours on end, so when I get tired, I sit.

One of my colleagues asked how I was able to walk and work at the same time. It's not too difficult. As proof, here's a short and incredibly dull video of me walking while working:

In the video, I did the following tasks, not exactly in this order:

Wrote some code
Ran a command on the Linux command line
Reviewed the output from the above process
Reviewed a file of health-care-related records
Read some code
Thought about it for a bit
Figured out why a file had duplicates
Wrote an email
Drank some water

I was going 2 miles per hour (the treadmill ranges from 0.1 to 4 miles per hour, in 1-tenth of a mile increments). I find it is nice to back up a step while reading or thinking in order to walk more naturally. I would also recommend using a sports bottle. I made the mistake of using an open-topped cup, which could be spilled quite easily on any of the computer components or the treadmill.

I would highly recommend this arrangement and all of these products: the mice from Logitech, the keyboard from EliteKeyboards, the assorted desk stuff from Ikea, and the treadmill from LifeSpan. It makes for a good way to work and get a bit of exercise.

KeePass2 and Gmail

2015-05-20T11:58:00.001-05:00

It's been a while since I posted, but with a good cause: We just had our second child a few weeks ago!

The other day, Google decided to change Gmail's login screen to be in three parts (two for those without two-step authentication): 1. Username 2. Password 3. Two-Step Authentication. This is annoying because I use KeePass2 with auto-type, and the new pages interfere with the auto-typing mechanism.

Today I decided to solve that issue, and I'm posting it here to share with anyone else who may also have the same problem. Here are the steps:

Open up the Gmail or Google Apps email account you want to change ("Edit/View Entry" on the context menu, or hit Enter when the entry is selected).
Go to the Auto-Type tab.
Select "Override default sequence" option.
The field should initially have (without quotes):

"{USERNAME}{TAB}{PASSWORD}{ENTER}"

Replace that with (without quotes):

"{USERNAME}{TAB}{ENTER}{DELAY 2000}{PASSWORD}{TAB}{ENTER}"
Test in your favorite browser. Adjust the "{DELAY mmmm}" parameter by replacing "mmmm" with a different numeric value. This is in milliseconds, so a value of 1000 is 1 second.

The "{DELAY 2000}" parameter is the key to fixing the issue. Since there is a new page in between the username and password, we need a delay to let the page load before typing the password.

For more information on the auto-type feature and the parameters that can be used, see the KeePass documentation here:

http://keepass.info/help/base/autotype.html

Don't forget to tag your Gmail account as an OpenSSL account (add "OpenSSL" to the description) while you're at it!

Strange SAS Error Message

2014-11-25T13:09:00.004-06:00

I spent far too long trying to debug a strange error message in SAS. The solution ended up being aggravatingly simple, but arriving at the solution was not. So I wanted to share the problem and solution. (I discovered this issue using SAS 9.1.3.)

MPRINT(TRANSPOSE): proc datasets nolist;
SYMBOLGEN: Macro variable OUT resolves to lib.table_name
NOTE: Line generated by the invoked macro "TRANSPOSE".
161 modify &OUT; label &NEWVAR = "&NEWLBL"; quit;
               _
               200

MPRINT(TRANSPOSE): modify lib.table_name;
NOTE: Line generated by the macro variable "OUT".
161 lib.table_name
    _______________
    22
NOTE: Enter RUN; to continue or QUIT; to end the procedure.

SYMBOLGEN: Macro variable NEWVAR resolves to count
SYMBOLGEN: Macro variable NEWLBL resolves to count: prevalence
MPRINT(TRANSPOSE): label count = "count: prevalence";
MPRINT(TRANSPOSE): quit;

ERROR 200-322: The symbol is not recognized and will be ignored.
ERROR 22-322: Expecting a name.

NOTE: Statements not processed because of errors noted above.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: SAS set option OBS=0 and will continue to check statements. This may cause NOTE: No observations in data set.
NOTE: PROCEDURE DATASETS used (Total process time):
      real time 0.00 seconds
      cpu time 0.00 seconds

If you start looking where the error statements start and look down, you'll miss the true source of the error.

So for those not familiar with PROC DATASETS, one of the options on the PROC DATASETS statement is the LIBRARY= option. It defaults to WORK. So this PROC DATASETS is looking only in WORK for the dataset lib.table_name. Since the MODIFY statement expects just the table name, and "lib." is not a valid part of a table name, SAS cannot parse it correctly, resulting in an error.

The error seems to cause the SAS code parser to trip and read the rest of the code incorrectly, and it looks like the following step is where the error occurs. However, it's actually the earlier step ("modify &OUT" or the resolved code, "modify lib.table_name").

The lesson is twofold: 1) Don't try to use an output library with the table name (e.g., "lib.table_name") when the LIBRARY= option on PROC DATASETS is not used. 2) Don't write a macro where an output data set can be specified (e.g., OUT=) and use PROC DATASETS without checking whether the user specified an output library. Alternatively, don't use PROC DATASETS. (There are some advantages to PROC DATASETS, so don't throw the baby out with the bathwater!)

I wasted a lot of time on this since I thought the macro could handle any type of input/output specified, but apparently that is not true. The solution for me was to remove the "lib." from the OUT= argument to the macro ("&OUT"). Since PROC DATASETS was looking in WORK, it found the table after I made this "correction".
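
The check described in the second lesson boils down to splitting a one- or two-level data set name before handing it to PROC DATASETS. Here is a small sketch of that logic, written in Python rather than SAS macro code purely for illustration. The function names are hypothetical, and the generated text only relies on the LIBRARY= option and MODIFY statement mentioned above.

```python
def split_dataset_name(name, default_lib="WORK"):
    """Split 'lib.table' into (libref, member).

    A one-level name such as 'table' is assigned default_lib,
    mirroring how PROC DATASETS falls back to WORK.
    """
    parts = name.strip().split(".")
    if len(parts) > 2 or not all(parts):
        raise ValueError(f"not a one- or two-level name: {name!r}")
    if len(parts) == 1:
        return default_lib, parts[0]
    return parts[0], parts[1]


def proc_datasets_label(out, var, label):
    """Build the PROC DATASETS step with the libref moved to LIBRARY=."""
    lib, member = split_dataset_name(out)
    return (f"proc datasets library={lib} nolist; "
            f'modify {member}; label {var} = "{label}"; quit;')


step = proc_datasets_label("lib.table_name", "count", "count: prevalence")
```

The point is only that the libref ends up on the PROC DATASETS statement and the bare member name ends up on MODIFY, whichever form the caller passed in.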

A Great Quote and Idea

2014-05-22T09:36:00.000-05:00

I was browsing through Quora today and found this quote, on the topic of What are the top 10 things that we should be informed about in life?:

"Bear in mind... that your opponent in any debate is not the other person, but ignorance."
– Justin Freeman, Source

What a great quote. It embodies many lessons all in one. Don't focus on the person you're arguing with. In fact, don't argue. Persuade them away from ignorance. Brilliant quote.

How I Reacted to Heartbleed

2014-04-11T15:46:00.001-05:00

Recently a bug was revealed in OpenSSL, called Heartbleed. One of the very unfortunate aspects of this bug was the potential for all passwords and cryptographic keys on a server to be dumped to any hacker who knew about the flaw. As users, we can't control the cryptographic keys; however, we can control our passwords.

I was able to quickly identify my key accounts at risk in order to reset each password. How? With a password management database, KeePass. I also had help from the well-documented Mashable article.

I have all of my passwords stored in a KeePass database. Each account is categorized into groups like Banking, Email, Investments, etc. (see screenshot above). Using the Mashable article, I edited each description by adding "OpenSSL" for each account that used OpenSSL. I also ran a few online searches to determine the status of other accounts I wasn't sure about.

Once I finished that, I searched through the entire database (over 240 passwords!!) for "OpenSSL" to list all the entries together. Starting with my financial accounts in Banking, Investments, and other categories, I changed each password, one-by-one.

It took about an hour, but who out there without a password management database can say they reset all their key accounts so quickly?

Now I'm not going to stop there. I'll be resetting some passwords once a week until I hear this issue is cleared up. (Thus entering "OpenSSL" permanently in each account description.) It's going to take time to re-issue those cryptographic keys and for all the certificate authorities to synchronize. For these important accounts, I don't want to risk losing control.

In the end, if you are a user of web services like me, start using a password management database right now. (I would not use LastPass since it is online. However, according to their documentation, they use forward secrecy, which is currently the best way to do these things and would prevent any true information leak.)

Start by entering your most important accounts and change the passwords to random passwords. Gradually add those you don't use as frequently or are not as important. Then set a schedule for each important password to expire on a regular basis: every 2 months, 6 months, or whatever you think is best for the account. My financial accounts are reset every 3-6 months.

Hopefully webmasters, server admins, cryptographers, and anyone else involved in this ecosystem start to realize that we have a broken internet. Encryption technologies are failing and need a serious upgrade. In some places, we don't even have encryption, and it's harming trust. It's up to the gatekeepers to keep us safe and to promote trust - we users can only do so much.

Pizza Day Award

2014-03-20T09:42:00.002-05:00

At CPM Healthgrades, we have a monthly Pizza Day. Originally the idea was to honor employee birthdays and work anniversaries, and instead of buying each person lunch on multiple days, leadership bought everyone lunch, specifically pizza, once a month.

Before the pizza would arrive, someone would read a list of employee birthdays and anniversaries. Eventually, leadership realized that the lists were too long and dropped specific recognition. However, they still announce new employees, company news, and honor an employee of the month. Nominations are requested for employee of the month, called the Pizza Day Award, and the nominations, sometimes amusing, are read aloud. The winner receives the Golden Pizza Slice (pictured above).

This month, I was nominated. Twice! Before I get to their nominations, I need to explain a certain recurring theme.

My coworkers, Andrew and Brian, dressed as a knight and steed for Halloween. The company was having a costume contest. Andrew wore a chain mail tunic and a shield with a skull on it (both self made!). Brian wore a horse head mask, and he hefted Andrew on his back for the contest pictures. The pictures were then sent out for everyone to vote for the best costume. (Of course, I voted for them.) The horse head mask ended up in Brian's cubicle, where it still remains.

Inspired by Brian's horse head, I brought in my own horse head, a marble book end. So apparently the Analytics Team's mascot is a horse.

So with that explained, here are their nominations. First, Brian's:

I’d like to nominate Chris Swenson for a pizza day award this month for his hard work on the PDC configuration workbook. He really didn’t horse around when it came to bringing this old mare up to date, he saddled right up and did what needed to be done. Now we have a true thoroughbred of a workbook on our hands, allowing us to configure markets much more rapidly and accurately than ever before. We all really appreciate the hard work he’s put into this project and feel he deserves some recognition for his efforts.

And here's Andrew's:

I would like to nominate Chris Swenson for the pizza day award. Even though he has been yoked down by various research requests, he’s managed to plow through the workbook to make it dramatically more useful. The old version can now be put out to pasture while the improved version really gallops along like the true stallion it is. His efforts have reduced the PDC analysts work load by several hours per request. He’s really a horse of a different color.

At Pizza Day, the reader was instructed to take careful note of the theme. I was selected, and I wonder if it was mostly because of the amusing nominations. At any rate, I received the Golden Slice Award:

It reads:

As a token of our appreciation, please accept this certificate for your outstanding performance and contribution. Your hard work and dedication to this company makes you a ambassador of what CPM stands for. Not only does your attitude make a positive impact on your peers, but it sets the tone for a pleasant and productive work place. Keep up the good work!

Along with the Golden Slice came a $50 gift card!

I appreciate the recognition, especially since the work our team does is quite well hidden, in databases and used in software. Thanks Andrew and Brian!

Bad SQL Writing Put to Good Use

2014-01-29T17:31:00.001-06:00

There's a certain style of writing SQL that I really don't like. Here's an example that pulls names and addresses for people in Wisconsin and Illinois:

select p.name, a.address
from person p, person_address pa, address a
where p.person_id = pa.person_id
and pa.address_id = a.address_id
and a.state in ('WI', 'IL')
;

Basically, the author has stacked all the tables into the FROM clause and specified how they join in the WHERE clause. This creates confusion about how the tables are intended to be joined, and it mixes actual filter criteria with the join conditions. However, it works, since the code results in inner joins between all tables, and that was okay.

My preferred style is like so:

select p.name, a.address
from person p
left join person_address pa
on p.person_id = pa.person_id
inner join address a
on pa.address_id = a.address_id
where a.state in ('WI', 'IL')
;

It's a bit more verbose, but that helps the reader. This style splits out the tables into separate clauses and results in clearly indicated join types and join fields. The intention behind joining each table is clear to a reader of the code. The filter criteria are located in the WHERE clause without any join conditions to confuse them with. There are still cases where the result may not be as expected based on the filter criteria, but it's easier to debug.

Overall, the first example is a confusing style to use, and it can cause trouble if the joins were intended to be outer joins and were not, because the style does not have a way to specify outer joins. (LEFT JOIN is short for LEFT OUTER JOIN, which means, basically, return all records from the first table, and any data that matches in the next without missing any records from the first.)
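
To make the difference concrete, here is a small sketch using Python's built-in sqlite3 module. The table and column names follow the examples above, but the three people and two addresses are made up for illustration. It runs the comma-style query, the same query with explicit inner joins, and an outer-join variant that keeps every person (with the state filter moved into the ON clause, which is an assumption about what an "intended outer join" might look like).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table person (person_id int, name text);
    create table person_address (person_id int, address_id int);
    create table address (address_id int, address text, state text);
    -- Made-up rows: Cy has no address, Bob lives outside WI/IL.
    insert into person values (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    insert into person_address values (1, 10), (2, 20);
    insert into address values (10, '1 Main St', 'WI'), (20, '2 Oak Ave', 'MN');
""")

# Comma-style joins: every table is effectively inner joined.
implicit = con.execute("""
    select p.name, a.address
    from person p, person_address pa, address a
    where p.person_id = pa.person_id
      and pa.address_id = a.address_id
      and a.state in ('WI', 'IL')
    order by p.name
""").fetchall()

# The same query with the inner joins spelled out.
explicit_inner = con.execute("""
    select p.name, a.address
    from person p
    inner join person_address pa on p.person_id = pa.person_id
    inner join address a on pa.address_id = a.address_id
    where a.state in ('WI', 'IL')
    order by p.name
""").fetchall()

# Outer joins keep every person, matched or not.
outer = con.execute("""
    select p.name, a.address
    from person p
    left join person_address pa on p.person_id = pa.person_id
    left join address a on pa.address_id = a.address_id
                       and a.state in ('WI', 'IL')
    order by p.name
""").fetchall()
```

The first two queries return the same rows, while the third also returns the people with no matching address (as NULLs). The comma style has no way to express that intent.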

The other day, though, I encountered a great way to use this potentially error-prone style in a way that is actually very useful.

I wanted to generate a master list of all potential ZIP codes in the US, and then filter out ones that are not in use or are otherwise invalid. I started by creating a small table with 10 rows that consist of 1 column with the numbers 0-9. With no loop statements available in SQL, I wrote this table like so:

create temporary table num (n int);
insert into num values (0);
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
delete from num where n > 9;

That last statement is just in case I ran it too many times (the block above repeats the insert ten times, one more than needed, so it does remove a row). I was even too lazy to write out 1-9, instead just repeating the max+1 code. It's a bit over-the-top, but it works.

To get my "master" list of ZIP codes, I joined the table to itself 5 times, one for each character in the ZIP code (because they can start with 0s, they should be treated as characters, not numbers!). Here's how:

create table master_zip as
select n1.n||n2.n||n3.n||n4.n||n5.n as zip
from num n1, num n2, num n3, num n4, num n5
order by 1
;

Simple, isn't it? Essentially, this just makes a big Cartesian product of the table with itself times 4. There's no WHERE clause, because there's no need to join on anything. (If a system required it, I would just write "where 1=1".) This generates 100,000 records. That's about 58,000 too many, according to the US Postal Service, so we need to delete some of those that are not in use. But that process is for another time.
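
For anyone who wants to try the cross join without a database server handy, here is a sketch that runs the same SQL through Python's built-in sqlite3 module. SQLite is an assumption here, not the database used originally. The digits table is filled with a Python loop instead of the repeated inserts.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Build the ten-row digits table (0 through 9).
con.execute("create table num (n int)")
con.executemany("insert into num values (?)", [(i,) for i in range(10)])

# Cross join five copies of the table; || concatenates the digits as text,
# so leading zeros are preserved.
con.execute("""
    create table master_zip as
    select n1.n||n2.n||n3.n||n4.n||n5.n as zip
    from num n1, num n2, num n3, num n4, num n5
    order by 1
""")

total, first_zip, last_zip = con.execute(
    "select count(*), min(zip), max(zip) from master_zip"
).fetchone()
```

Here `total` comes out to 10 to the 5th power, with `first_zip` and `last_zip` as the five-character strings at either end of the range.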

Using a style that I usually warn against was interesting and useful, but it requires knowing a bit more about how these things work. Had I run this last bit of code on a larger data set, it would have caused a lot of problems, like running out of disk space or RAM. So use this style sparingly and carefully!

Happy Birthday to Me, From Me, Sorta

2013-12-06T12:06:00.002-06:00

A few years ago, I worked at a company that uses SAS and has a SAS server set up. I figured out that scheduling a SAS job was a bit clunky in Windows, so I mitigated the problem by setting up a SAS program to run any tasks on a schedule, and then scheduling that program in Windows.

Since the program was evaluating dates and times to schedule tasks, I threw in a few happy birthday emails as a joke. A few years later, I woke up today to find this in my email inbox:

Chris,

Happy birthday! I hope you have a great day!

Sincerely,

The [Company Name] SAS Server

Well, I'm surprised that's still running! At least it wasn't sent from my old account on that server: they had at least updated who was running the job. But I'm a bit surprised they still let this go. Either they don't realize that the scheduling code contains the happy birthday emails, or they don't mind.

As a side note, the program also ran CheckLog, which reviews the SAS log output for issues, and it emailed the author about the status of the program. This was better than running CheckLog within a program, since the program could crash before it even got to the end, resulting in no status update when it was the most critical.

At any rate, it's good to know what I wrote is still useful, even if it contains a bit of questionable code!

What is a Data Scientist?

2013-11-12T12:52:00.004-06:00

Today on Quora, someone asked, "What are some software and skills that every Data Scientist should know?". I wrote the following as a response, reflecting on my current position and the role I play.

I started adding post-it notes with sub-titles to my name/title tag on my cube, as sort of a joke regarding the question, "What is a Data Scientist?".

Here's the current list:

[Client] Analyst
[Product A] Analyst
Financial Analyst
Sales Analyst
Contract Analyst
Quality Assurance Analyst
Call Center Analyst
Data Surgeon (aka, data mining with the intent to figure out what's wrong)
Data Diagnostician (alternative of above, maybe with no details to examine)
[Product B] Analyst
Database Developer
Bug Finder (as in software bugs)

So it would appear from this list that there isn't a lot of data science going on. And that's partially true.

Each of our clients has its own relational database, so we do "meta-queries" to access them one by one in order to answer a question. That's sort of data-science-like. Eventually, though, we're going to have one master database with all clients that will cascade into individual databases. So our "meta-queries" will be obsolete.

We deal with a lot of "big data" too, but it's usually not that big of a deal. Even with relational databases, it's okay. Some queries may take a little longer (30-60 minutes), but that's rare. We have some machine learning tasks that pull in massive training data sets, so at that point you have to be more careful about "big data" problems like running out of RAM or disk space. But it can be handled, and rather simply.

What I really wish I could do more of is machine learning, and while I've accumulated several ideas that would enhance products or help us make better decisions in the year I've been a Data Scientist, these other tasks take up most of my day.

In the end, I write a lot of SQL, use the Linux command line moderately, and report on data in Excel spreadsheets. I use Python occasionally to write scripts. And I'm always learning something new (new SQL techniques, Python libraries, Linux command line tools, etc.).

Redundant SQL

2013-08-23T15:50:00.000-05:00

Today I wrote some SQL that, when spoken, sounded like "select client from client as client", written:

select (select client from client) as client ...

Then I thought, could I actually write something even more ridiculous? What about "select select as as from from", and so I came up with:

create table "from" as select 'select' as "select" from dummy;
select "select" as "as" from "from";

The result is a one-cell table:

as
------
select

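The table "dummy" is a placeholder table, used for "selecting" from something not actually in any table. (It needs to hold exactly one row for the result above to be a single cell; a truly empty table would return no rows.) The double quotes make the SQL execute even though it is using reserved keywords.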
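
Here is a quick way to reproduce this with Python's built-in sqlite3 module. SQLite is an assumption (not the database used originally), and so is the one-row "dummy" table, since a select from a table with no rows returns no rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table dummy (x int);
    insert into dummy values (1);
    create table "from" as select 'select' as "select" from dummy;
""")

# Double-quoted names are treated as identifiers, not keywords.
cur = con.execute('select "select" as "as" from "from"')
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

# With no rows in dummy, the same kind of select returns nothing.
con.execute("delete from dummy")
empty_rows = con.execute("select 'select' from dummy").fetchall()
```

After this runs, `columns` holds the single column name and `rows` holds the single cell, while `empty_rows` is an empty list.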

I tried adding a "group by group" and an "order by order", but that didn't work as I would have thought. I guess I can take this silliness only so far.

Lego Raspberry Pi Cases

2013-03-17T08:57:00.002-05:00

I purchased a Raspberry Pi last year, and inspired by a girl in Britain who assembled her own Pi case, I spent way too much time making my own Lego case:

I made it so it could be mounted on the wall. Additionally, the left side opens to expose the GPIO pins, although I doubt you could fit something on them without having to remove pieces from the case.

A friend of mine also has a Pi, but he was using a plastic food container for its housing. I simply could not stand for this, so I created another case for him:

This one doesn't mount on the wall. I think he's just using it on an entertainment center near the cable modem box. Unlike my case, it opens fully, exposing the entire board. I forgot to take a picture of this feature, but you can see the hinges on one of the side photos.

Both of these cases use some classic Lego pieces that I received from my cousin when I was a kid. The old-school computer terminals in blue and grey and the pieces with space logos came from this set, and I mixed them with some newer space buttons and circuitry. The glass-like covers are also newer, but have a similar hokey space feel to them.

It was fun putting my old Legos to use again!

My Professional Network, Graphically

2013-02-06T12:24:00.003-06:00

A few months ago, I was thinking about my professional network and wondered what it would look like if it were mapped out. It turns out, LinkedIn has a lab project to do just that. I loaded up my profile and a few minutes later, this came out (click for larger version):

Each dot in the above chart represents a person, and each line represents a connection between people. The larger the dot, the closer the relationship. I am at the center, and unique clusters of people become apparent by their interrelationships, and they are grouped together in space.

The lab does not label the clusters, but it does identify the clusters by color, and it allows the user to identify those clusters and name them, as I have done above. Further, you can explore your network by hovering over the individual dots that represent people.

Essentially, five groups arise from my network: Family, Friends, and Educators on the left; Dean Health Plan at the top; CPM Healthgrades at the upper right; the UW Health System, which is composed of the University of Wisconsin Medical Foundation (UWMF), the University of Wisconsin Hospitals and Clinics (UWHC), and the University of Wisconsin School of Medicine and Public Health (UWSMPH); and finally the SAS Institute at the bottom.

The lab isn't perfectly accurate, but it is pretty good. I checked out a number of individuals and some don't make sense, but most do. As an example, my wife is one of the larger dots on the left, which makes sense, since she also has networked with our friends and family on LinkedIn (although I usually avoid doing so for a number of reasons).

The UW Health cluster is visually split, but there is apparent movement and interrelationship between the organizations. There are some people who traveled between the UW and Dean, one way or another. The same is true between Dean and CPM, with most, I believe, coming from Dean to CPM. There are some hubs in each organization, likely managers, project managers, or other people who attended a lot of meetings (I think one of the big dots at Dean was an IT manager who sat in on a lot of projects).

Aside from family and friends, SAS is probably the oddest group. I have some connections, and they seem to be somewhat strong. Before I attended the SAS Global Forum, this chart may have looked quite different, since I made many more connections after the conference. The chart also shows how well-developed my connections were at Dean and the UW, and how I'm still fairly new at CPM. (I bet if I ran this today it would look a bit better developed.)

Of course, charts like these leave out people who aren't on such sites as LinkedIn, but I would think that all the other people would compensate for them when graphed like this. Additionally, I don't have much of a network for old jobs like the ones I had in college, nor have I really networked much with fellow college classmates.

It would be interesting to see what other people's networks look like, especially people who are essentially professional networkers, like HR professionals or recruiters. How do networks in different industries look (mine is mostly health care)? What if you have someone who only networks with family or friends? What does that show? Perhaps different geographic locations you've lived in? How about someone who is a world traveler?

This type of graph is very powerful in that it makes you think about the data behind it and ask questions like the ones above. It opens doors we hadn't thought of and inspires curiosity.

An Amended Quote

2012-12-12T13:45:00.001-06:00

Today on The Writer's Almanac, Garrison Keillor read a quote by Gustave Flaubert:

"I spent the morning putting in a comma and the afternoon removing it."

I thought this could use some amending for programmers:

I spent the morning putting in a parenthesis and the afternoon removing it.

Or for you SAS programmers:

I spent the morning putting in a semicolon and the afternoon removing it.

How about SQL:

I spent the morning putting in a column and the afternoon removing it.

Or if you write HTML:

I spent the morning putting in an escaped ampersand and the afternoon removing it.

Why not Python:

I spent the morning importing a library and the afternoon removing it.

Essentially, though, these all boil down to:

I spent the morning putting in (an) arbitrary character(s) and the afternoon removing it.

For all you programmers during this holiday season: May your parentheses be matched (your quotes too!), special characters escaped, and code executable on the first run.

Two Special Notes

2012-09-22T08:22:00.000-05:00

A few weeks ago we were packing up all our stuff to move to our new house. During this process, I found two notes from former coworkers, delivered on my last day of work at different jobs. Here's the first:

Here's the transcription in case the image does not appear:

Dear Chris —
     I have enjoyed watching you learn and grow over the last couple of years. You have become and excellent analyst — UWMF is lucky to have you. I will miss your superb work, smiling face and Luca stories! Best of luck to you on this exciting step in your career! I wish you, Kyra and Luca much happiness!
     Sincerely,
     [name]

And here's the second:

with the text:

Chris
     I really appreciate the time and dedication and professionalism that you offered to the [organization's] projects and my RA project in particular. You pressed on to make this all possible, and I am grateful. Regrets for the frustrations along the way, but know that your contribution will be remembered.
    All the best with your next personal + professional pursuits! Congrats on the well deserved Scientist position!
    Thank you,
    [name]

I kept both notes because they are very meaningful to me, and I think this highlights how important it is to write hand-written notes of thanks to your colleagues: It can be touching, powerful, and memorable for the recipient, and the writer earns a relatively permanent place in memory. It's a perfect way to not only show genuine gratitude, but also earn a positive place in the recipient's mind, whether for the purpose of networking, friendship, or simple kindness.

New House

2012-09-07T12:09:00.003-05:00

I haven't been posting much lately because we've been looking for, buying, and moving to a new house! Hopefully once things settle down I'll be back to posting as usual. Stay tuned!

An Attempt to Answer a Quiz Question

2012-06-19T07:40:00.000-05:00

Kyra heard the NPR Weekend Edition Sunday puzzle on the radio the other day: "Think of a common French word that everyone knows. Add a 'v' (as in 'violin') to the beginning and an 'e' at the end. The result will be the English-language equivalent of the French word. What is it?"

I thought, well, I don't know French, but I do know how to write a Python program that can look up the answer. Unfortunately, the puzzle appears to be a bit of a trick. I thought as much as I was writing the code. I was thinking, "I bet some people are going to be mad about the answer, because the question was worded in such a way as to be tricky."

At any rate, the code definitely does not result in the right answer, but it was a good exercise for me. First, I had to find two dictionaries robust enough to have a lot of words in English and French. I located the Debian operating system dictionary on my local computer, searched the package manager Synaptic for the French equivalent and any other English dictionaries that might help, and downloaded both. (The English alternative is labeled "insane" for its size, and it "possibly contains invalid words (as well as words that are very uncommon)." [1])

Next I had to convert certain French characters into their English equivalents. To answer the question, I looked up all English words that start with "V" and end with "E", then removed those two letters and looked up the remaining "word" in the modified French dictionary. It didn't work, as you can tell from its output:

a
a
ah
aire
alu
an
ange
ares
as
er
erg
es
es
il
o
t

Yep, those are definitely not the answers. Some, when you add V and E back, aren't even really English words. "Vte"? What is that? Oh, and I think A is listed twice due to the missing diacritics.

For anyone who's interested, I posted the code online. (It's not pretty posting a lot of code on Blogger, so here it is in a document.) Keep in mind I'm a new programmer, so I might not have done this task very elegantly. It is a short piece of code, though, and it could have worked.
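For readers who just want the gist, the approach described above can be sketched roughly like this. This is not the original script; it is a minimal reconstruction under a few assumptions: the two dictionary paths are guesses at typical Debian word-list locations, the files are assumed to be UTF-8, and the helper names (strip_diacritics, normalize, find_candidates, load) are made up for illustration.

```python
import unicodedata
from pathlib import Path

# Assumed Debian word-list locations; the post does not give the exact paths.
ENGLISH_WORDS = Path("/usr/share/dict/american-english-insane")
FRENCH_WORDS = Path("/usr/share/dict/french")


def strip_diacritics(word):
    """Map accented letters to their unaccented base letter (e.g. 'é' -> 'e')."""
    decomposed = unicodedata.normalize("NFKD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(words):
    """Trim, lowercase, and strip diacritics from an iterable of words."""
    return [strip_diacritics(w.strip().lower()) for w in words if w.strip()]


def find_candidates(english_words, french_words, prefix="v", suffix="e"):
    """Return French words that form an English word once wrapped in prefix/suffix."""
    french = set(normalize(french_words))
    matches = set()
    for word in normalize(english_words):
        # Require at least one letter between the prefix and the suffix.
        if len(word) <= len(prefix) + len(suffix):
            continue
        if word.startswith(prefix) and word.endswith(suffix):
            core = word[len(prefix):len(word) - len(suffix)]
            if core in french:
                matches.add(core)
    return sorted(matches)


def load(path):
    """Read a one-word-per-line dictionary file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readlines()


if __name__ == "__main__":
    # Only runs when both assumed dictionary files are actually present.
    if ENGLISH_WORDS.exists() and FRENCH_WORDS.exists():
        for candidate in find_candidates(load(ENGLISH_WORDS), load(FRENCH_WORDS)):
            print(candidate)
```

For example, find_candidates(["vale", "voile"], ["al", "oil"]) returns ["al", "oil"]. Unlike the output shown above, this sketch collects matches in a set, so a word such as "a" that appears twice after its diacritics are stripped is listed only once.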

One other note: If you attempt to run it, it may not work on your system. It's designed for Debian GNU/Linux, as it relies on dictionaries in a Debian system and it uses the "/" as the directory delimiter. Additionally, the "insane" dictionary makes it run a bit slowly, and I did nothing in the code to help with that, like buffering the file, so you could run out of RAM or max out the CPU. It works on my system, which is all that mattered to me at the time.

Of course, a more robust program should be more careful and actually take user input in some fashion. Perhaps it could look up words with other characters or use different languages. As it is, it's more of a "script" that does one thing, non-interactively. In any case, it doesn't even return the "right" answer!

[1] http://packages.debian.org/squeeze/wamerican-insane

Update in May 2012

2012-05-27T14:29:00.000-05:00

Last year around this time I had decided to take a risk and leave my position at the University of Wisconsin Medical Foundation for a position at the UW School of Medicine, where I was hoping to be more involved with research. This endeavor hasn't turned out as I expected, so I've decided to move on. Starting later this month, I'll be working with CPM HealthGrades, a marketing firm that is offering more and more data services to their customers.

I accepted the position at CPM HealthGrades as a Data Scientist. This is a pretty cool opportunity in a relatively new field of data analysis, using some of the same technologies behind IBM's Watson computer, the computer that bested two top-ranked Jeopardy winners. I'll be learning the programming languages Python and R, and I'm really excited to put these skills to commercial and research use.

As you may know, I attended the SAS Global Forum conference last month, and I plan to keep up with the SAS community as well, by writing papers, presenting, staying in touch with user groups, and maybe even writing a book. So I plan on keeping my SAS site up-to-date; however, I may not be developing much new. Perhaps I'll come out with a Python or R site to go along with it, with translations of code back and forth between the languages. Heck, maybe I'll write a SAS macro to write Python, or a Python module to write SAS. We'll see! Stay tuned for more details!

SAS and Keyboard Shortcuts

2012-04-18T12:01:00.001-05:00

Base SAS has an easy way to add keyboard shortcuts that not only do X typical keyboard shortcut task, but also Y SAS program task (e.g., execute a macro). This can be a bit tricky to get to work right, but I've done it with CheckLog, PrintContents, OpenTable, and other interactive macros. However, I can't seem to figure out how to do some of these same things in Enterprise Guide.

Sure, it's easier in Enterprise Guide to open a table that's been run, but what about that library with 10,000 tables where I want to open only 1? It's like searching for a needle in a haystack, not to mention the load time just to see the tables in the library (very slow at 10k tables). It's so much easier to use OpenTable, and it's even easier to use OpenTable via keyboard shortcut. I have another keyboard command to open the properties of the table (a window that is quite severely limited in SAS EG, so perhaps that wouldn't be quite as useful anyway).

And the windowing commands seem to be gone, too. I can't edit a keyboard shortcut to jump from the program to the log: In Base SAS, I use CTRL + L for the log, CTRL + J for the program ("J" only because it's easy to reach and on the same keyboard line as L), and CTRL + K for executing the CheckLog macro ("K" for checK, I guess--again, it's on the same keyboard line as J and L). Just to note: I also have the CAPS key remapped to CTRL, so that's even easier to reach for my ~~lazy~~ efficient hands.

If anyone has experience with doing these things in Enterprise Guide, I'd love to hear feedback on how you get things like this to work - or if you just gave up and went with the new flow. My searches online have been fruitless up to this point, so I wanted to have this post available, at least for a point of contact in case someone else is wondering about the same things I am.

Another Odd SAS Error Message

2012-04-09T15:22:00.000-05:00

I encountered the following error message last week:

"ERROR: Invalid value for width specified - width out of range"

The code that generated it was using SQL and the PUT function to convert an ID using a format, like so:

compress(put(id, idfmt.), '. ')

Here, IDFMT is a user-created format, stored in a data set. The quick and dirty solution is to add the "?" modifier to the PUT function, which is oddly documented on the INPUT function but not the PUT function:

compress(put(id, ? idfmt.), '. ')

Now why wouldn't that be documented on the PUT function page?

And what about my original problem? I've tried varying the lengths of the input column and the format, removing the COMPRESS function, and switching to TRANWRD or PRXCHANGE. I also tried to identify the records involved, but with 35.6 million records, that was far too slow. I know that the issue occurred in the second half of the data, but beyond that, it is too time-consuming to look for the needle in the haystack.

Finally, I figured out a solution: The original format is based on an input data set, using the CNTLIN= option on PROC FORMAT to generate the format on the fly. The data set does not have the HLO (High/Low/Other) flag to indicate what non-matching (Other) starting values would be labeled as. I added it like so:

data custom_format;
    /* Output the original records */
    set custom_format end=end;
    output;
    /* After the last record, output an additional "Other" record */
    if end then do;
        start=.;
        label=.;
        fmtname='idfmt';
        hlo='O';
        output;
    end;
run;

Then the code using the format worked, and to boot it no longer needed the COMPRESS function to remove the blanks and numeric missing values (".").

Hopefully anyone else with a similar issue will find that one of the above solutions fits their need.