Friday, April 11, 2014

How I Reacted to Heartbleed

Recently a bug was revealed in OpenSSL, called Heartbleed. One of the very unfortunate aspects of this bug was the potential for all passwords and cryptographic keys on a server could have been dumped to any hacker who knew about the flaw. As users, we can't control the cryptographic keys; however, we can control our passwords.

I was able to quickly identify my key accounts at risk in order to reset each password. How? With a password management database, KeePass. I also had help from the well-documented Mashable article.

I have all of my passwords stored in a KeePass database. Each account is categorized into groups like Banking, Email, Investments, etc. (see screenshot above). Using the Mashable article, I edited the each description by adding "OpenSSL" for each account that used OpenSSL. I also ran a few online searches to determine the status of other accounts I wasn't sure about.

Once I finished that, I search through the entire database (over 240 passwords!!) for "OpenSSL" to list all the entries together. Starting with my financial accounts in Banking, Investments, and other categories, I changed each password, one-by-one.

It took about an hour, but who out there without a password management database can say they reset all their key accounts so quickly?

Now I'm not going to stop there. I'll be reseting some passwords once a week until I hear this issue is cleared up. (Thus entering "OpenSSL" permanently in each account description.) It's going to take time to re-issue those cryptographic keys and for all the certificate authorities to synchronize. For these important accounts, I don't want to risk losing control.

In the end, if you are a user of web services like me, start using a password management database right now. (I would not use LastPass since it is online. However, according to their documentation, they use forward secrecy, which is currently the best way to do these things and would prevent any true information leak.)

Start by entering your most important accounts and change the passwords to random passwords. Gradually add those you don't use as frequently or are not as important. Then set a schedule for each important password to expire on a regular basis: every 2 months, 6 months, or whatever you think is best for the account. My financial accounts are reset every 3-6 months.

Hopefully webmasters, server admins, cryptographers, and anyone else involved in this ecosystem starts to realize that we have a broken internet. Encryption technologies are failing and need a serious upgrade. In some places, we don't even have encryption, and it's harming trust. It's up to the gatekeepers to keep us safe and to promote trust - we users can only do so much.

Thursday, March 20, 2014

Pizza Day Award

The Golden Pizza Slice

At CPM Healthgrades, we have a monthly Pizza Day. Originally the idea was to honor employee birthdays and work anniversaries, and instead of buying each person lunch on multiple days, leadership bought everyone lunch, specifically pizza, once a month.

Before the pizza would arrive, someone would read a list of employee birthdays and anniversaries. Eventually, leadership realized that the lists were too long and dropped specific recognition. However, they still announce new employees, company news, and honor an employee of the month. Nominations are requested for employee of the month, called the Pizza Day Award, and the nominations, sometimes amusing, are read aloud. The winner receives the Golden Pizza Slice (pictured above).

This month, I was nominated. Twice! Before I get to their nominations, I need to explain a certain recurring theme.

My coworkers, Andrew and Brian, dressed as a horse and stead for Halloween. The company was having a costume contest. Andrew wore a chain mail tunic and a shield with a skull on it (both self made!). Brian wore a horse head mask, and he hefted Andrew on his back for the contest pictures. The pictures were then sent out for everyone to vote for the best costume. (Of course, I voted for them.) The horse head mask ended up in Brian's cubicle, where it still remains.

Inspired by Brian's horse head, I brought in my own horse head, a marble book end. So apparently the Analytics Team's mascot is a horse.

So with that explained, here are their nominations. First, Brian's:
I’d like to nominate Chris Swenson for a pizza day award this month for his hard work on the PDC configuration workbook. He really didn’t horse around when it came to bringing this old mare up to date, he saddled right up and did what needed to be done. Now we have a true thoroughbred of a workbook on our hands, allowing us to configure markets much more rapidly and accurately than ever before. We all really appreciate the hard work he’s put into this project and feel he deserves some recognition for his efforts.
And here's Andrew's:
I would like to nominate Chris Swenson for the pizza day award. Even though he has been yoked down by various research requests, he’s managed to plow through the workbook to make it dramatically more useful. The old version can now be put out to pasture while the improved version really gallops along like the true stallion it is. His efforts have reduced the PDC analysts work load by several hours per request. He’s really a horse of a different color.
At Pizza Day, the reader was instructed to take careful note of the theme. I was selected, and I wonder if it was mostly because of the amusing nominations. At any rate, I received the Golden Slice Award:

It reads:
As a token of our appreciation, please accept this certificate for your outstanding performance and contribution. Your hard work and dedication to this company makes you a ambassador of what CPM stands for. Not only does your attitude make a positive impact on your peers, but it sets the tone for a pleasant and productive work place. Keep up the good work!
Along with the Golden Slice came a $50 gift card!

I appreciate the recognition, especially since the work out team does is quite well hidden, in databases and used in software. Thanks Andrew and Brian!

Wednesday, January 29, 2014

Bad SQL Writing Put to Good Use

There's a certain style of writing SQL that I really don't like. Here's an example that pulls names and addresses for people in Wisconsin and Illinois:

select, a.address
from person p, person_address pa, address a
where p.person_id = pa.person_id
and pa.address_id = a.address_id
and a.state in ('WI', 'IL')

Basically, the author has stacked all the tables into the FROM statement, and specified how they join on the WHERE statement. This creates confusion about how the tables are intended to be joined as well as mixing actual filter criteria with the join conditions. However, it works, since the code results in inner joins between all tables, and that was okay.

My preferred style is like so:

select, a.address
from person p
left join person_address pa
on p.person_id = pa.person_id
inner join address a
on pa.address_id = a.address_id
where a.state in ('WI', 'IL')

It's a bit more verbose, but that helps the reader. This style splits out the tables into different statements and results in clearly indicated join types and join fields. It is clear to a reader of the code the intention of joining each table. The filter criteria are located in the WHERE statement without any other statements to confuse them with. There are still cases where the result may not be as expected based on the filter criteria, but it's easier to debug.

Overall, the first example is a confusing style to use, and it can cause trouble if the joins were intended to be outer joins and were not, because the style does not have a way to specify outer joins. (LEFT JOIN is short for LEFT OUTER JOIN, which means, basically, return all records from the first table, and any data that matches in the next without missing any records from the first.)

The other day, though, I encountered a great way to use this potentially error-prone style in a way that is actually very useful.

I wanted to generate a master list of all potential ZIP codes in the US, and then filter out ones that are not in use or are otherwise invalid. I started by creating a small table with 10 rows that consist of 1 column with the numbers 0-9.  With no loop statements available in SQL, I wrote this table like so:

create temporary table num (n int);
insert into num values (0);
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
delete from num where n > 9;

That last statement is just in case I ran it too many times. I was even too lazy to write out 1-9, instead just repeating the max+1 code 9 times. It's a bit over-the-top, but it works.

To get my "master" list of ZIP codes, I joined the table to itself 5 times, one for each character in the ZIP code (because they can start with 0s, they should be treated as characters, not numbers!). Here's how:

create table master_zip as
select n1.n||n2.n||n3.n||n4.n||n5.n as zip
from num n1, num n2, num n3, num n4, num n5
order by 1

Simple, isn't it? Essentially, this just makes a big Cartesian product of the table with itself times 4. There's no WHERE statement, because there's no need to join on anything. (If a system required it, I would just write "where 1=1".) This generates 100,000 records. That's about 58,000 too many, according the US Postal Service, so we need to delete some of those that are not in use. But that process is for another time.

Using a style that I usually do not recommend or warn against was interesting and useful, but it requires knowing a bit more about how these things work. Had I run this last bit of code on a larger data set, it would have caused a lot of problems, like running out of disk space or RAM. So use this style sparingly and carefully!

Friday, December 6, 2013

Happy Birthday to Me, From Me, Sorta

A few years ago, I worked at a company that uses SAS, and has a SAS server set up. I figured out that scheduling a SAS job was a bit clunky in Windows, so I mitigated the problem by setting up a SAS program run any tasks on a schedule, and then scheduling that task in Windows.

Since the program was evaluating dates and times to schedule tasks, I threw in a few happy birthday emails as a joke. A few years later, I woke up today to find this in my email inbox:

Happy birthday! I hope you have a great day!


The [Company Name] SAS Server
Well, I'm surprised that's still running! At least it wasn't sent from my old account on that server: they had at least updated who was running the job. But I'm a bit surprised they still let this go. Either they don't realize that the scheduling code contains the happy birthday emails, or they don't mind.

As a side note, the program also ran CheckLog, which reviews the SAS log output for issues, and it emailed the author about the status of the program. This was better than running CheckLog within a program, since the program could crash before it even got to the end, resulting in no status update when it was the most critical.

At any rate, it's good to  know what I wrote is still useful, even if it contains a bit of questionable code!

Tuesday, November 12, 2013

What is a Data Scientist?

Today on Quora, someone asked, "What are some software and skills that every Data Scientist should know?". I wrote the following as a response, reflecting on my current position and the role I play.

I started adding post-it notes with sub-titles to my name/title tag on my cube, as sort of a joke regarding the question, "What is a Data Scientist?".

Here's the current list:
  • [Client] Analyst
  • [Product A] Analyst
  • Financial Analyst
  • Sales Analyst
  • Contract Analyst
  • Quality Assurance Analyst
  • Call Center Analyst
  • Data Surgeon (aka, data mining with the intent to figure out what's wrong)
  • Data Diagnostician (alternative of above, maybe with no details to examine)
  • [Product B] Analyst
  • Database Developer
  • Bug Finder (as in software bugs)
So it would appear from this list that there isn't a lot of data science going on. And that's partially true.

Each of our clients has its own relational database, so we do "meta-queries" to access them one by one in order to answer a question. That's sort of data science like. Eventually, though, we're going to have one master database with all clients that will cascade into individual databases. So our "meta-queries" will be obsolete.

We deal with a lot of "big data" too, but it's usually not that big of a deal. Even with relational databases, it's okay. Some queries may take a little longer (30-60 minutes), but that's rare. We have some machine learning tasks that pull in massive training data sets, so at that point you have to be more careful about "big data" problems like running out of RAM or disk space. But it can be handled, and rather simply.

What I really wish I could do more of is machine learning, and while I've accumulated several ideas that would enhance products or help us make better decisions in the year I've been a Data Scientist, these other tasks take up most of my day.

In the end, I write a lot of SQL, use the Linux command line moderately, and report on data in Excel spreadsheets. I use Python occasionally to write scripts. And I'm always learning something new (new SQL techniques, Python libraries, Linux command line tools, etc.).