Tuesday, November 25, 2014

Strange SAS Error Message

I spent far too long trying to debug a strange error message in SAS. The solution ended up being aggravatingly simple, but arriving at the solution was not. So I wanted to share the problem and solution. (I discovered this issue using SAS 9.1.3.)

MPRINT(TRANSPOSE): proc datasets nolist;
SYMBOLGEN: Macro variable OUT resolves to lib.table_name
NOTE: Line generated by the invoked macro "TRANSPOSE".
161 modify &OUT; label &NEWVAR = "&NEWLBL"; quit;

MPRINT(TRANSPOSE): modify lib.table_name;
NOTE: Line generated by the macro variable "OUT".
161 lib.table_name
NOTE: Enter RUN; to continue or QUIT; to end the procedure.

SYMBOLGEN: Macro variable NEWVAR resolves to count
SYMBOLGEN: Macro variable NEWLBL resolves to count: prevalence
MPRINT(TRANSPOSE): label count = "count: prevalence";

ERROR 200-322: The symbol is not recognized and will be ignored.
ERROR 22-322: Expecting a name.

NOTE: Statements not processed because of errors noted above.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: SAS set option OBS=0 and will continue to check statements. This may cause NOTE: No observations in data set.
NOTE: PROCEDURE DATASETS used (Total process time):
      real time 0.00 seconds
      cpu time 0.00 seconds

If you start at the error messages and read downward, you'll miss the true source of the error; the real problem appears earlier in the log.

So for those not familiar with PROC DATASETS, one of the options on the PROC DATASETS statement is the LIBRARY= option, which defaults to WORK. This PROC DATASETS step is therefore looking only in WORK for the dataset lib.table_name. Since MODIFY expects a one-level member name, the "lib." prefix is not a valid part of the name, so it's not parsed correctly, resulting in an error.

The error seems to cause the SAS code parser to trip and read the rest of the code incorrectly, making it look as though the following step is where the error occurs. However, it's actually the earlier statement ("modify &OUT", or the resolved code, "modify lib.table_name").

The lesson is twofold: 1) Don't use a two-level name (e.g., "lib.table_name") when the LIBRARY= option on PROC DATASETS is not used. 2) Don't write a macro that accepts an output data set (e.g., OUT=) and uses PROC DATASETS without checking whether the user specified an output library. Alternatively, don't use PROC DATASETS at all. (There are some advantages to PROC DATASETS, so don't throw the baby out with the bathwater!)
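The library check in lesson 2 isn't SAS-specific. Here's a minimal sketch in Python (hypothetical function name; a real macro would implement this in SAS macro logic) of splitting a two-level name so the libref and member can be handled separately:

```python
def split_dataset_name(name):
    """Split a SAS-style dataset reference into (libref, member).

    A two-level name like "lib.table_name" carries an explicit library;
    a one-level name falls back to the default WORK library.
    """
    if "." in name:
        libref, member = name.split(".", 1)
        return libref, member
    return "WORK", name

# A macro could pass the libref to LIBRARY= and the member to MODIFY,
# instead of handing "lib.table_name" straight to MODIFY.
```

With a check like this, the macro could support OUT=lib.table_name instead of silently assuming everything lives in WORK.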

I wasted a lot of time on this because I thought the macro could handle any input/output specification, but apparently that is not true. The solution for me was to remove the "lib." from the OUT= argument to the macro ("&OUT"). Since PROC DATASETS was looking in WORK, it found the dataset after I made this "correction".

Thursday, May 22, 2014

A Great Quote and Idea

I was browsing through Quora today and found this quote, on the topic of "What are the top 10 things that we should be informed about in life?":

"Bear in mind... that your opponent in any debate is not the other person, but ignorance."
– Justin Freeman

What a great quote. It embodies many lessons all in one. Don't focus on the person you're arguing with. In fact, don't argue. Persuade them away from ignorance. Brilliant quote.

Friday, April 11, 2014

How I Reacted to Heartbleed

Recently a bug was revealed in OpenSSL, called Heartbleed. One of the very unfortunate aspects of this bug was the potential that all passwords and cryptographic keys on a server could have been dumped to any hacker who knew about the flaw. As users, we can't control the cryptographic keys; however, we can control our passwords.

I was able to quickly identify my key accounts at risk in order to reset each password. How? With a password management database, KeePass. I also had help from the well-documented Mashable article.

I have all of my passwords stored in a KeePass database. Each account is categorized into groups like Banking, Email, Investments, etc. (see screenshot above). Using the Mashable article, I edited each description, adding "OpenSSL" for each account that used OpenSSL. I also ran a few online searches to determine the status of other accounts I wasn't sure about.

Once I finished that, I searched through the entire database (over 240 passwords!!) for "OpenSSL" to list all the entries together. Starting with my financial accounts in Banking, Investments, and other categories, I changed each password, one by one.

It took about an hour, but who out there without a password management database can say they reset all their key accounts so quickly?

Now I'm not going to stop there. I'll be resetting some passwords once a week until I hear this issue is cleared up. (Hence entering "OpenSSL" permanently in each account description.) It's going to take time to re-issue those cryptographic keys and for all the certificate authorities to synchronize. For these important accounts, I don't want to risk losing control.

In the end, if you are a user of web services like me, start using a password management database right now. (I would not use LastPass since it is online. However, according to their documentation, they use forward secrecy, which is currently the best protection available for this kind of service and would help prevent any true information leak.)

Start by entering your most important accounts and change the passwords to random passwords. Gradually add those you don't use as frequently or are not as important. Then set a schedule for each important password to expire on a regular basis: every 2 months, 6 months, or whatever you think is best for the account. My financial accounts are reset every 3-6 months.

Hopefully webmasters, server admins, cryptographers, and anyone else involved in this ecosystem starts to realize that we have a broken internet. Encryption technologies are failing and need a serious upgrade. In some places, we don't even have encryption, and it's harming trust. It's up to the gatekeepers to keep us safe and to promote trust - we users can only do so much.

Thursday, March 20, 2014

Pizza Day Award

At CPM Healthgrades, we have a monthly Pizza Day. Originally the idea was to honor employee birthdays and work anniversaries, and instead of buying each person lunch on multiple days, leadership bought everyone lunch, specifically pizza, once a month.

Before the pizza would arrive, someone would read a list of employee birthdays and anniversaries. Eventually, leadership realized that the lists were too long and dropped specific recognition. However, they still announce new employees, company news, and honor an employee of the month. Nominations are requested for employee of the month, called the Pizza Day Award, and the nominations, sometimes amusing, are read aloud. The winner receives the Golden Pizza Slice (pictured above).

This month, I was nominated. Twice! Before I get to their nominations, I need to explain a certain recurring theme.

My coworkers, Andrew and Brian, dressed as a knight and steed for Halloween. The company was having a costume contest. Andrew wore a chain mail tunic and a shield with a skull on it (both self-made!). Brian wore a horse head mask, and he hefted Andrew on his back for the contest pictures. The pictures were then sent out for everyone to vote for the best costume. (Of course, I voted for them.) The horse head mask ended up in Brian's cubicle, where it still remains.

Inspired by Brian's horse head, I brought in my own horse head, a marble book end. So apparently the Analytics Team's mascot is a horse.

So with that explained, here are their nominations. First, Brian's:
I’d like to nominate Chris Swenson for a pizza day award this month for his hard work on the PDC configuration workbook. He really didn’t horse around when it came to bringing this old mare up to date, he saddled right up and did what needed to be done. Now we have a true thoroughbred of a workbook on our hands, allowing us to configure markets much more rapidly and accurately than ever before. We all really appreciate the hard work he’s put into this project and feel he deserves some recognition for his efforts.
And here's Andrew's:
I would like to nominate Chris Swenson for the pizza day award. Even though he has been yoked down by various research requests, he’s managed to plow through the workbook to make it dramatically more useful. The old version can now be put out to pasture while the improved version really gallops along like the true stallion it is. His efforts have reduced the PDC analysts work load by several hours per request. He’s really a horse of a different color.
At Pizza Day, the reader was instructed to take careful note of the theme. I was selected, and I wonder if it was mostly because of the amusing nominations. At any rate, I received the Golden Slice Award:

It reads:
As a token of our appreciation, please accept this certificate for your outstanding performance and contribution. Your hard work and dedication to this company makes you a ambassador of what CPM stands for. Not only does your attitude make a positive impact on your peers, but it sets the tone for a pleasant and productive work place. Keep up the good work!
Along with the Golden Slice came a $50 gift card!

I appreciate the recognition, especially since the work our team does is quite well hidden, in databases and inside software. Thanks Andrew and Brian!

Wednesday, January 29, 2014

Bad SQL Writing Put to Good Use

There's a certain style of writing SQL that I really don't like. Here's an example that pulls names and addresses for people in Wisconsin and Illinois:

select p.name, a.address
from person p, person_address pa, address a
where p.person_id = pa.person_id
and pa.address_id = a.address_id
and a.state in ('WI', 'IL')

Basically, the author has stacked all the tables into the FROM clause and specified how they join in the WHERE clause. This obscures how the tables are intended to be joined and mixes the actual filter criteria in with the join conditions. It works, though, since the code results in inner joins between all the tables, and in this case that was the intent.

My preferred style is like so:

select p.name, a.address
from person p
left join person_address pa
on p.person_id = pa.person_id
inner join address a
on pa.address_id = a.address_id
where a.state in ('WI', 'IL')

It's a bit more verbose, but that helps the reader. This style splits the tables into separate clauses, clearly indicating the join types and join fields, so the intention behind each join is obvious. The filter criteria sit in the WHERE clause with no join conditions to confuse them with. There are still cases where the result may not match what the filter criteria suggest, but it's easier to debug.

Overall, the first example is a confusing style to use, and it can cause trouble if the joins were intended to be outer joins and were not, because the style has no standard way to specify outer joins. (LEFT JOIN is short for LEFT OUTER JOIN, which means, basically, return all records from the first table, along with any matching data from the next, without dropping any records from the first.)
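For pure inner joins, the two styles really are equivalent, which is easy to verify. A quick sketch using Python's built-in sqlite3 module, with made-up sample rows (the table and column names come from the example above):

```python
import sqlite3

# Tiny in-memory schema mirroring the example queries.
con = sqlite3.connect(":memory:")
con.executescript("""
create table person (person_id int, name text);
create table person_address (person_id int, address_id int);
create table address (address_id int, address text, state text);
insert into person values (1, 'Ann'), (2, 'Bob');
insert into person_address values (1, 10), (2, 20);
insert into address values (10, '1 Main St', 'WI'), (20, '2 Elm St', 'CA');
""")

# Old style: tables stacked in FROM, joins buried in WHERE.
implicit = con.execute("""
    select p.name, a.address
    from person p, person_address pa, address a
    where p.person_id = pa.person_id
      and pa.address_id = a.address_id
      and a.state in ('WI', 'IL')
""").fetchall()

# Explicit style: join types and join fields spelled out.
explicit = con.execute("""
    select p.name, a.address
    from person p
    inner join person_address pa on p.person_id = pa.person_id
    inner join address a on pa.address_id = a.address_id
    where a.state in ('WI', 'IL')
""").fetchall()

# Both return the same inner-join result.
```

The explicit version costs a few extra lines, but if one of those joins ever needs to become a LEFT JOIN, the change is one keyword instead of a restructuring.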

The other day, though, I encountered a great way to use this potentially error-prone style in a way that is actually very useful.

I wanted to generate a master list of all potential ZIP codes in the US, and then filter out ones that are not in use or are otherwise invalid. I started by creating a small table with 10 rows that consist of 1 column with the numbers 0-9.  With no loop statements available in SQL, I wrote this table like so:

create temporary table num (n int);
insert into num values (0);
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
insert into num select max(n)+1 from num;
delete from num where n > 9;

That last statement is there in case I ran the inserts too many times (and as written, there is one extra max+1 insert, so the DELETE cleans it up). I was even too lazy to write out 1-9, instead just repeating the max+1 code. It's a bit over-the-top, but it works.
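The same trick runs on just about any database. Here's a sketch reproducing it with Python's built-in sqlite3 module (in-memory database, same table and column names as above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table num (n int)")
con.execute("insert into num values (0)")

# Repeat the max+1 insert to grow the table one row at a time,
# just like pasting the statement over and over.
for _ in range(10):
    con.execute("insert into num select max(n)+1 from num")

# Safety net from the post: trim anything past 9.
con.execute("delete from num where n > 9")

digits = [row[0] for row in con.execute("select n from num order by n")]
```

After the DELETE, `digits` holds exactly the ten values 0 through 9.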

To get my "master" list of ZIP codes, I cross-joined five copies of the table, one for each character in the ZIP code (because ZIP codes can start with 0s, they should be treated as characters, not numbers!). Here's how:

create table master_zip as
select n1.n||n2.n||n3.n||n4.n||n5.n as zip
from num n1, num n2, num n3, num n4, num n5
order by 1

Simple, isn't it? Essentially, this just makes a big Cartesian product of five copies of the table. There's no WHERE statement, because there's no need to join on anything. (If a system required one, I would just write "where 1=1".) This generates 100,000 records. That's about 58,000 too many, according to the US Postal Service, so we need to delete those that are not in use. But that process is for another time.
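Here's the same Cartesian product sketched with Python's sqlite3 (the num table is loaded directly rather than with the max+1 trick). Note that SQLite's || operator converts the integer digits to text, which is exactly what preserves the leading zeros:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table num (n int)")
con.executemany("insert into num values (?)", [(i,) for i in range(10)])

# Cross join five copies of num; || concatenates as text, so '00000'
# through '99999' come out as 5-character strings with leading zeros.
zips = [row[0] for row in con.execute("""
    select n1.n || n2.n || n3.n || n4.n || n5.n as zip
    from num n1, num n2, num n3, num n4, num n5
    order by 1
""")]
```

The result is all 100,000 candidate ZIP strings, from '00000' to '99999', ready to be filtered against a list of ZIPs actually in use.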

Using a style that I usually do not recommend, or even warn against, turned out to be interesting and useful, but it requires knowing a bit more about how these things work. Had I run this last bit of code against a larger table, it could have caused a lot of problems, like running out of disk space or RAM. So use this style sparingly and carefully!