Monday, January 23, 2012

SQL Remerging

One of the issues that my CheckLog macro detects is the following statement:

"The query requires remerging summary statistics back with the original data."

What's going on here? Here's an example of code that generates this statement:

proc sql;
    create table patient_bp as
    select
          a.patient_id
        , a.year
        , b.exam_date
        , avg(b.bp_systolic) as bp_systolic_avg
        , avg(b.bp_diastolic) as bp_diastolic_avg
    from patient a
    left join vitals b
    on a.patient_id=b.patient_id
    where year(b.exam_date) >= (a.year-2)
    group by a.patient_id, a.exam_date
    ;
quit;

Two tables are joined, one that appears to be at the level of patient and year and another at patient and exam date. The table seeks to find any exam date greater than two years prior to the year in the patient table. The GROUP BY statement includes patient ID and exam date. However, the variable year is specified in the select statement.

Given that an extra grouper is in the data, the above query must first calculate:

    select a.patient_id, b.exam_date, avg(b.bp_systolic), avg(b.bp_diastolic)

Since the level of data, though, includes year, it must then do the following:

    select a.patient_id, a.year, b.exam_date, bp_systolic_avg, bp_diastolic_avg

Essentially, a mini-query needs to execute to create the summary variables (averages), then another mini-query to add that summary data to the main data.

So what's the issue? This seems useful. Well, the "remerging" note can obscure data issues within the data that the coder either ignored or did not expect, since only one of these statements can be generated for a SQL query. For example, the query above actually generates duplicates at the patient, year, and exam date levels, and the remerging note hints at this issue. Normally, this does not happen with "clean" data, but we rarely have clean data. Either the patient table had more than one record per patient and year, or the vitals table had more than one record per patient and exam date, which could easily occur in our system and is valid.

In the end, there are two ways to fix this and avoid the message in the log: Either split the query, like below, or add year to the GROUP BY statement (if that is acceptable - here it is, but that's not always the case). The split query is a little longer, but I've never noticed that such queries take much longer.

proc sql;
    create table _temp_ as
    select
          a.patient_id
        , b.exam_date
        , avg(b.bp_systolic) as bp_systolic_avg
        , avg(b.bp_diastolic) as bp_diastolic_avg
    from patient a
    left join vitals b
    on a.patient_id=b.patient_id
    where year(b.exam_date) >= (a.year-2)
    group by a.patient_id, a.exam_date
    ;

    create table patient_bp as
    select
          a.patient_id
        , a.year
        , b.exam_date
        , b.bp_systolic_avg
        , b.bp_diastolic_avg
    from patient a
    left join _temp_ b
    on a.patient_id=b.patient_id
    ;

    drop table _temp_;
quit;


Note that this is essentially replicating what SQL is doing in the back-end, but since it is split out, it is obvious that the programmer knew about the potential issue and purposefully coded it that way to avoid it.

Someone suggested adding "distinct" to the select statement, but the message regarding remerging does not go away when this solution is used. Thus, any other unexpected data issues may continue to be masked by this statement, since only one can be generated for each SQL step.

Any time you're doing data transformation, it's always best to test your assumptions. A good macro for testing the level of the data (input and output) is the DupCheck macro. Adding this one line of code would have told the programmer something was amiss:

    %DupCheck(patient_bp, patient_id exam_date);

The DupCheck macro and its documentation is available online.

Good luck, and be wary of remerging!

Tuesday, January 10, 2012

SCAPROC Bug in SAS

In my current position, we have a macro set up to run other programs, and it outputs a copy of the code, an MPRINT file, and the log. It also re-runs a project-specific autoexec to keep a copy in the log, analyzes the log for errors and warnings (using something other than my CheckLog macro), and it collects and outputs metadata. The part that finds and outputs metadata is called PROC SCAPROC. Here is how SAS describes it:
The SCAPROC procedure implements the SAS Code Analyzer, which captures information about input, output, and the use of macro symbols from a SAS job while it is running. The SAS Code Analyzer can write this information and the information that is in the original SAS file to a file that you specify.
Sounds nifty, right? Well, I've noticed some odd issues with it over time, and only recently was I able to put my finger on one of its problems.

The arrangement of code that results in the issue is rather odd and easily avoided. Nonetheless, in case anyone has encountered the problem, I thought I would post code to replicate the problem here. First, run this part in a new SAS session:

/* Set the base directory */
%let dir=%sysfunc(pathname(work));

/* Create a new folder */
%sysexec mkdir "&DIR.\lock";

/* Assign a library with a locked data set */
libname lock "&DIR.\lock";

/* Create a locked data set */
data lock.locked(pw='asdf1234');
    x=1;
run;

So we've created a new library with a locked data set based on our work library. Now, try out SCAPROC with the following code and see what happens:

/* Copy the CARS data set */
data cars;
    set sashelp.cars;
run;

/* Create a view */
proc sql;
    create view badview as
    select type
    from cars
    ;
quit;

/* Remove type from the data set */
/* This makes the view (i)nvalid */
data cars;
    set cars;
    drop type;
run;

/* Start PROC SCAPROC */
proc scaproc;
    record "&DIR.\lock\scaproc_out.txt" attr;
run;

/* Copy data out of SASHELP */
data test;
    set sashelp.vcolumn(where=(libname="WORK"));
run;

/* Write out SCAPROC */
proc scaproc;
    write;
run;

And now we've created a view from a SASHELP table and made it invalid by removing the column that the view uses in its definition. We started SCAPROC and attempted to use another SASHELP table that contains metadata about all of the columns available to SAS. Did you get a pop-up for a password at this point? If so, that's part of the bug. Next, SCAPROC ends. Take a look at the log: are there some error messages? There should be at least one regarding the invalid view. Now take a look at the output of SCAPROC. It's a bit inflated for such a small program, especially if you have a large number of libraries assigned. Assign a bunch of libraries and try the code again, and it will explode.

The issue here has a few parts: First, we need an invalid view and a fully password-protected data set. Second, we need to run SCAPROC; and finally, we need to run a DATA step against SASHELP. The invalid view causes SCAPROC, upon encountering the SASHELP.Vcolumn data set, to consider all data sets listed in the table as input to the program. Interestingly, if you replace the DATA step with SQL, the issue disappears, except for the error regarding the invalid view. Dropping the use of SCAPROC also removes the problem (obviously).

Why did the password pop-up appear? Since SCAPROC considered all data sets as input, it attempted to collect metadata on the locked data set. A sub-bug of the main bug is that SCAPROC should not query locked data sets, since it won't know the password. If you were to run it in batch mode, there wouldn't be anyone around to enter a password. Another sub-bug is that the DATA step somehow tells SCAPROC that all available data is an input.

As a consequence of this issue, we had a program run where the metadata output of SCAPROC that contained variable names was a SAS data set with 5.1 million rows at 972 MB. That was huge, and the SCAPROC output file was also huge, at 751 MB. And it's just a text file - no SAS header, column structure, anything!

I submitted the bug set to SAS, but in the meantime if anyone encounters this issue, the easiest way to fix it is to either fix the broken views, replace the DATA step with SQL, or both. Good luck, and please let me know if you encounter the issue yourself, or any related issues!

Wednesday, January 4, 2012

SAS Global Forum

I'm a SAS programmer, and I've always wanted to attend one of those big SAS conferences. So this year, when I heard about the SAS Global Forum, I applied for a grant from the University of Wisconsin and submitted two papers to present at the conference.

A few weeks ago I heard that I won the grant, and just last week I found out that both of my papers were accepted by the conference organizers. I'll be presenting both papers in the Coder's Corner section, with the following titles:

An Advanced, Multi-Featured Macro Program for Reviewing Logs
Standardized Macro Programs for Macro Variable Manipulation

If you plan on attending and you happen to follow this blog, please feel free to stop by and say hello! If not, I'll post the links to the papers as soon as their available from the conference organizers. For more information about the conference, go to http://support.sas.com/events/sasglobalforum/2012/index.html.

Hope to see you there!