Monday, January 23, 2012

SQL Remerging

One of the issues that my CheckLog macro detects is the following statement:

"The query requires remerging summary statistics back with the original data."

What's going on here? Here's an example of code that generates this statement:

proc sql;
    create table patient_bp as
    select
          a.patient_id
        , a.year
        , b.exam_date
        , avg(b.bp_systolic) as bp_systolic_avg
        , avg(b.bp_diastolic) as bp_diastolic_avg
    from patient a
    left join vitals b
    on a.patient_id=b.patient_id
    where year(b.exam_date) >= (a.year-2)
    group by a.patient_id, a.exam_date
    ;
quit;

Two tables are joined, one that appears to be at the level of patient and year and another at patient and exam date. The table seeks to find any exam date greater than two years prior to the year in the patient table. The GROUP BY statement includes patient ID and exam date. However, the variable year is specified in the select statement.

Given that an extra grouper is in the data, the above query must first calculate:

    select a.patient_id, b.exam_date, avg(b.bp_systolic), avg(b.bp_diastolic)

Since the level of data, though, includes year, it must then do the following:

    select a.patient_id, a.year, b.exam_date, bp_systolic_avg, bp_diastolic_avg

Essentially, a mini-query needs to execute to create the summary variables (averages), then another mini-query to add that summary data to the main data.

So what's the issue? This seems useful. Well, the "remerging" note can obscure data issues within the data that the coder either ignored or did not expect, since only one of these statements can be generated for a SQL query. For example, the query above actually generates duplicates at the patient, year, and exam date levels, and the remerging note hints at this issue. Normally, this does not happen with "clean" data, but we rarely have clean data. Either the patient table had more than one record per patient and year, or the vitals table had more than one record per patient and exam date, which could easily occur in our system and is valid.

In the end, there are two ways to fix this and avoid the message in the log: Either split the query, like below, or add year to the GROUP BY statement (if that is acceptable - here it is, but that's not always the case). The split query is a little longer, but I've never noticed that such queries take much longer.

proc sql;
    create table _temp_ as
    select
          a.patient_id
        , b.exam_date
        , avg(b.bp_systolic) as bp_systolic_avg
        , avg(b.bp_diastolic) as bp_diastolic_avg
    from patient a
    left join vitals b
    on a.patient_id=b.patient_id
    where year(b.exam_date) >= (a.year-2)
    group by a.patient_id, a.exam_date
    ;

    create table patient_bp as
    select
          a.patient_id
        , a.year
        , b.exam_date
        , b.bp_systolic_avg
        , b.bp_diastolic_avg
    from patient a
    left join _temp_ b
    on a.patient_id=b.patient_id
    ;

    drop table _temp_;
quit;


Note that this is essentially replicating what SQL is doing in the back-end, but since it is split out, it is obvious that the programmer knew about the potential issue and purposefully coded it that way to avoid it.

Someone suggested adding "distinct" to the select statement, but the message regarding remerging does not go away when this solution is used. Thus, any other unexpected data issues may continue to be masked by this statement, since only one can be generated for each SQL step.

Any time you're doing data transformation, it's always best to test your assumptions. A good macro for testing the level of the data (input and output) is the DupCheck macro. Adding this one line of code would have told the programmer something was amiss:

    %DupCheck(patient_bp, patient_id exam_date);

The DupCheck macro and its documentation is available online.

Good luck, and be wary of remerging!

2 comments:

  1. More on the "Remerge" message is in this SAS Note:
    http://support.sas.com/kb/4/308.html

    And you can turn the WARNING into an ERROR if you want to be extra strict about the situation, using the NOREMERGE option on PROC SQL

    ReplyDelete
    Replies
    1. Ah, I should keep that in mind when setting up project autoexecs...

      Delete

Note: Only a member of this blog may post a comment.