One of the issues that my CheckLog macro detects is the following statement:
"The query requires remerging summary statistics back with the original data."
What's going on here? Here's an example of code that generates this statement:
proc sql;
create table patient_bp as
select
a.patient_id
, a.year
, b.exam_date
, avg(b.bp_systolic) as bp_systolic_avg
, avg(b.bp_diastolic) as bp_diastolic_avg
from patient a
left join vitals b
on a.patient_id=b.patient_id
where year(b.exam_date) >= (a.year-2)
group by a.patient_id, a.exam_date
;
quit;
Two tables are joined, one that appears to be at the level of patient
and year and another at patient and exam date. The table seeks to find
any exam date greater than two years prior to the year in the patient
table. The GROUP BY statement includes patient ID and exam date.
However, the variable year is specified in the select statement.
Given that an extra grouper is in the data, the above query must first calculate:
select a.patient_id, b.exam_date, avg(b.bp_systolic), avg(b.bp_diastolic)
Since the level of data, though, includes year, it must then do the following:
select a.patient_id, a.year, b.exam_date, bp_systolic_avg, bp_diastolic_avg
Essentially, a mini-query needs to execute to create the summary
variables (averages), then another mini-query to add that summary data
to the main data.
So what's the issue? This seems useful. Well, the "remerging" note can
obscure data issues within the data that the coder either ignored or did
not expect, since only one of these statements can be generated for a
SQL query. For example, the query above actually generates duplicates at
the patient, year, and exam date levels, and the remerging note hints
at this issue. Normally, this does not happen with "clean" data, but we
rarely have clean data. Either the patient table had more than one
record per patient and year, or the vitals table had more than one
record per patient and exam date, which could easily occur in our system
and is valid.
In the end, there are two ways to fix this and avoid the message in the
log: Either split the query, like below, or add year to the GROUP BY
statement (if that is acceptable - here it is, but that's not always the
case). The split query is a little longer, but I've never noticed that
such queries take much longer.
proc sql;
create table _temp_ as
select
a.patient_id
, b.exam_date
, avg(b.bp_systolic) as bp_systolic_avg
, avg(b.bp_diastolic) as bp_diastolic_avg
from patient a
left join vitals b
on a.patient_id=b.patient_id
where year(b.exam_date) >= (a.year-2)
group by a.patient_id, a.exam_date
;
create table patient_bp as
select
a.patient_id
, a.year
, b.exam_date
, b.bp_systolic_avg
, b.bp_diastolic_avg
from patient a
left join _temp_ b
on a.patient_id=b.patient_id
;
drop table _temp_;
quit;
Note that this is essentially replicating what SQL is doing in the
back-end, but since it is split out, it is obvious that the programmer
knew about the potential issue and purposefully coded it that way to
avoid it.
Someone suggested adding "distinct" to the select statement, but the
message regarding remerging does not go away when this solution is used.
Thus, any other unexpected data issues may continue to be masked by
this statement, since only one can be generated for each SQL step.
Any time you're doing data transformation, it's always best to test your
assumptions. A good macro for testing the level of the data (input and
output) is the DupCheck macro. Adding this one line of code would have
told the programmer something was amiss:
%DupCheck(patient_bp, patient_id exam_date);
The DupCheck macro and its documentation is available online.
Good luck, and be wary of remerging!
More on the "Remerge" message is in this SAS Note:
ReplyDeletehttp://support.sas.com/kb/4/308.html
And you can turn the WARNING into an ERROR if you want to be extra strict about the situation, using the NOREMERGE option on PROC SQL
Ah, I should keep that in mind when setting up project autoexecs...
Delete