Monday, December 5, 2011

A SAS Macro to Convert Text to a Persistent ID

I was working on a project in which we usually carried forward a code and a description for every category to make it easier on coders to select a desired category using an ID rather than the long and sometimes-changing description. I found a particular category in which no ID existed, and there were no similar categories that I could use as the basis for a new ID. So what could I do?

In a previous position, I had the same problem crop up, and I solved the problem by doing the following:
  1. Extract one character at a time from the original categorical variable (e.g., 'A').
  2. Convert the character to its binary representation (e.g., '01000001').
  3. Convert the binary representation to the numeric value (e.g., 65).
  4. Sum the final value of each character (e.g., 'ABC' = 65 + 66 + 67 = 198).
Combined, these steps only take up 3 lines of code (excluding the data step code):

do _i_=1 to length(Category);
  ID + input(put(substr(Category, _i_, 1), $binary8.), binary8.);
end;

And the process can be run as a macro program:

http://sas.cswenson.com/downloads/macros/TextID.sas

Note: Run the code in test mode (e.g., %TextID(data, var, test=Y) ) to see each step broken out into separate columns.

The result is a unique ID that will never change for the particular configuration of characters. Keep in mind, however, that the ID is not only character-sensitive, but case sensitive, so "Category" is not the same as "category". To avoid this problem, force the category into upper- or lower-case using the UPPER or LOWER functions.

Using this process, we were able to add a constant, unchanging ID for each category. The ID will not vary depending on the number of categories, the time at which the code is run, or any other variable (other than user-error, if the categories are typed by hand!).

If anyone uses this process, please let me know what you used it for! Oh, and it would be nice if you gave me credit, too!

5 comments:

  1. Hi Chris,

    If your ID value is character, then you might be able to use the MD5 function to generate a "hash" value for the string. The MD5 function generates a 128-bit (16-hex character) value. Example:

    data withHashes;
    length ID $ 16 Description $ 20;
    infile datalines4 dsd;
    input ID Description;
    if (ID = .) then ID = put(MD5(Description),hex32.);
    datalines4;
    100, Hundred
    200, TwoHundred
    ,Huh
    300, ThreeHundred
    ,AnotherUnknown
    ;;;;
    run;

    ReplyDelete
  2. @Chris Hemedinger: Yes, I thought this was quite similar to MD5 hashing. I like the process I wrote as it only results in numeric IDs, which are a bit easier to use in code (e.g., ID = 52018).

    ReplyDelete
  3. Hello Chris,

    I think there is a problem with your function, that it make no diffences between ABC BAC BCA. I tried to run it but compress has too many arguments

    ReplyDelete
    Replies
    1. David, it sounds like you might be on an older version of SAS. What version are you running?

      Delete
  4. This comment has been removed by a blog administrator.

    ReplyDelete

Note: Only a member of this blog may post a comment.