In a previous position, I had the same problem crop up, and I solved the problem by doing the following:
- Extract one character at a time from the original categorical variable (e.g., 'A').
- Convert the character to its binary representation (e.g., '01000001').
- Convert the binary representation to the numeric value (e.g., 65).
- Sum the final value of each character (e.g., 'ABC' = 65 + 66 + 67 = 198).
do _i_=1 to length(Category);
ID + input(put(substr(Category, _i_, 1), $binary8.), binary8.);
end;
And the process can be run as a macro program:
http://sas.cswenson.com/downloads/macros/TextID.sas
Note: Run the code in test mode (e.g., %TextID(data, var, test=Y) ) to see each step broken out into separate columns.
The result is a unique ID that will never change for the particular configuration of characters. Keep in mind, however, that the ID is not only character-sensitive, but case sensitive, so "Category" is not the same as "category". To avoid this problem, force the category into upper- or lower-case using the UPPER or LOWER functions.
Using this process, we were able to add a constant, unchanging ID for each category. The ID will not vary depending on the number of categories, the time at which the code is run, or any other variable (other than user-error, if the categories are typed by hand!).
If anyone uses this process, please let me know what you used it for! Oh, and it would be nice if you gave me credit, too!
Hi Chris,
ReplyDeleteIf your ID value is character, then you might be able to use the MD5 function to generate a "hash" value for the string. The MD5 function generates a 128-bit (16-hex character) value. Example:
data withHashes;
length ID $ 16 Description $ 20;
infile datalines4 dsd;
input ID Description;
if (ID = .) then ID = put(MD5(Description),hex32.);
datalines4;
100, Hundred
200, TwoHundred
,Huh
300, ThreeHundred
,AnotherUnknown
;;;;
run;
@Chris Hemedinger: Yes, I thought this was quite similar to MD5 hashing. I like the process I wrote as it only results in numeric IDs, which are a bit easier to use in code (e.g., ID = 52018).
ReplyDeleteHello Chris,
ReplyDeleteI think there is a problem with your function, that it make no diffences between ABC BAC BCA. I tried to run it but compress has too many arguments
David, it sounds like you might be on an older version of SAS. What version are you running?
DeleteThis comment has been removed by a blog administrator.
ReplyDelete