In a previous position, I had the same problem crop up, and I solved the problem by doing the following:
- Extract one character at a time from the original categorical variable (e.g., 'A').
- Convert the character to its binary representation (e.g., '01000001').
- Convert the binary representation to the numeric value (e.g., 65).
- Sum the final value of each character (e.g., 'ABC' = 65 + 66 + 67 = 198).
do _i_=1 to length(Category);
ID + input(put(substr(Category, _i_, 1), $binary8.), binary8.);
end;
And the process can be run as a macro program:
http://sas.cswenson.com/downloads/macros/TextID.sas
Note: Run the code in test mode (e.g., %TextID(data, var, test=Y) ) to see each step broken out into separate columns.
The result is a unique ID that will never change for the particular configuration of characters. Keep in mind, however, that the ID is not only character-sensitive, but case sensitive, so "Category" is not the same as "category". To avoid this problem, force the category into upper- or lower-case using the UPPER or LOWER functions.
Using this process, we were able to add a constant, unchanging ID for each category. The ID will not vary depending on the number of categories, the time at which the code is run, or any other variable (other than user-error, if the categories are typed by hand!).
If anyone uses this process, please let me know what you used it for! Oh, and it would be nice if you gave me credit, too!