Each character in a conventional character set can be represented as a number of hexadecimal digits. Although some of these character sets define a sequential ordering of weights used for comparison, the comparison values are typically expanded to a much larger size than the original characters and cannot be inverted to recover them.
For example, each character in the Unicode Standard is represented by a 2-byte value that is specified as 4 hexadecimal digits, from 0x0000 to 0xFFFF. The Latin capital letter “A”, for instance, is represented in Unicode (UC) as 0x0041.
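The mapping from a character to its 4-hexadecimal-digit value can be illustrated with a minimal Python sketch. The helper name `ucs2_hex` is hypothetical and is introduced here only for illustration; the sketch assumes the 16-bit range described above and rejects anything outside it.

```python
def ucs2_hex(ch: str) -> str:
    """Return the 4-hex-digit value of a single character in the 16-bit range."""
    code = ord(ch)  # numeric value of the character
    if code > 0xFFFF:
        raise ValueError("character lies outside the 16-bit range described here")
    return f"0x{code:04X}"


print(ucs2_hex("A"))              # 0x0041
print("A".encode("utf-16-be"))    # the same value as two bytes: b'\x00A'
```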
The international Unicode character standard was defined without a detailed ordering on its characters, although the Unicode Standard 3.0 covers the concept of “Levels of Comparison”, which corresponds to “Weights” in Microsoft NT, produced by Microsoft Corporation of Redmond, Wash. Microsoft NT defined an ordering of Unicode much earlier, based on the lexicographic order of a hierarchical sequence of weights, and this ordering was adopted in Microsoft SQL Server indexes.
Transformation of Unicode strings into strings of byte-weights that can be compared byte-by-byte has existed in NT for some time. However, this conventional byte-weight solution expands the Unicode (UC) values to a much larger size, and provides no way to invert the result to recover the original Unicode string.
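Both drawbacks can be seen in a toy byte-weight transformation, sketched below in Python. Everything in it is an assumption made for illustration: the weight values, the three-level layout (primary, secondary, tertiary weights separated by a low byte), the treatment of the hyphen as ignorable, and the names `WEIGHTS`, `IGNORABLE`, `SEP` and `sort_key`. It is not the NT weight table or the NT algorithm.

```python
# Hypothetical weights: char -> (primary, secondary, tertiary).
# Lowercase and uppercase share a primary weight and differ only at the
# tertiary level. The values are invented; they are not the NT weights.
WEIGHTS = {}
for i, ch in enumerate("abcop"):
    WEIGHTS[ch] = (0x10 + i, 0x02, 0x02)
    WEIGHTS[ch.upper()] = (0x10 + i, 0x02, 0x03)

IGNORABLE = {"-"}   # assumed to contribute no weights at all
SEP = 0x01          # level separator, lower than every weight


def sort_key(s: str) -> bytes:
    """Build a key whose plain byte order gives the hierarchical weight order."""
    levels = ([], [], [])
    for ch in s:
        if ch in IGNORABLE:
            continue
        for level, weight in zip(levels, WEIGHTS[ch]):
            level.append(weight)
    # All primary weights first, then all secondary, then all tertiary.
    return bytes(levels[0] + [SEP] + levels[1] + [SEP] + levels[2])


# Byte-by-byte comparison of the keys yields the weight-based order.
print(sorted(["b", "Ab", "ab"], key=sort_key))               # ['ab', 'Ab', 'b']

# Expansion: the key is larger than the 2-byte-per-character original.
print(len("ab".encode("utf-16-be")), len(sort_key("ab")))    # 4 8

# Not invertible: two different strings produce the same key.
print(sort_key("co-op") == sort_key("coop"))                 # True
```

In this sketch the key for “ab” occupies 8 bytes against 4 bytes for the original string, and “co-op” and “coop” collapse to one key, so the original string cannot be recovered from the key alone.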
The previous approach to indexing Unicode strings in SQL Server was to hold the strings in their original Unicode form and to compare them using the “DBLCCompareString” call, which performed a finite-state machine calculation based on NT weights. This approach consumed substantial CPU time in comparisons, and also made prior-key prefix compression of strings in index entries less effective, because Unicode strings with different prefix character sequences could sort together.
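The effect on prefix compression can be sketched in Python as follows. The sketch assumes a simple scheme in which each index entry omits the leading characters it shares with the previous entry, and it uses a case-insensitive order (`str.lower`) as a stand-in for a weight-based ordering; the function names and the sample words are hypothetical.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_savings(keys: list) -> int:
    """Characters saved when each entry omits the prefix shared with the prior entry."""
    return sum(shared_prefix_len(prev, cur) for prev, cur in zip(keys, keys[1:]))


words = ["apricot", "Apple", "apple", "Apricot"]

# Entries held in original form but ordered case-insensitively:
# neighbors such as "Apple" and "apple" share no literal prefix.
ordered = sorted(words, key=str.lower)
print(ordered)                   # ['Apple', 'apple', 'apricot', 'Apricot']
print(prefix_savings(ordered))   # 2

# The same entries stored in a normalized form: neighbors share long prefixes.
normalized = [w.lower() for w in ordered]
print(prefix_savings(normalized))  # 14
```

Under these assumptions, entries that sort next to each other but begin with different characters save almost nothing, while the normalized entries compress well.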
In view of the foregoing, there is a need for systems and methods that normalize character sets, such as Unicode, to a compressed, invertible representation that overcomes the limitations and drawbacks of the prior art.