The invention disclosed broadly relates to the field of information handling systems, and more particularly to the field of data compression.
Expressing the same digital information with fewer bits is a continuing challenge in the field of information technology. This is particularly the case with the various standard character sets that have been adopted for expressing alphanumeric characters in digital form. One such character set is Unicode, a superset of the ASCII (American Standard Code for Information Interchange) character set that uses two or more bytes for each character so that it can accommodate the alphabets of most of the world's languages. Under the Unicode scheme, as in others, a unique number represents each character. Other character sets use a different set of numbers to represent characters. Unicode generally requires more data to specify alphanumeric characters than ASCII because it can express characters in various alphabets.
There is thus a need for a compression method for Unicode that is well suited for certain classes of applications, such as large databases. There is a further need for a compression method that compresses small strings well, such as individual fields in a database. In these situations, compression mechanisms such as LZW (Lempel-Ziv-Welch) do not work well because they are better suited to large bodies of text.
In addition, there is a need for one very important characteristic: preservation of binary order. For many applications, it is very important that compressed Unicode fields in a database have the same binary order as the corresponding uncompressed fields. Other encoding schemes such as SCSU (Standard Compression Scheme for Unicode) produce output in essentially random binary order, which makes them unsuitable in many applications.
Briefly, according to the invention, a system and method for encoding an input sequence of code points to produce an output sequence of bytes include the steps of:
receiving a plurality of values, each value representing a code point (character) in the input sequence;
calculating a signed delta value for each code point in the input sequence, wherein each delta value is determined by subtracting the value of a base code point from the value of the current code point to produce the delta value for the current code point;
encoding each delta value into a set of bytes wherein small deltas are encoded in a small number of bytes and larger delta values are encoded in successively larger numbers of bytes;
selecting a lead byte value for the output sequence so that the binary order of the output sequence is the same as the binary order of the input sequence; and
writing to the output sequence each delta value for each code point in the input sequence.
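The steps above can be illustrated with a minimal Python sketch. It is not the byte layout of the invention: the range table RANGES, the function names, and the choice of base code point (zero at the start of the sequence, the previous code point thereafter) are assumptions made only for illustration. Each range of delta values is assigned its own block of lead bytes, with more negative deltas receiving lower lead bytes and more positive deltas receiving higher ones, so that a bytewise comparison of two outputs agrees with a code point comparison of the corresponding inputs.

```python
# Hypothetical layout: (lowest delta, highest delta, first lead byte, trail bytes).
# Lead-byte blocks are disjoint and increase with the delta, so the lead byte
# alone fixes both the length and the relative order of an encoded delta.
RANGES = [
    (-1114111, -1056833, 0x0F, 3),
    (-1056832, -8257, 0x10, 2),
    (-8256, -65, 0x20, 1),
    (-64, 63, 0x40, 0),          # small deltas: a single byte 0x40..0xBF
    (64, 8255, 0xC0, 1),
    (8256, 1056831, 0xE0, 2),
    (1056832, 1114111, 0xF0, 3),
]


def encode_delta(delta):
    """Encode one signed delta as a lead byte plus 0 to 3 trail bytes."""
    for lo, hi, lead_base, n_trail in RANGES:
        if lo <= delta <= hi:
            v = delta - lo                      # offset within this range
            lead = lead_base + (v >> (8 * n_trail))
            low_bits = v & ((1 << (8 * n_trail)) - 1)
            return bytes([lead]) + low_bits.to_bytes(n_trail, "big")
    raise ValueError("delta out of range: %d" % delta)


def encode(text, initial_base=0):
    """Encode a string as a sequence of order-preserving delta bytes.

    Assumption: the base code point is initial_base for the first character
    and the previous code point for every later character.
    """
    out = bytearray()
    base = initial_base
    for ch in text:
        code_point = ord(ch)
        out += encode_delta(code_point - base)  # signed delta
        base = code_point
    return bytes(out)


if __name__ == "__main__":
    fields = ["b", "ab", "a", "Zebra", ""]
    # Sorting by encoded bytes gives the same order as sorting by code points.
    print(sorted(fields, key=encode) == sorted(fields))
    print(encode("abc").hex())
```

In this sketch, neighbouring characters in the same alphabet usually differ by a small delta and therefore occupy one byte each, while a jump to a distant script costs two to four bytes once and then returns to short encodings.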