The present invention relates to compression and encoding. In particular, the present invention relates to compression of Unicode strings.
In computing environments, it is desirable to represent symbols and characters found in human language as hexadecimal values. The collection of hexadecimal values needed to represent an entire language is known as a character set. In order to accommodate all of the symbols found in the various languages throughout the world, the computing community developed the Unicode character set that uses two bytes to represent any one character or symbol.
In order to allow portability of computer objects, it is preferred that computer programs that manipulate language characters be written to manipulate Unicode characters. Although the Unicode standard increases the portability of computer programs, it is incompatible with certain communication channels that allow only certain byte values to be transmitted. In particular, it is incompatible with channels that require each byte of channel information to fall within the printable range of the American Standard Code for Information Interchange (ASCII), which is limited to values between 20 hexadecimal and 7E hexadecimal.
To overcome this incompatibility, the art has developed encoding methods for converting arbitrary bytes into printable ASCII characters. One such method is known as UUencode and involves dividing the stream of data into three-byte sequences and then dividing each three-byte sequence into four six-bit values. Two "0" bits are placed in front of each of the four six-bit values to produce eight-bit values, and "20" hexadecimal is then added to each of them. This produces four eight-bit values that are between "20" hexadecimal and "5F" hexadecimal. Note that under UUencoding, each three-byte sequence is converted into a four-byte sequence, resulting in a thirty-three percent increase in the size of the data. Thus, three Unicode characters, which are together represented by six bytes (two three-byte sequences), would be converted into eight bytes of UUencoded data.
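The three-byte to four-byte conversion described above can be sketched as follows. This is a minimal illustration, not a complete UUencode implementation: the function names `uu_encode_group` and `uu_encode_body` are hypothetical, and zero-padding of a trailing partial group is an assumption made only to keep the example short.

```python
def uu_encode_group(group: bytes) -> bytes:
    """Convert one three-byte sequence into four printable bytes."""
    if len(group) != 3:
        raise ValueError("expected exactly three bytes")
    # Treat the three bytes as one 24-bit number, then cut it into
    # four six-bit values, most significant first.
    n = int.from_bytes(group, "big")
    six_bit_values = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Each six-bit value (0..63) plus 0x20 lands in the range 0x20..0x5F.
    return bytes(v + 0x20 for v in six_bit_values)


def uu_encode_body(data: bytes) -> bytes:
    """Encode a byte string group by group (hypothetical helper).

    Assumption: a trailing partial group is padded with zero bytes.
    """
    padded = data + b"\x00" * (-len(data) % 3)
    return b"".join(
        uu_encode_group(padded[i:i + 3]) for i in range(0, len(padded), 3)
    )


# Three two-byte Unicode characters occupy six bytes before encoding.
raw = "abc".encode("utf-16-be")
encoded = uu_encode_body(raw)
print(len(raw), len(encoded))
```

Practical UUencode implementations also prefix each output line with a length character and commonly substitute a backquote for the space character; both details are omitted from this sketch.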
To overcome this increase in data size, some prior art systems have used compression algorithms to compress the UUencoded data. Typically, the compression algorithms reduce the size of the data by finding identical bit sequences in the data. For each set of identical bit sequences, the compression algorithm keeps the first bit sequence and replaces the other matching sequences with a value indicating the length of the replaced sequence and a value indicating the location of the first bit sequence that matches the replaced sequence.
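The replacement scheme described above can be sketched as follows. This is a simplified illustration that matches whole bytes rather than arbitrary bit sequences, and it is not the algorithm of any particular prior art system. The token layout, the `min_len` threshold, and the names `compress` and `decompress` are assumptions made for the example.

```python
def compress(data: bytes, min_len: int = 3) -> list:
    """Replace repeated byte runs with (location, length) references.

    Tokens are ("lit", byte) for a kept byte, or ("ref", location, length)
    for a run that duplicates earlier data starting at `location`.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_pos = 0, -1
        # Search all earlier positions for the longest match that ends
        # before the current position (no overlapping copies).
        for j in range(i):
            k = 0
            while i + k < len(data) and j + k < i and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len >= min_len:
            tokens.append(("ref", best_pos, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def decompress(tokens: list) -> bytes:
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, pos, length = token
            out += out[pos:pos + length]
    return bytes(out)


print(compress(b"abcabcabc"))  # repeated runs become references
print(compress(b"abcd"))       # no repetition, so only literals remain
```

The second call shows why such a scheme has little to work with on short inputs: when no sequence repeats, every byte is kept as a literal and nothing is saved.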
Although such compression algorithms work well with large strings of data, they do not work well with short data strings because the redundancy of short data strings is low. In fact, in some cases, compression algorithms can cause the data to increase in size due to overhead data that must be added to the compressed data string to indicate the type of compression that was performed.
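The effect of overhead on short strings can be observed with a general-purpose compressor. The sketch below uses Python's standard `zlib` module purely as a stand-in; it is an assumption that its behavior is representative of the compression algorithms discussed here, and the helper name `size_change` is hypothetical.

```python
import zlib


def size_change(data: bytes) -> int:
    """Return compressed size minus original size (positive means growth)."""
    return len(zlib.compress(data)) - len(data)


# Three Unicode characters stored as two bytes each: six bytes in total.
short_string = "abc".encode("utf-16-be")

# The same six bytes repeated many times: long and highly redundant.
long_string = short_string * 1000

print("short:", len(short_string), "bytes, change", size_change(short_string))
print("long: ", len(long_string), "bytes, change", size_change(long_string))
```

The short input grows because the compressed stream carries a header and a checksum regardless of input length, while the long, repetitive input shrinks substantially.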