In the field of data processing, an 8-bit byte is the traditional unit of computer data. Typically, individual characters in a file or data set are separately denoted and stored as single bytes. Commonly known and used single-byte character formats include American Standard Code for Information Interchange (ASCII) and Extended Binary Coded Decimal Interchange Code (EBCDIC).
There are, however, problems which have arisen due to limitations of the 8-bit byte character format. First, with only 8 bits per character, only 256 (2 to the 8th power) different characters can be represented. While 256 characters is generally sufficient for languages such as English, it is generally insufficient for languages such as Japanese (written with Kanji) or Chinese. Second, 8 bits (256 characters) is generally insufficient to represent a combined language environment, such as English characters intermixed with math symbols and/or control characters.
Furthermore, software vendors have been forced to make "localized software" when distributing software in multiple countries because 256 characters is generally insufficient to support all of the different characters needed for all of the different languages of those countries. Completion, maintenance and support of localized software can be a tremendous undertaking. Thus, the typical single-byte character formats are inadequate in an increasingly complex global computing environment.
For these reasons, a new character format or standard has emerged known as Unicode. As is well known and is explained in "The Unicode Standard, Worldwide Character Encoding" Version 1.0, Volume One, Copyright 1990, 1991 Unicode, Inc., Unicode is a fixed-width, uniform text and character encoding scheme utilizing a 16-bit architecture which extends the benefits of ASCII to multilingual text. Unicode characters are consistently 16 bits wide, regardless of language, so no escape sequence or control code is generally required to specify any character in any language. Unicode character encoding treats symbols, alphabetic characters, and ideographic characters identically, so that they can be used simultaneously and with equal facility.
Because there are 16 bits per character, it is possible to represent up to 65,536 (2 to the 16th power) different characters with Unicode. The Unicode standard currently contains over 28,000 characters, including 2,300 general (alphabetic or syllabic) letters, 1,200 textual symbols, and 3,300 CJK (Chinese/Japanese/Korean) phonetics, punctuation, symbols, Korean Hangul syllables and over 20,000 Han characters.
The Unicode format, as stated above, utilizes 16 bits for each character represented. Referring now to FIG. 1, the format of a single generic Unicode character of 16 bits is shown. The first half (first 8 bits), or prefix, of each character represented in Unicode is an indicator of the group (e.g., math symbol, Kanji, English, etc.) of the particular character being represented. The second half (second 8 bits), or suffix, of each character represented in Unicode indicates which particular character within the indicated group is being represented.
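The prefix and suffix split described above can be illustrated with a minimal Python sketch. The function name `split_unicode_char` is hypothetical and is introduced only for illustration; the sketch assumes the 16-bit fixed-width view of Unicode described here and simply rejects any character whose code does not fit in 16 bits.

```python
def split_unicode_char(ch: str) -> tuple[int, int]:
    """Return (prefix, suffix) for one 16-bit character.

    prefix: upper 8 bits (the group indicator in the description above)
    suffix: lower 8 bits (the character within that group)
    Hypothetical helper, not part of any standard library.
    """
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character does not fit in 16 bits")
    return code >> 8, code & 0xFF


if __name__ == "__main__":
    # A Latin letter, a Greek letter, and a Han character.
    for ch in ["A", "\u03c0", "\u6f22"]:
        prefix, suffix = split_unicode_char(ch)
        print(f"U+{ord(ch):04X}: prefix=0x{prefix:02X} suffix=0x{suffix:02X}")
```

For the letter "A" (code 0x0041), the prefix is 0x00 and the suffix is 0x41, which is the same value the character has in ASCII.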
Unicode can therefore more easily represent a variety of characters in a single document or file without requiring specialized or localized software. However, the storage overhead of Unicode data is, by definition, larger than with 8-bit character formats because Unicode data uses 16 bits per character. Thus, documents or data files stored in the Unicode format are generally twice as large as would be the same documents or data files represented in ASCII, for example. There is therefore a need to reduce the increased size of Unicode files while still retaining the ability to represent the range of characters supported by the Unicode character format.
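The size doubling can be checked with a short Python sketch. It assumes that Python's `utf-16-be` codec is an acceptable stand-in for the fixed-width 16-bit format discussed here, which holds for the plain English sample text used below; the sample text itself is arbitrary.

```python
text = "Hello, world"

# One byte per character in ASCII.
ascii_bytes = text.encode("ascii")

# Two bytes per character in the 16-bit format (big-endian, no byte-order mark).
# Assumption: "utf-16-be" stands in for the fixed-width 16-bit encoding.
unicode_bytes = text.encode("utf-16-be")

print(len(ascii_bytes), len(unicode_bytes))  # 12 24
```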
Typical compression methodologies handle uncompressed data on a byte-by-byte basis. Compressing data on a byte-by-byte basis generally works well for data which is comprised of characters stored in a single-byte-per-character format. Referring now to FIG. 7, an example compression method which is well known in the art processes an uncompressed input data stream 10 to generate a compressed data output stream 20 by comparing an uncompressed portion 13 of input data stream 10 to data in a history buffer 11 of already processed input data. If a matching data string 12 is located in history buffer 11 for current data string 14, data string 14 is encoded in compressed data stream 20 as a pointer (p.sub.o,l.sub.o) 24, corresponding to an offset p.sub.o 15 and a data length l.sub.o 16. The shorter length data of pointer (p.sub.o,l.sub.o) 24 thus replaces longer data string 14 in output compressed data stream 20.
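A history-buffer compression method of this general kind can be sketched in Python as follows. This is a simplified illustration and not the method of FIG. 7 itself: the function names, the 4096-byte window, the minimum match length of 3, and the convention of measuring the offset backward from the current position are all assumptions made for the sketch, and the output is a list of tokens rather than a packed bit stream.

```python
def find_longest_match(data: bytes, pos: int, window: int = 4096):
    """Search the history buffer (bytes before pos) for the longest
    string matching the data that starts at pos.
    Returns (offset, length); offset is the distance back from pos."""
    best_offset, best_length = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - cand, length
    return best_offset, best_length


def compress(data: bytes, min_match: int = 3):
    """Emit literal byte values and (offset, length) pointers."""
    tokens, pos = [], 0
    while pos < len(data):
        offset, length = find_longest_match(data, pos)
        if length >= min_match:
            tokens.append((offset, length))  # pointer replaces the string
            pos += length
        else:
            tokens.append(data[pos])         # literal byte
            pos += 1
    return tokens


def decompress(tokens):
    """Rebuild the original bytes from literals and pointers."""
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            offset, length = token
            for _ in range(length):
                out.append(out[-offset])     # byte-wise copy allows overlap
        else:
            out.append(token)
    return bytes(out)


if __name__ == "__main__":
    tokens = compress(b"abcabcabc")
    print(tokens)  # [97, 98, 99, (3, 6)]
    print(decompress(tokens))
```

In this sketch the six-byte string "abcabc" is replaced by the single pointer (3, 6), which is the substitution of a short pointer for a longer data string described above.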
Unfortunately, such prior art compression approaches do not work as well with Unicode data because each character in the Unicode format is comprised of two bytes. One problem is the greater time needed to compress Unicode data. This is because Unicode data comprises suffix character data interspersed with prefix group indicators and hence, in general, more bytes have to be scanned in order to find each match.
Another problem is that the resulting length (l) and offset (p) values are generally double what they would be for the equivalent matching characters in a non-Unicode data format. Larger length (l) or offset (p) values generally take more bits to encode, so doubling either value results in a decreased compression ratio, an undesirable side effect.
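The doubling of the length and offset values can be seen in a small Python sketch. The helper `longest_match` is hypothetical, the offset is assumed to be measured backward from the current position, and `utf-16-be` is assumed as a stand-in for the 16-bit format. The sample text is chosen so that the match runs to the end of the data, which gives an exact factor of two.

```python
def longest_match(data: bytes, pos: int):
    """Return (offset, length) of the longest earlier match for data[pos:].
    Hypothetical helper; offset is the distance back from pos."""
    best = (0, 0)
    for cand in range(pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best[1]:
            best = (pos - cand, length)
    return best


text = "abcd-abcd"
start = 5  # character index of the second "abcd"

narrow = text.encode("ascii")      # one byte per character
wide = text.encode("utf-16-be")    # two bytes per character (assumed stand-in)

print(longest_match(narrow, start))     # (5, 4)
print(longest_match(wide, 2 * start))   # (10, 8)
```

The same four matching characters produce the pointer (5, 4) in the single-byte data and (10, 8) in the two-byte data.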
A still further problem is the increased difficulty in finding matching strings when, because of where the previous matching string ended, the current data string to be matched begins at a break between a prefix group indicator and its associated suffix character. In that situation, one is no longer merely trying to match a character and its associated prefix group code with an earlier character and its associated prefix group code. Instead, one is trying to match a character and the following prefix group code with an earlier occurrence of the same character and the same following prefix group code. There is less likelihood of finding such a match, and so this too results in a decreased compression ratio.
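How a match can end at such a break may be illustrated with a short Python sketch. The helper `longest_match`, the sample text, and the use of `utf-16-be` as a stand-in for the 16-bit format are assumptions made for illustration only.

```python
def longest_match(data: bytes, pos: int):
    """Return (offset, length) of the longest earlier match for data[pos:].
    Hypothetical helper; offset is the distance back from pos."""
    best = (0, 0)
    for cand in range(pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best[1]:
            best = (pos - cand, length)
    return best


text = "the cat and the dog"
wide = text.encode("utf-16-be")      # two bytes per character (assumed stand-in)

pos = 2 * text.index("the", 1)       # byte position of the second "the"
offset, length = longest_match(wide, pos)
print(offset, length)                # 24 9

# The match covers "the " (8 bytes) plus the prefix byte of the next
# character, so the next string to be matched starts on a suffix byte.
next_pos = pos + length
print(next_pos % 2)                  # 1
```

Because the prefix byte of "d" equals the prefix byte of "c", the match absorbs it and has an odd length of 9. The next string to be matched therefore begins with the suffix of "d" followed by the prefix of "o", which is the situation described above.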
Thus, an improved compression methodology is needed to handle the larger data files of the two-byte-per-character Unicode format.