1. Technical Field
The present invention relates to a method and system for data compression in general, and in particular to a method and system for compressing data within a data processing system. Still more particularly, the present invention relates to a method and system for compressing UNICODE data within a data processing system.
2. Description of the Prior Art
Electronic information processing and transmission are currently experiencing dramatic growth internationally. Consequently, there is a growing need for internationalization and standardization of information coding formats. The well-known ASCII symbol set, which even in its extended eight-bit form can accommodate only 256 possible byte symbols, is incapable of accommodating even a small fraction of all the characters and symbols in use throughout the world. For example, most financial documents originating from the United Kingdom require the ability to retrieve and display the Pound Sterling symbol (£), which is not found in the standard ASCII symbol set. It is thus evident that a much larger symbol set is required to handle documents in Russian, Greek, Arabic, and various Asian languages.
One attempt at symbol coding standardization entails retaining ASCII as the basic operating symbol set and supplementing it with a country-specific code page for each computer system. This approach, however, works well only with computer systems that are intended to be utilized within a few countries, and it will become increasingly less effective as documents and electronic commerce become more international.
A newer approach to solving the above-mentioned problem is to utilize a symbol set known as UNICODE. In essence, the UNICODE symbol set solves the problem of code depletion by allocating two bytes per symbol. For many of the more popular languages, one of the two bytes of a UNICODE symbol serves as a code page specifier and the other byte designates a member of that code page. As a result, the data within a document tends to consist of byte-pairs, with one byte of each byte-pair being the same throughout. This can be illustrated by the following example of a 36-byte fragment of data. The fragment is first listed as it appears on an ASCII text printer, and then in hexadecimal format:
__________________________________________________________________________
Cormack Horspool 1985 dynamic Markov

436F726D 61636B20 486F7273 706F6F6C 20313938 35206479 6E616D69 63204D61 726B6F76
__________________________________________________________________________
In UNICODE, the same fragment of data occupies 72 bytes and will look like:
__________________________________________________________________________
C o r m a c k   H o r s p o o l   1 9 8 5   d y n a m i c   M a r k o v

43006F00 72006D00 61006300 6B002000 48006F00 72007300 70006F00 6F006C00 20003100
39003800 35002000 64007900 6E006100 6D006900 63002000 4D006100 72006B00 6F007600
__________________________________________________________________________
Note that twice as many bytes of data are required by UNICODE. Note also that the two hexadecimal formats differ only in that the UNICODE format has an additional "00" byte inserted after each ASCII byte. This "00" byte typically prints as a blank space on most computer systems, producing the "gaps" seen in the text when it is output by an ASCII text printer as shown above.
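The byte layout described above can be reproduced with a short sketch. It assumes that Python's "utf-16-le" codec (two bytes per symbol, low byte first, no byte-order mark) is an acceptable stand-in for the two-byte UNICODE form shown in the listing; the helper name hex_words and the three-letter Greek sample are illustrative only and do not come from the original text.

```python
def hex_words(data: bytes, width: int = 4) -> str:
    """Format bytes as space-separated groups of `width` bytes, in uppercase hex."""
    return " ".join(data[i:i + width].hex().upper()
                    for i in range(0, len(data), width))

fragment = "Cormack Horspool 1985 dynamic Markov"

ascii_bytes = fragment.encode("ascii")
# Assumption: UTF-16LE stands in for the two-byte UNICODE form.
unicode_bytes = fragment.encode("utf-16-le")

print(len(ascii_bytes), hex_words(ascii_bytes))
print(len(unicode_bytes), hex_words(unicode_bytes))

# Every second byte of the UNICODE form is the same "00" byte,
# and the remaining bytes are exactly the ASCII bytes.
print(set(unicode_bytes[1::2]), unicode_bytes[0::2] == ascii_bytes)

# Illustrative Greek sample: the repeated byte now acts as the
# code page specifier (0x03) instead of 0x00.
greek_bytes = "αβγ".encode("utf-16-le")
print(hex_words(greek_bytes, 2), set(greek_bytes[1::2]))
```

The even-indexed bytes carry the member of the code page and the odd-indexed bytes carry the specifier, which is why one byte of every byte-pair repeats throughout a document written in a single script.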
Most adaptive compression algorithms, such as Lempel-Ziv 1, operate on symbols that are usually one byte in size. The UNICODE symbol set, which is essentially a two-byte symbol set, can prove quite detrimental to these types of data compression algorithms. Although a better compression ratio can be achieved with a larger UNICODE document, the final compression result for the UNICODE version is still worse than that for an ASCII version, even though the information content of the two versions is virtually the same. Consequently, it would be desirable to provide an improved method for compressing UNICODE data within a data processing system such that the compression ratio achieved for a UNICODE version is comparable with that obtained from an ASCII version of the same document.
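The comparison described above can be tried with a minimal sketch. It uses Python's zlib module, whose DEFLATE format combines an LZ77-style matcher with Huffman coding; this is only a convenient byte-oriented stand-in and is not the Lempel-Ziv 1 implementation referred to in the text. The sample text, the function name compare, and the use of "utf-16-le" for the two-byte form are assumptions made for illustration, and the sizes printed will vary with the input and the zlib version.

```python
import zlib

# Hypothetical sample; any ASCII-only document would serve.
SAMPLE = (
    "Electronic information processing and transmission are currently "
    "experiencing dramatic growth internationally. Consequently, there is "
    "a growing need for internationalization and standardization of "
    "information coding formats."
)

def compare(text: str, level: int = 9) -> dict:
    """Return (raw size, compressed size) for the ASCII and two-byte forms."""
    results = {}
    for label, codec in (("ASCII", "ascii"), ("UNICODE", "utf-16-le")):
        raw = text.encode(codec)
        packed = zlib.compress(raw, level)
        # Sanity check: the compression must be lossless.
        assert zlib.decompress(packed) == raw
        results[label] = (len(raw), len(packed))
    return results

for label, (raw_size, packed_size) in compare(SAMPLE).items():
    print(f"{label:8s} raw={raw_size:5d} compressed={packed_size:5d}")
```

Running the sketch on documents of different lengths gives a rough way to observe how closely the compressed size of the two-byte form tracks that of the ASCII form for a given byte-oriented compressor.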