The present invention relates to a data compression and decompression processor and corresponding methods therefor.
Data compression/decompression techniques are well known in the prior art. Examples of prior art approaches are disclosed in U.S. Pat. No. 4,464,650 (the '650 patent), issued in August 1984 to Eastman, and U.S. Pat. No. 4,558,302 (the '302 patent), issued in December 1985 to Welch.
The above patents present implementations of a data compression algorithm described by Lempel and Ziv in "Compression of Individual Sequences Via Variable-Rate Coding," IEEE Transactions on Information Theory, IT-24-5, September 1977, pp. 530-537, and "A Universal Algorithm for Sequential Data Compression," IEEE Transactions on Information Theory, IT-23-3, May 1977, pp. 337-343. These two papers presented the original ideas and principles of compression algorithm. However, no implementation method was given.
Data compression algorithms are known as a discipline which design the shortest representation of the original data string. Reversible data compression implies that a given data string being compressed and then decompressed should be identical to the original data string, and not a single bit of information is allowed to be distorted or lost. This reversibility is identical to the concept of noiseless data compression in the context of information theory. Hence, noiseless data compression is quite different from speech compression and video compression where distortions are allowed. For typical applications of data compression for data being stored or transmitted, noiseless data compression is the only kind of compression which can be used.
The benefits of data compression are tremendous. The benefits of data compression technology to communications have been the increase in bandwidth available to the users and a savings of the cost of the communication device. For example, to transmit 9600 bps digital information over existing telephone lines requires conditioned lines and 9600 bps high-speed modems. If one applied advanced data compression technology with 4:1 compression ratio, one may use 2400 bps modems and unconditioned lines for such transmissions. In this example, the savings is not only the difference in modem cost between 9600 bps modems and 2400 bps modems, also the monthly charges for the conditioned lines. As a further example, a typical branch of a bank requires four leased lines for operation (one for the ATM machine, one for news broadcast, and two for data processing). If a multiplexer has a built-in compressor with 4 to 1 compression ratio, then only one line will be needed to communicate with the outside world, and a savings of monthly charges of three lines can be achieved.
As to back-up systems, data compression technology can greatly reduce the back-up time, increase the capacity and save the back-up device cost. For example, for the high-end 1/2" reel to reel tape back-up systems, typically four tapes are required to be mounted and dismounted from the tape drive system. It normally takes 30 minutes to back up one tape. Due to this sequential process, it becomes a very-time-consuming process to back up computer data.
If a back-up system has a built-in compression capability and compression ratio of 4 to 1, then one may use one tape instead of four tapes for the back-up purpose, and the back-up process becomes an unattended process. The other major benefit of data compression technology for back-up is that the person performing the back-up can simply mount the tape and leave the system unattended. By the same token, data compression technology can increase the capacity of the disk drive, reduce the data access time, and save the disk cost. Again, the savings are especially significant for high-end disk drive systems where multiple heads and multiple platters are required. With the help of data compression technology, only single head and single platter can be used to achieve the same capacity requirement. Due to fewer parts, this approach also improves the reliability of the system. However, these benefits will be realized only if the data compression technology meets each of the following requirements:
1. Its encoding process should be string-oriented rather than character-oriented in order to achieve good compression ratios. PA1 2. The algorithm must be adaptive to the data source so that good compression ratios can be achieved for all data files. This capability is also called "universal" data compression. PA1 3. Its data processing rate must be at least as high as the system data transfer rate so that system throughput will not be degraded. PA1 4. The cost of data compression hardware implementation must be low. Preferably, the implementation of the data compression algorithm on a single VLSI chip is feasible. PA1 5. The integration of the data compression technology into the system must be simple, and the data compression technology cannot cause any system problems. PA1 6. The compression capability must be totally transparent to the users. PA1 1. It is a two-pass algorithm. The first pass involves the statistics collection needed to construct the codetable to be used by the compressor and decompressor. For many real life applications, this is unacceptable. PA1 2. The compression ratio is less than optimal, since the complexity of multiple symbol encoding or string encoding is exponential, and practical implementation of Huffman coding has been limited to character encoding only. PA1 3. The compression ratio is poor. Frequently, the codetable is mismatched to the information source. On the other hand, the implementation of adaptive Huffman coding is cumbersome and complex, and the convergence of the adaptive statistic collection is slow. PA1 4. Since Huffman encoding is a fixed to variable length encoding, the compression algorithm itself is sensitive to error propagation and the decoding process could be slow. PA1 5. The same codetable needs to be stored or transmitted between the compressor and the decompressor. This represents additional overheads for the compression purpose. PA1 1. It is a single-pass encoding technique and statistics collection is not needed. PA1 2. It is string-oriented rather than character-oriented, so it achieves good compression ratios. PA1 3. It is adaptive to the data source so that it becomes a good general purpose data compression algorithm. PA1 4. The decompressor is able to locally generate the replica of the dictionary used by the compressor, based on the compressed strings, so that it is not needed to share the dictionary between the compressors and the decompressors. PA1 1. They are single byte or a fraction of a byte processing. In the string parsing process, the hashing is done based on the next extended character and the content of the hash address. Hence, it is impossible to perform multi-character processing due to the sequential nature of the hashing process, and it is impossible to perform parallel processing based on these existing approaches. PA1 2. They are sequential file compression. In other words, data blocks are chained together sequentially due to the use and construction of the dictionary. For example in order to read the 50th block of a 1 megabyte file, one needs to decompress the file from the first character until the 50th block is reached. The sequential file decompression implies that these techniques lack the random block access capability. This is absolutely unacceptable for disk random access environment. PA1 3. The N-time hashing technique used in the string-matching process is non-guaranteed. Subsequently the compression ratio suffers and becomes poorer. PA1 4. The dictionary has to be reset at the beginning of a given file. For block-oriented applications, such as synchronous communications, packet switching and disk environment, if a given block is treated as a file, then the dictionary is always reset at the beginning of every block, and the compression ratio suffers greatly. Since the dump dictionary is also partially filled and only shorter words are being stored in the dictionary. It is quite clear that adaptation and efficiency of compression for short blocks are incompatible; the learning cycle necessarily entails the initial construction in the dictionary of short strings peculiar to the incoming data, during which time even shorter strings are transmitted. It is not until long strings have been constructed that the system achieves high compression efficiency. PA1 5. During the decompression processing, data is read in the reverse order from a stack. This also implies that numerous RAM cycles are needed just to decompress data.
Currently, the most popular noiseless data compression method being used in the industry is the traditional Huffman coding or its modified versions. Huffman coding is a two-pass coding method. The first pass performs statistics collection of the data source; the second pass constructs a codetable based on the statistics collected. The underlying principle is to replace the more frequent characters by shorter codewords and less frequent characters by longer codewords. This coding method normally provides poor compression ratios because it is character encoding rather than string encoding.
To elaborate this point, consider encoding the more frequent words by shorter codewords and less frequent words by longer words. It is intuitive that the compression ratios to be achieved will be even better than compression ratios achieved by character encoding. For example, it is known that as the English word "the" is the most frequent word, so that it may be encoded by the shortest codeword. This is what is normally called encoding on the extended information source or multiple symbol encoding.
In general, the complexity and memory requirement of Huffman coding based on extended symbols (multiple symbols) source grow exponentially. For example, if the symbol size in the Huffman coding system is a byte, then the codetable requires 256 entries for character encoding. If the symbol size in the Huffman coding system is two bytes, then the codetable requires 65,536 entries for two byte encoding. For the particular example mentioned earlier, three-character encoding for "the" requires 16,777,216 possible entries for the codetable. Hence, practical implementation of Huffman coding has always been limited to character encoding only.
Typically, Huffman coding requires the same codetable to be shared or transmitted between the sender's and receiver's ends for successful compression and decompression. In practice, several classes of codetables are preassigned for data compression and decompression purposes. The compressor/decompressor switches to different codetables according to the closest match between the data source and the codetable. The resulting compression ratios are further deteriorated due to the mismatch between the source statistics and the codetable being used. Although it is possible to construct an adaptive version of Huffman coding, the implementation is cumbersome and the improvement on compression ratios is limited.
Since Huffman coding is a fixed to variable coding technique, the decompressor parses the compressed strings on a bit-by-bit basis until a legal codeword has been recognized. Because the length of the legal codeword is not known beforehand, the decompression process could be slow. The principle of Huffman coding is described in the article of D. A. Huffman entitled "A Method for the Construction of Minimum Redundancy Codes," Proceedings of IRE 40, pp. 1098-1100 (September 1952). In summary, Huffman coding suffers from several limitations:
The first major effort to alleviate these limitations was the disclosure of the Lempel-Ziv algorithm identified above. Basic principles of Lempel-Ziv algorithm were described in the two papers cited above. The basic principles of Lempel-Ziv algorithm work as follows:
The compressor first prepares for a dictionary where variable length dictionary words will be stored. Initially, the dictionary could be empty or partially filled. Then the incoming string will be parsed and examined on a character by character basis, so that the longest match between the input string and the words in the dictionary will be identified. This is also called "greedy" parsing. As soon as the greedy parsing is completed, the extended substring of the longest match by one character will be entered into the dictionary as a new word, and the index of the word in the dictionary with a match to the longest substring of the incoming string will be output as a compressed word.
The decompressor will also prepare for a dictionary with the same principle to build and fill the dictionary as the compressor does. By examining the received indexes, the decompressor reconstructs the dictionary and performs the decoding process without physically storing or transmitting the actual contents of the dictionary from the compressor to the decompressor.
The preeminent features of the Lempel-Ziv algorithm are its ability to adapt to any type of incoming data source, and it extracts the longer and more frequent substrings from the data source as words stored in the dictionary with virtually no limitation on the length of the word as long as the local memory can accommodate them.
Indirectly, the Lempel-Ziv algorithm is performing dictionary construction on the fly and carrying out data compression in the sense of extended source encoding or string encoding. For example, if the word "international" is stored in the dictionary being matched to the incoming string and the size of the dictionary is 4,096, then only 12 bits are required to represent 13 byte word "international." Hence, a tremendous amount of compression has been achieved.
The Lempel-Ziv algorithm provides distinct advantages over Huffman coding for the following reasons:
The '650 and '302 patents identified above provide two different implementation methodologies for the Lempel-Ziv algorithm. In the implementation of Lempel-Ziv algorithm, two key components are the data structure used to store the input characters/ substrings for the purpose of dictionary construction and hashing function used to perform the string-matching algorithm. In the '650 patent, the data structure is based on tree-structures, and data is processed two bits (or a fraction of a byte) at a time for a reasonable memory requirement, but the processing speed is slow. In the '302 patent, the data structure is based on a large hash array structure, and data is processed one byte at a time, and the processing speed has been somewhat improved. Both patents cover only the implementation of the compression/decompression apparatus but not applications.
The '650 patent made the first attempt to make the Lempel-Ziv algorithm practical and realizable in implementation. However, since the data structure is based on Tree structure, the compression apparatus only processes a fraction of a byte at a time, and the compression process requires numerous RAM cycles. Furthermore, the algorithm itself utilizes complex mathematics formulas and arithmetic operations such as multiply and divide.
The '302 patent made further progress on the practical implementation of the Lempel-Ziv algorithm. In essence, the compression processing is done one character (one byte) at a time. More importantly, the storing strings of input data character signals comprise prefix code signals and the extended character only. The storing strings in the dictionary are compared to the input data stream to determine the longest match.
It will be helpful to re-examine the six requirements for data compression technology to be useful for system applications. The Lempel-Ziv algorithm and the '650 and '302 patents only meet the first two requirements.
Notice that the compression throughput requirement varies according to applications. For example, high-speed modem applications require no more than 20 kilobits/sec. On the other hand, for today's typical peripheral applications, the Small Computer System Interface (SCSI) II requires data transfer rate up to ten megabytes/sec for the synchronous operation. Currently, the SCSI data transfer has become a common standard for peripheral applications.
As an example, assume that the nominal clock rate is 40 megahertz. This implies that the compressor and decompressor based on prior arts will have to process data at one character (byte) per four clock cycles. This also assumes extremely high-speed, expensive memories are being used in order to keep up with the fast clock rate. Hence, there will be very stringent compression throughput requirements for real-life peripheral applications. Besides, current bus structures being used in most computer systems are either 16-bit or 32-bit wide. This implies that more than a byte of information will be available at any given cycle time. Therefore, parallel data compression processing will be needed to meet such requirements in order to achieve high-speed throughput rate memory implementation cost to be manageable.
The other possibility to achieve high speed implementation based on prior arts is to de-interleave the input data string and use multiple 8-bit oriented compressors for compression processing. However, this de-interleaving process destroys the input data correlation structure, and hence degrades the compressor ratios significantly.
The application of compression apparatus based on the Lempel-Ziv algorithm and the '650 and '302 patents are limited in terms of their applications due to the following reasons:
In view of the foregoing, it would be very desirable to provide an improved data compression/decompression processor and corresponding methods which overcome the deficiencies and limitations of prior art approaches.