The present invention relates to data compression (i.e., creation of compressed data from uncompressed data) and decompression (i.e., recovery of the uncompressed data from the compressed data).
Data compression systems are known in the prior art that compress a stream of digital data signals (uncompressed bits) into compressed digital data signals (compressed bits), which require less bandwidth (fewer bits) than the original digital data signals, and that decompress the compressed digital data signals back into the original data signals or a close approximation thereof. Lossless data compression systems decompress the compressed digital data signals back into the original data signals exactly. Thus, lossless data compression refers to any reversible process that converts data into an alternative data form that requires less bandwidth, i.e., has fewer bits, than the data converted, such that the original data can be recovered exactly.
Accordingly, the objective of data compression systems is to effect a savings in the amount of storage required to hold the data or the amount of time (or bandwidth) required to transmit the data. By decreasing the required space for data storage or the required time (or bandwidth) for data transmission, data compression results in monetary and resource savings.
A compression ratio is defined as the ratio of the length of the data in the alternative data form (compressed data) to the length of the data originally (original data). Thus defined, the smaller the compression ratio, the greater will be the savings in storage, time, or bandwidth.
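The definition above can be illustrated with a minimal sketch. The function name `compression_ratio` is hypothetical, and Python's standard `zlib` module is used only as a stand-in compressor to produce some compressed data; it is not the method described in this document.

```python
import zlib


def compression_ratio(original: bytes, compressed: bytes) -> float:
    """Length of the compressed data divided by length of the original data.

    Smaller values mean greater savings, matching the definition in the text.
    """
    if not original:
        raise ValueError("original data must be non-empty")
    return len(compressed) / len(original)


# Highly redundant input, so a general-purpose compressor does well on it.
original = b"abc" * 1000
compressed = zlib.compress(original)  # stand-in compressor (assumption)

ratio = compression_ratio(original, compressed)
savings = 1.0 - ratio  # fraction of storage or transmission saved
```

Under this definition a ratio of 0.5 corresponds to a 50% saving, and a ratio above 1.0 would mean the "compressed" form is larger than the original.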
If physical devices such as magnetic disks or magnetic tape are utilized to store the data, then a smaller space is required on the device for storing the compressed data than would be required for storing the original data, thereby, e.g., utilizing fewer disks or tapes for storage. If telephone lines, satellite links or other communications channels are utilized for transmitting digital information, then lower costs, i.e., shorter transmission times and/or smaller bandwidths, result when compressed data is employed instead of original data.
Data compression systems can be made particularly effective if the original data contains redundancies such as symbols or strings of symbols appearing with high frequency. In fact, redundancies in the original data are a requirement for lossless data compression. A data compression system operating on original data containing redundancies may, for example, transform multiple instances of a symbol, or transform a string of symbols, in the original data into a more concise form, such as a special symbol or group of symbols indicating multiple occurrences of the symbol, or indicating the string of symbols, and thereafter translate or decompress the concise form back into the multiple instances of the symbol, or back into the string of symbols.
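One simple instance of transforming multiple instances of a symbol into a more concise form is run-length encoding. The sketch below is illustrative only; the names `rle_encode` and `rle_decode` and the (symbol, count) pair representation are assumptions, not something specified in this document.

```python
from itertools import groupby


def rle_encode(data: str) -> list[tuple[str, int]]:
    """Replace each run of a repeated symbol with a (symbol, run length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(data)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand each (symbol, run length) pair back into the repeated symbol."""
    return "".join(symbol * count for symbol, count in pairs)


pairs = rle_encode("aaabccdddd")
# pairs == [("a", 3), ("b", 1), ("c", 2), ("d", 4)]
restored = rle_decode(pairs)
```

Data with no repeated runs gains nothing from this transformation, which is consistent with the statement that redundancy is a requirement for lossless compression.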
For example, it may be desirable to transmit the contents of a daily newspaper via a satellite link or other communications link to a remote location for printing. Appropriate sensors within a data compression system may convert the contents of the newspaper into a data stream of serially occurring characters for transmission via the satellite link. If the millions of bits comprising the contents of the daily newspaper were compressed before transmission and decompressed at the receiver, a significant amount, e.g., 50% or more, of transmission time (or bandwidth) could be saved. As a further example, when an extensive database such as an airline reservation database or a banking system database is stored for archival or backup purposes, a significant amount of storage space, e.g., 50% or more, can be saved if the database files are compressed prior to storage and decompressed when they are retrieved from storage.
To be of practical and general utility, a digital data compression system should satisfy certain criteria. Specifically, one criterion is that the system should provide high performance, i.e., high compression and decompression rates, with respect to the data rates in the communications channel being utilized, be it a data bus, a wired network, a wireless network or the like. In other words, data transmission rates seen by a sender of uncompressed data and a receiver of the uncompressed data should not be reduced as a result of compression/decompression overhead. In fact, effective data rates may be significantly increased over slow communications channels if the original data is compressed preceding transmission and decompressed following transmission, because there is less compressed data to transmit than there would have been original data, and so more original data can be transmitted per unit time.
The rate at which data can be compressed (i.e., the compression rate) is the rate at which the original data can be converted into compressed data and is typically specified in millions of bytes per second (megabytes/sec). The rate at which data can be decompressed (i.e., the decompression rate) is the rate at which compressed data can be converted back into original data. High compression rates and high decompression rates are necessary to maintain, i.e., not degrade, data rates achieved in present day disk, tape and communication systems, which typically exceed one megabyte/sec. Thus, practical data compression systems must typically have compression and decompression rates matching or exceeding some application-dependent threshold, e.g., one megabyte/sec.
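The relationship between channel rate, compression ratio, and compression/decompression rates can be sketched with a simple model. The model is an assumption made for illustration: the compressor, the channel, and the decompressor are taken to run as a pipeline, all rates are expressed in original (uncompressed) megabytes per second, and the slowest stage sets the overall rate. The function name `effective_rate` is hypothetical.

```python
def effective_rate(channel_rate: float, ratio: float,
                   compress_rate: float, decompress_rate: float) -> float:
    """Original-data throughput of a compress -> channel -> decompress pipeline.

    channel_rate: compressed megabytes/sec the channel can carry.
    ratio: compressed length / original length (smaller is better).
    compress_rate, decompress_rate: original megabytes/sec each stage handles
    (an assumed unit convention).
    """
    if ratio <= 0:
        raise ValueError("compression ratio must be positive")
    # The channel moves channel_rate / ratio megabytes of original data per second.
    return min(compress_rate, channel_rate / ratio, decompress_rate)


# Fast compressor and decompressor: a 1 MB/s channel behaves like a 2 MB/s channel.
fast = effective_rate(channel_rate=1.0, ratio=0.5,
                      compress_rate=5.0, decompress_rate=5.0)   # 2.0

# Slow compressor: the sender now sees less than the raw 1 MB/s channel rate.
slow = effective_rate(channel_rate=1.0, ratio=0.5,
                      compress_rate=0.8, decompress_rate=5.0)   # 0.8
```

The second case shows why compression and decompression rates must at least match the channel rate: otherwise compression degrades the data rate instead of improving it.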
The performance of prior art data compression systems is typically limited by the speed of the random access memories (RAM) and the like utilized to store statistical data and guide the compression and decompression processes. High performance compression rates and decompression rates for a data compression system can thus be characterized by the number of memory cycles (read and write operations) required per input character into or out of the data compression system. Fewer memory cycles per input character lead to higher performance compression rates and decompression rates.
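As a rough worked example of this characterization, assume that memory access is the only bottleneck and that every memory cycle takes the same time. Both are simplifying assumptions, and the 100 ns cycle time below is a hypothetical figure, not one taken from this document.

```python
def max_symbol_rate(cycles_per_symbol: int, cycle_time_s: float) -> float:
    """Upper bound on symbols processed per second when memory is the bottleneck."""
    if cycles_per_symbol <= 0 or cycle_time_s <= 0:
        raise ValueError("cycles_per_symbol and cycle_time_s must be positive")
    return 1.0 / (cycles_per_symbol * cycle_time_s)


# Hypothetical 100 ns memory cycle.
rate_4_cycles = max_symbol_rate(cycles_per_symbol=4, cycle_time_s=100e-9)
rate_2_cycles = max_symbol_rate(cycles_per_symbol=2, cycle_time_s=100e-9)
# Halving the cycles per symbol doubles the bound on the symbol rate.
```

With one byte per symbol, four cycles per symbol at 100 ns each bounds the rate at about 2.5 megabytes/sec under these assumptions.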
Another important criterion in the design of a data compression and decompression system is compression effectiveness. Compression effectiveness is characterized by the compression ratio of the system, i.e., a smaller compression ratio indicates greater compression effectiveness. However, in order for data to be compressible using a lossless data compression system, the data to be compressed must contain redundancies. As a result, the compression ratio, or compression effectiveness, in a lossless data compression system (and to a lesser degree in a lossy data compression system) is a function of the degree of redundancy in the data being compressed. The compression effectiveness of any data compression system is also affected by how effectively the data compression system exploits, for data compression purposes, the particular forms of redundancy in the original data.
In typical computer stored data, e.g., arrays of integers, text, programs or the like, redundancy occurs both in the repetitive use of individual symbols, e.g., digits, bytes or characters, and in frequent recurrence of symbol sequences, such as common words, blank record fields, and the like. An effective data compression system should respond to both types of redundancy.
A further criterion important in the design of data compression and decompression systems is that of adaptability. Many prior art data compression procedures require prior knowledge of the statistics of the data being compressed. Some prior art procedures adapt to the statistics of the data as it is received, i.e., are adaptive data compression systems, and others do not, i.e., are non-adaptive data compression systems. Where prior art procedures do not adapt to the statistics of the data as it is received, compression effectiveness is reduced, but where such procedures do adapt to the statistics, an inordinate degree of complexity is required in the data compression system. An adaptive data compression system may be utilized over a wide range of information types, which is typically the requirement in general purpose computer facilities, while a non-adaptive data compression system operates optimally only on data types for which the non-adaptive data compression system is optimized. Thus, it is desirable that the data compression system achieves small compression ratios without prior knowledge of the data statistics, i.e., that the data compression system is adaptive. Many data compression systems currently available are not adaptive and so cannot be utilized to achieve small compression ratios over a wide range of data types.
General purpose data compression procedures are known in the prior art that either are or may be rendered adaptive, two relevant procedures being the Huffman method and the Tunstall method. The Huffman method is widely known and used, reference thereto being had in an article by D. A. Huffman entitled “A Method for the Construction of Minimum Redundancy Codes”, Proceedings IRE, 40:10, pp. 1098-1100 (September 1952). Further reference to the Huffman procedure may be had in an article by R. Gallagher entitled “Variations on a Theme by Huffman”, IEEE Information Theory Transactions, IT-24:6, (November 1978). Adaptive Huffman coding maps fixed length sequences of symbols into variable length binary words. Adaptive Huffman coding suffers from the limitation that it is not efficacious when redundancy exists in input symbol sequences which are longer than the fixed sequence length the procedure can interpret. In practical implementations of the Huffman procedure, the input sequence lengths rarely exceed 12 bits due to RAM costs and, therefore, the procedure generally does not achieve small compression ratios. Additionally, the adaptive Huffman procedure is complex and often requires an inordinately large number of memory cycles for each input symbol. Thus, the adaptive Huffman procedure tends to be undesirably cumbersome, costly, and slow, thereby rendering the process unsuitable for most practical present day installations.
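The mapping of fixed length input units into variable length binary words can be illustrated with a minimal sketch of static (non-adaptive) Huffman code construction. This is the classic two-least-frequent merge built on Python's `heapq`, shown only to make the idea concrete; it is not the adaptive variant discussed above, it treats each single character as the fixed length input unit, and the name `huffman_code` is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(data: str) -> dict[str, str]:
    """Build a prefix-free code in which frequent symbols get shorter bit strings."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A lone symbol still needs one bit so the output is decodable.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, codes0 = heapq.heappop(heap)  # least frequent subtree
        w1, _, codes1 = heapq.heappop(heap)  # next least frequent subtree
        merged = {s: "0" + c for s, c in codes0.items()}
        merged.update({s: "1" + c for s, c in codes1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


code = huffman_code("aaaabbc")
encoded = "".join(code[s] for s in "aaaabbc")
# "a" (4 occurrences) gets a 1-bit word; "b" and "c" get 2-bit words,
# so the 7 symbols encode into 10 bits.
```

Because each code word here stands for exactly one input character, the sketch cannot exploit redundancy in longer symbol sequences, which is the limitation the paragraph above describes.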
Reference to the Tunstall procedure may be had in the doctoral thesis of B. T. Tunstall entitled “Synthesis of Noiseless Compression Codes”, Georgia Institute of Technology, (September 1967). The Tunstall procedure maps variable length input symbol sequences into fixed length binary output words. Although no adaptive version of the Tunstall procedure is described in the prior art, an adaptive version could be derived which, however, would be complex and unsuitable for high performance implementations. Neither the Huffman nor the Tunstall procedure has the ability to encode increasingly longer combinations of source symbols.
A further adaptive data compression system that overcomes some of the disadvantages of the prior art is that disclosed in U.S. Pat. No. 4,464,650 for APPARATUS AND METHOD FOR COMPRESSING DATA AND RESTORING THE COMPRESSED DATA, issued Aug. 7, 1984 to Cohen. The procedure of Cohen parses the stream of input data symbols into adaptively growing sequences of symbols. The procedure, however, suffers from the disadvantages of requiring numerous RAM cycles per input character and utilizing time consuming and complex mathematical procedures such as multiplication and division to effect compression and decompression. These disadvantages tend to render the Cohen procedure unsuitable for numerous economical high performance implementations.
An even further adaptive data compression system that overcomes some of the disadvantages of the prior art is that disclosed in U.S. Pat. No. 4,558,302 for HIGH SPEED DATA COMPRESSION AND DECOMPRESSION APPARATUS AND METHOD, issued Dec. 10, 1985, to Welch. The procedure of Welch compresses an input stream of data symbols by storing, in a string table, strings of symbols encountered in an input stream. The Welch procedure next searches the input stream to determine the longest match to a stored string of symbols. Each stored string of symbols includes a prefix string and an extension character that is a last character in the string of symbols. The prefix string includes all but the extension character.
When a longest match between the input data stream and the stored strings of symbols is determined, the code signal for the longest match is transmitted as the compressed code signal for the encountered string of symbols and an extended string is stored in the string table. The prefix string of the extended string is the longest match, i.e., the longest stored string of symbols located in the search. The extension character of the extended string is the next input data character signal following the longest match.
Searching through the string table and entering extended strings into the string table are effected by a limited-search hashing procedure. Unfortunately, even the improved data compression system of Welch suffers from less than optimal compression effectiveness, and less than optimal performance. As a result, the Welch procedure, like the Cohen procedure, is unsuitable for many high performance implementations.
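The string-table procedure attributed to Welch above (longest match, emit its code, store the match extended by the next character) can be sketched as follows. This is a simplified illustration under stated assumptions: a Python dictionary stands in for the limited-search hashing procedure, the table starts with all 256 single-byte strings, codes are returned as a list of integers and not as packed bits, and the table is allowed to grow without limit. The function names are hypothetical.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Emit one code per longest match; store each match plus its extension character."""
    table = {bytes([i]): i for i in range(256)}  # assumed initial single-byte entries
    codes: list[int] = []
    current = b""  # longest match found so far
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate                  # keep extending the match
        else:
            codes.append(table[current])         # code for the longest match
            table[candidate] = len(table)        # extended string = prefix + extension char
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Rebuild the same string table while decoding."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code refers to the string that is being defined right now.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid code {code}")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)


codes = lzw_compress(b"ABABABA")
# codes == [65, 66, 256, 258]
```

Note that each stored string grows by only one character per match, so long repeated strings are learned gradually.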
The present invention advantageously improves upon the above-described approaches by providing a lossless data compression (i.e., creation of compressed data from uncompressed data) and decompression (i.e., recovery of the uncompressed data from the compressed data) approach.
In one embodiment, the present invention can be characterized as a method of data compression for transmission over a communications channel. The method comprises the steps of receiving one or more data symbols comprising a current data string; matching a longest previous data string with the current data string; placing, in the event the longest previous data string having been matched is a single symbol, the single symbol into a compressed data stream; placing, in the event the longest previous data string having been matched is a multiple symbol data string, a code word into the compressed data stream, wherein the code word is indicative of the longest previous data string; attempting, in the event a multiple symbol data string is matched, to extend the string by one or more symbols by comparing the symbols following the matched previous data string with the symbols following the current data string whose said code word was placed; placing, in the event the multiple symbol data string can be extended, a string-extension length indicative of the number of symbols that matched; placing a one or two bit code prefix into the compressed data stream, the one or two bit code prefix indicating whether the following bits are said single symbol, said code word, or said string-extension length; and transmitting the compressed data stream through the communications channel.
In accordance with a further aspect of the present invention, a method for decompressing data received over a communications channel is provided. The method comprises the steps of receiving a plurality of bits; determining whether the plurality of bits represents a single symbol, code word, or string-extension length; placing, in the event the plurality of bits represents a single symbol, the single symbol into an output data stream; placing, in the event the plurality of bits represents a code word, a data string defined by the code word into the output data stream; and placing, in the event the plurality of bits represents a string-extension length, an extension string being copied from said output data stream at a symbol following a last symbol of the previous code word processed. The receiving step comprises receiving a code prefix consisting of one or two bits and a plurality of subsequent bits. The determining step comprises determining, using said code prefix, whether said plurality of subsequent bits represents said single symbol, said code word, or said string-extension length.
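The three decompression cases above can be sketched at the token level. This is an illustrative sketch only, built on several assumptions that this section does not specify: the code prefix has already been decoded into a token kind ("symbol", "code", or "extend"), a code word is represented as a hypothetical (start, length) pair pointing into the output produced so far, and symbols are bytes. The bit-level layout of the prefix and code word is deliberately left out, and the name `decompress_tokens` is hypothetical.

```python
def decompress_tokens(tokens) -> bytes:
    """Decode a stream of (kind, value) tokens into the output data stream."""
    out = bytearray()
    next_src = None  # index just past the string the previous code word referred to
    for kind, value in tokens:
        if kind == "symbol":
            out.append(value)                  # single symbol goes straight to the output
        elif kind == "code":
            start, length = value              # assumed code word representation
            for i in range(length):
                out.append(out[start + i])     # copy the previously seen data string
            next_src = start + length
        elif kind == "extend":
            if next_src is None:
                raise ValueError("string-extension length without a preceding code word")
            for i in range(value):
                out.append(out[next_src + i])  # continue copying past the matched string
            next_src += value
        else:
            raise ValueError(f"unknown token kind: {kind!r}")
    return bytes(out)


tokens = [("symbol", b) for b in b"abcdefX"] + [("code", (0, 2)), ("extend", 3)]
result = decompress_tokens(tokens)
# result == b"abcdefXabcde": the code word copies "ab", the extension copies "cde".
```

In this sketch the extension token lets a short code word match be lengthened by several symbols at once, without first storing each intermediate string.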