Digital networks now carry voice, facsimile, data and video signals. With this vast amount of data being exchanged, data compression is used in these networks to increase efficiency by transmitting the signals in a compressed form. The cost and time savings are significant and thus highly desirable. This in turn has resulted in a continual demand for improved data compression algorithms and techniques.
Data compression is an operation in which a signal, e.g., computer data, that requires a certain number of bits is represented, or encoded, using fewer bits overall. The ratio between the number of bits required for representing the original signal and the number of bits required by the encoded signal is generally known as the compression ratio. The complementary process, in which the compressed and encoded signal is expanded and reconstructed to form its original representation, is generally known as decompression, decoding or reconstruction.
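By way of illustration only, the compression ratio defined above may be computed as in the following minimal sketch. The use of Python's standard `zlib` module and the helper name `compression_ratio` are assumptions made for this example and are not taken from any referenced work.

```python
import zlib


def compression_ratio(original: bytes, encoded: bytes) -> float:
    """Bits in the original signal divided by bits in the encoded signal."""
    return (len(original) * 8) / (len(encoded) * 8)


# A highly repetitive sample input (assumed for illustration).
original = b"abcabcabc" * 100

encoded = zlib.compress(original)    # encoding (compression)
decoded = zlib.decompress(encoded)   # decoding (reconstruction)

ratio = compression_ratio(original, encoded)
print(f"{len(original)} bytes -> {len(encoded)} bytes, ratio {ratio:.1f}")
```

A ratio greater than 1 indicates that the encoded signal requires fewer bits than the original; the exact value depends on the input and on the encoder used.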
Data compression techniques have been refined over time and now include primarily two types of compression: lossy and lossless. In a lossy compression system, portions of the data that are determined to be less necessary than others are discarded, making exact reconstruction, or decompression, of the signal impossible. Accordingly, lossy compression is usually employed in connection with signals such as speech, audio, images and video, for which exact reconstruction of the original signal is usually not required for the result to be acceptable. Moreover, since these types of signals are generally destined for human perception, such as by the human auditory or visual senses, minor differences between the original and reconstructed signals may either be undetectable by human senses or result in a slightly degraded signal that can be tolerated.
In contrast, lossless compression enables an exact reconstruction of the original signal upon decompression, without the degraded or compromised characteristics of lossy compression techniques. One of the consequences of employing lossless compression, however, is that the compression ratio, i.e., the ability to compress a large number of data bits into a smaller number of data bits, is greatly reduced. Despite this, for certain types of data it is imperative that the perfect reconstruction of lossless data compression be employed rather than the compromised reconstruction characteristic of lossy compression techniques. For example, computer data, such as an executable file, must be precisely reconstructed; otherwise the file is not likely to execute properly. Similarly, if important data is being transmitted, the failure to precisely reconstruct the transmitted data is likely to lead to the loss of at least some of the transmitted data.
At present, various entropic compression methods and pre-compression transformations exist. The existing lossless compression algorithms are typically categorized according to their approach to extracting predictive information and/or repetitive patterns embedded in the signal, and according to the methods used to efficiently encode that information.
Typically, lossless compression algorithms encode the source information in a more compact and optimized way using global statistics and information.
When lossless compression techniques are utilized for communication purposes, optimized encoding is achieved by using information embedded in the previously transmitted data. However, in order for such applications to operate successfully, the encoder and decoder must remain synchronized with each other regarding the transmitted/received data that is used to encode/decode subsequent data.
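The dependence on previously transmitted data may be illustrated by the following minimal sketch, which assumes Python's standard `zlib` module; the helper names `send` and `receive` and the sample message are hypothetical. A single encoder object and a single decoder object each retain the history of the stream, and each message is flushed with `Z_SYNC_FLUSH` so that it can be decoded as soon as it is received.

```python
import zlib

# One long-lived encoder/decoder pair; both retain the stream history.
encoder = zlib.compressobj()
decoder = zlib.decompressobj()


def send(message: bytes) -> bytes:
    """Encode one message; Z_SYNC_FLUSH makes it decodable immediately."""
    return encoder.compress(message) + encoder.flush(zlib.Z_SYNC_FLUSH)


def receive(packet: bytes) -> bytes:
    """Decode one packet using the history of all earlier packets."""
    return decoder.decompress(packet)


msg = b"status=OK;temperature=21;unit=celsius\n"  # sample message (assumed)

first = send(msg)    # encoded with no usable history
second = send(msg)   # encoded with reference to the first message

received_first = receive(first)
received_second = receive(second)
print(len(first), len(second))
```

In this sketch the second packet is shorter than the first because it refers back to data already transmitted. A decoder that had not processed the first packet would lack the history to which the second packet refers, which is the synchronization requirement described above.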
Two well-known lossless data compression algorithms, LZ77 and LZ78, were described by A. Lempel and J. Ziv in 1977 and 1978, respectively. Both algorithms are dictionary coders. LZ77 is the sliding window compression algorithm, which was later shown to be equivalent to the explicit dictionary compression technique of LZ78. However, these two algorithms are only equivalent when the entire data is intended to be decompressed. LZ78 decompression allows random access to the input as long as the entire dictionary is available, while LZ77 decompression must always start at the beginning of the input.
Based on these algorithms, upon arrival of a data-containing stream, its compression may be achieved by detecting known combinations in the form of strings, data blocks, etc., and replacing them with their respective coordinates in a dictionary that the receiving side also uses. By following this method, i.e., where the detected combinations are removed in their entirety from the arriving data stream and only their coordinates (pointers) are conveyed, a compression of the arriving data stream is achieved.
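The replacement of detected combinations by pointers may be sketched as follows. This is a simplified, brute-force illustration in the spirit of a sliding window coder and is not the exact LZ77 algorithm as published; the function names, the token layout (offset, length, next byte) and the window and match-length limits are assumptions made for this example.

```python
def lz77_encode(data: bytes, window: int = 255, max_len: int = 15):
    """Emit (offset, length, next_byte) tokens using a sliding window."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Search the window of previously seen data for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            # Stop one byte early so a literal always follows the match.
            while (length < max_len
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        tokens.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return tokens


def lz77_decode(tokens) -> bytes:
    """Rebuild the data by copying from the already decoded output."""
    out = bytearray()
    for offset, length, next_byte in tokens:
        for _ in range(length):
            out.append(out[-offset])  # byte-wise copy permits overlap
        out.append(next_byte)
    return bytes(out)


sample = b"abracadabra abracadabra"
tokens = lz77_encode(sample)
restored = lz77_decode(tokens)
```

Each token with a non-zero length stands for a run of bytes that is not itself transmitted; only its coordinates in the window are conveyed, and the decoder resolves them against its own copy of the history.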
Many methods have been suggested in order to establish an appropriate pointer for the detected data blocks. One such method, for example, relies on the fact that in a certain data stream one is likely to find a repetition of data. Therefore, a typical solution would be to use a history of the data as a dictionary for the data yet to arrive. U.S. Pat. No. 5,936,560 discloses such a solution wherein a dictionary window is used for comparing a stored history of data with data to be compressed, and, when a data match is found, a code, e.g., a pointer, indicating a length of the data match and a code indicating a relative position of data in the dictionary window that produced the data match, are generated. In the comparison, there are m groups of data in the dictionary window, each of the m groups includes a total of n data, that are compared substantially simultaneously with a total of n data in the data to be compressed, where m=2, 3, . . . and n=2, 3, . . . , and the compressed data is generated by encoding the data that produced the longest data match.
P. Deutsch in “DEFLATE Compressed Data Format Specification version 1.3”, RFC 1951, Network Working Group, 1996 suggested a lossless compressed data format that compresses data using a combination of the LZ77 algorithm and Huffman coding. The data can be produced or consumed, even for an arbitrarily long sequentially presented input data stream, using only an a priori bounded amount of intermediate storage.
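Sequential production and consumption of such a format may be illustrated by the following minimal sketch, which assumes Python's standard `zlib` module. A negative window-bits value of -15 requests a raw DEFLATE stream with no zlib or gzip wrapper; the generator names `deflate_stream` and `inflate_stream` are hypothetical.

```python
import zlib


def deflate_stream(chunks):
    """Compress an iterable of byte chunks into raw DEFLATE output."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15: raw DEFLATE
    for chunk in chunks:
        out = comp.compress(chunk)
        if out:
            yield out
    yield comp.flush()  # emit whatever is still buffered


def inflate_stream(chunks):
    """Decompress raw DEFLATE chunks back into the original bytes."""
    decomp = zlib.decompressobj(-15)
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail


source = [b"hello world " * 50 for _ in range(20)]  # sample input (assumed)
packed = list(deflate_stream(source))
restored = b"".join(inflate_stream(packed))
```

The input is handled one chunk at a time, so the compressor never needs the whole stream in memory. The lists built here serve only to check the round trip.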
In typical known solutions, for every byte that arrives, one needs to calculate and store a hash function, so that when an identical byte arrives its hash function is calculated and used while retrieving the required pointer from a hash table. The best match is declared when the highest number of consecutive matching bytes is detected. One of the major drawbacks of such a mechanism is that for every entry in the dictionary (irrespective of how this dictionary was constructed) it is necessary to calculate and to store a corresponding hash function, and to do so also for the newly arriving data. Therefore, not only does this mechanism require substantial computational resources, it also requires a substantial amount of storage capacity to cope with the above requirements.
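A hash-based match search of the general kind described above may be sketched as follows. This is an illustrative assumption rather than the mechanism of any particular known solution: the key is taken over the three bytes starting at each position, Python's built-in hashing of that key stands in for the hash function, and the name `find_longest_matches` is hypothetical.

```python
from collections import defaultdict

MIN_MATCH = 3  # assumed number of bytes hashed at each position


def find_longest_matches(data: bytes, min_match: int = MIN_MATCH):
    """Return (matches, table); matches holds (pos, earlier_pos, length)."""
    table = defaultdict(list)  # key -> earlier positions with that key
    matches = []
    for i in range(len(data) - min_match + 1):
        key = data[i:i + min_match]  # hashed by the dict on every lookup
        best_pos, best_len = -1, 0
        for cand in table[key]:
            # Count consecutive matching bytes from the candidate position.
            length = 0
            while (i + length < len(data)
                   and data[cand + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = cand, length
        if best_len >= min_match:
            matches.append((i, best_pos, best_len))
        table[key].append(i)  # one stored entry per input position
    return matches, table


sample = b"abcdefXabcdefY"
matches, table = find_longest_matches(sample)
stored_entries = sum(len(positions) for positions in table.values())
```

In this sketch one hash computation and one stored table entry are incurred for every input position, which reflects the computational and storage burden noted above.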
What is needed, therefore, is a way to increase the efficiency of the compression, to reduce the number of required calculations, and to reduce the storage capacity required for the hash table.