The present invention relates generally to data storage and communications systems and, more particularly, to improving the capacity and use of such data storage and communications systems.
Conventional data compression techniques and systems encode a stream of digital data into a compressed code stream and decode the compressed code stream back into a corresponding original data stream. The code stream is referred to as xe2x80x9ccompressedxe2x80x9d because the stream typically consists of a smaller number of codes than symbols contained in the original data stream. Such smaller codes can be advantageously stored in a corresponding smaller amount of memory than the original data. Further, the compressed code stream can be transmitted in a communications system, e.g., a wired, wireless, or optical fiber communications system, in a corresponding shorter period of time than the uncompressed original data. The demand for data transmission and storage capacity in today""s communications networks, as required by the significant increase in content exchanges across such networks, is ever-increasing. Thus, data compression plays an integral role in most modem transmission protocols and communications networks.
As is well-known, two classes of compression techniques useful in the compression of data are so-called special purpose compression and general purpose compression. Special purpose compression techniques are designed for compressing special types of data and are often relatively inexpensive to implement. For example, well-known special purpose compression techniques include run-length encoding, zero-suppression encoding, null-compression encoding, and pattern substitution. These techniques generally have relatively small compression ratios due to the fact that they compress data which typically possesses common characteristics and redundancies. As will be appreciated, a compression ratio is the measure of the length of the compressed codes relative to the length of the original data. However, special purpose compression techniques tend to be ineffective at compressing data of a more general nature, i.e., data that does not possess a high degree of common characteristics and the like.
In contrast, general purpose compression techniques are not designed for specifically compressing one type of data and are often adapted to different types of data during the actual compression process. Some of the most well-known and useful general purpose compression techniques emanate from a family of algorithms developed by, J. Ziv and A. Lempel, and commonly referred to in the art as xe2x80x9cLempel-Ziv codingxe2x80x9d. In particular, Ziv et al., xe2x80x9cA Universal Algorithm for Sequential Data Compressionxe2x80x9d, IEEE Transactions on Information Theory, IT-23(3):337-343, May 1977 (describing the commonly denominated xe2x80x9cLZ1xe2x80x9d algorithm), and Ziv et al., xe2x80x9cCompression of Individual Sequences Via Variable-Rate Codingxe2x80x9d, IEEE Transactions on Information Technology, IT-24(5):530-536, September 1978 (describing the commonly denominated xe2x80x9cLZ2xe2x80x9d algorithm), which are each hereby incorporated by reference for all purposes. The LZ1 and LZ2 data compression schemes are well-known in the art and need not be discussed in great detail herein.
In brief, the LZ1 (also referred to and known in the art as xe2x80x9cLZ77xe2x80x9d) data compression process is based on the principle that a repeated sequence of characters can be replaced by a reference to an earlier occurrence of the sequence, i.e., matching sequences. The reference, e.g., a pointer, typically includes an indication of the position of the earlier occurrence, e.g., expressed as a byte offset from the start of the repeated sequence, and the number of characters, i.e., the matched length, that are repeated. Typically, the references are represented as xe2x80x9c less than offset, length greater than xe2x80x9d pairs in accordance with conventional LZ1 coding. In contrast, LZ2 (also referred to and known in the art as xe2x80x9cLZ78xe2x80x9d) compression parses a stream of input data characters into coded values based on an adaptively growing look-up table or dictionary that is produced during the compression. That is, LZ2 does not find matches on any byte boundary and with any length as in LZ1 coding, but instead when a dictionary word is matched by a source string, a new word is added to the dictionary which consists of the matched word plus the following source string byte. In accordance with LZ2 coding, matches are coded as pointers or indexes to the words in the dictionary.
As mentioned above, the art is replete with compression schemes derived on the basic principles embodied by the LZ1 and LZ2 algorithms. For example, Terry A. Welch (see, T. A. Welch, xe2x80x9cA Technique for High Performance Data Compressionxe2x80x9d, IEEE Computer, pp. 8-19, June 1984, and U.S. Pat. No. 4,558,302, issued to Welch on Dec. 10, 1985, each of which is incorporated by reference for all purposes) later refined the LZ2 coding process to the well-known xe2x80x9cLempel-Ziv-Welchxe2x80x9d (xe2x80x9cLZWxe2x80x9d) compression process. Both the LZ2 and LZW compression techniques are based on the generation and use of a so-called string table that maps strings of input characters into fixed-length codes. More particularly, these compression techniques compress a stream of data characters into a compressed stream of codes by serially searching the character stream and generating codes based on sequences of encountered symbols that match corresponding longest possible strings previously stored in the table, i.e., dictionary. As each match is made and a code symbol is generated, the process also stores a new string entry in the dictionary that comprises the matched sequence in the data stream plus the next character symbol encounter in the data stream.
As will be appreciated and as detailed above, the essence of Lempel-Ziv coding is finding strings and substrings which are repeated in the original data stream, e.g., in a document to be transmitted. The repeated phrases in the document under compression are replaced with a pointer to a place where they have occurred earlier in the original data stream, e.g., document. As such, decoding data, e.g., text, which is compressed in this manner simply requires replacing the pointers with the already decoded text to which it points. As is well-known, one primary design consideration in employing Lempel-Ziv coding is determining whether to set a limit on how far back a pointer can reach, and what that limit should be. A further design consideration of Lempel-Ziv coding involves which substrings within the desired limit may be a target of a pointer. That is, the reach of a pointer into earlier text may be unrestricted, i.e., a so-called growing window, or may be restricted to a fixed size window of the previous xe2x80x9cNxe2x80x9d characters, where N is typically in the range of several thousand characters, e.g., 3 kilobytes. In accordance with this coding repetitions of strings are discovered and compressed only if they both appear in the window. As will be appreciated, the considerations made regarding such Lempel-Ziv coding design choices represent a compromise between speed, memory requirements, and compression ratio. Sliding windows do, however, present at least one drawback: sliding windows do not find strings that occur far apart, e.g., 10,000 character, in the input text.
While prior art compression schemes, such as the aforementioned LZ1, LZ2, and LZW compression methods, provide effective data compression, efforts continue to search for even greater compression to reduce storage requirements and transmission times.
The present invention provides a method and apparatus for achieving relatively low compression ratios based on our realization of using a longer history and longer common strings of the input data stream as an initial evaluation of the input data prior to applying a compression process, e.g., any Lempel-Ziv compression methodology. That is, given that typical compression processes utilize a relatively short, i.e., the most current few kilobytes, history of the input data to effect the desired compression, we have recognized that utilizing a longer history in conjunction with longer common string sequences provides increased compression efficiencies particularly when compressing larger input streams having larger numbers of repeated long strings.
More particularly, in accordance with the invention, the input data is preprocessed by applying string-matching to the extract long common strings. In accordance with the preferred embodiment, the input data is divided into a series of blocks with each individual block having a uniform size, illustratively, 1000 characters in length. Further, in accordance with the preferred embodiment of the invention, a so-called fingerprint is computed and stored for each block. A fingerprint is a relatively small signature of a large text string. For example, a 1000 character string may be mapped into a 32 bit long fingerprint. As such, identical strings will always have the same fingerprint. Furthermore, unequal strings almost always have unequal fingerprints (within a certain probability factor). In accordance with the invention, the input data stream is traversed and comparison is made between interim fingerprints computed from a particular set of characters of the input stream, illustratively, on a character-by-character basis, and the previously computed and stored fingerprints. In accordance with the preferred embodiment, the input stream is traversed as a function of a sliding window having a uniform block size and the interim fingerprints are computed from the present window of characters and compared to the previously stored fingerprints. Upon detecting a match between the fingerprints, in accordance with the preferred embodiment, the input stream is encoded with an identifier determined as function of the detected match. Illustratively, the encoded identifier contains a starting position of the matching string in the original input stream and the string length. Thereafter, in accordance with the preferred embodiment, a further compression of the preprocessed and encoded input stream is made using Lempel-Ziv compression.
Advantageously, in accordance with the invention, a larger history of the input data is examined with the identification of longer common strings without a significant increase in overall storage requirements. Thus, greater compression ratios are achieved in accordance with the invention with a wide variety of compression methods. That is, the principles of the invention are independent of any particular one compression scheme and, therefore, the advantages of employing the various aspects of invention are realized with a wide variety of compression methodologies.
Fingerprinting, as a string-matching mechanism, is not new. Fingerprints have been used previously for string-matching in text-processing systems and, in particular, for searching for long common strings in a text file. For example, R. M. Karp and M. O. Rabin, xe2x80x9cEfficient Randomized Pattern-Matching Algorithmsxe2x80x9d, IBMJ Res. Develop., Vol. 31, No. 2, pp. 249-260, March 1987, which is hereby incorporated by reference for all purposes, describe the use of fingerprints in the context of string searching. It has, however, remained for the inventors herein to recognize that fingerprinting can be employed in the delivery of an elegant compression tool by which enhanced data compression can be delivered to significantly larger input data streams having large numbers of repeated long strings without a significant increase in overall storage requirements.