Considerable effort has been invested in recent years in developing efficient general-purpose data compression algorithms that adapt themselves to several types of information and obtain results equivalent to those of special-purpose algorithms, which rely on advance knowledge of the characteristics of the information. Ideally, systems that accomplish this task should be fast and obtain good compression rates while, on the other hand, having limited complexity.
The works published in recent years have been primarily oriented toward fulfilling both demands and reaching an optimal compromise. Data compression methods are rooted in the redundancy that exists in most types of data. The non-optimality of human communication is well established, as can be seen, for example, in text files. Methods that first gather the statistics or the lexicographic rules and only then carry out the compression are classified as “Off-line methods”. A general disadvantage of this principle is the need to transmit additional information describing these statistics. On the other hand, a method is classified as Adaptive if the compression is based on the information processed so far, which in most cases is achieved by an on-line or “one-pass” procedure.
There are many approaches in the prior art for compressing data. D. A. Huffman, in “A method for the construction of minimum-redundancy codes”; Proceedings of the IRE; Vol. 40; 1952; pp. 1098-1101, proposed a classical algorithm that assigns variable-length codes to the fixed-length symbols (for example, bytes) of the input alphabet, according to the statistical probability of each symbol in the context of the whole string or input file.
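The static construction described above can be illustrated with a minimal sketch. It is not taken from the cited paper: the function name `huffman_code`, the use of a binary heap, and the tie-breaking counter are assumptions made for illustration, and the symbols are assumed to be single characters (or bytes) rather than tuples.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(data):
    """Return {symbol: bitstring} for a static Huffman code built from `data`.

    Illustrative sketch only; symbol frequencies are gathered over the
    whole input first, i.e. this is an off-line (two-pass) scheme.
    """
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                      # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    tiebreak = count()                      # keeps heap entries comparable on equal weights
    heap = [(w, next(tiebreak), sym) for sym, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least probable subtrees.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                               # leaf: an input symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```

For the input "aaaabbc", the most frequent symbol receives a 1-bit code and the two rarer symbols receive 2-bit codes, so the 7 symbols are coded in 10 bits. The code table itself must still be transmitted, which is the additional information that off-line methods require.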
The next generation of Huffman codes aimed to make the algorithm one-pass and adaptable to changes in the context. With these modifications, the compressor could take into account the probabilities of the symbols processed so far when deciding how to code the next incoming symbol. See D. E. Knuth, “Dynamic Huffman coding”; Journal of Algorithms; Vol. 6; No. 2; June 1985; pp. 163-180; J. S. Vitter, “Design and analysis of dynamic Huffman codes”; Journal of the ACM; Vol. 34; No. 4; October 1987; pp. 825-845; and also J. S. Vitter, “Dynamic Huffman coding”; ACM Transactions on Mathematical Software; Vol. 15; No. 2; June 1989; pp. 158-167.
More recently, a way was discovered to avoid the coding waste that arises from representing a fractional information content (each symbol Xi contributes -P(Xi)*Log2 P(Xi) bits to the entropy) with an integer number of bits. This led to so-called arithmetic coding. The advantage of this type of coding is that its optimality with respect to the entropy is limited only by practical considerations regarding the precision of the arithmetic required. As a rule, the results obtained are very close to the theoretical entropy.
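The gap between the entropy and an integer-bit code can be made concrete with a small calculation. The sketch below is illustrative only; the function name `entropy_bits_per_symbol` is an assumption, and the probabilities are estimated from the symbol counts of the input itself.

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data):
    """Zero-memory entropy of `data`: -sum(P(x) * log2 P(x)) over its symbols."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

# A skewed two-symbol source: P('a') = 0.9, P('b') = 0.1.
sample = "a" * 9 + "b"
h = entropy_bits_per_symbol(sample)   # about 0.469 bits per symbol
ideal_total = h * len(sample)         # about 4.69 bits for the 10 symbols
integer_code_total = len(sample) * 1  # any prefix code spends at least 1 bit per symbol
```

For this source a code restricted to whole bits per symbol spends 10 bits where the entropy bound is about 4.69 bits; this is the waste that arithmetic coding is designed to avoid.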
Nevertheless, none of the above first-order methods is able to exploit the correlation, present in most types of data, between a sub-string and the character that follows it. To illustrate this statement, consider the case in which the letter ‘q’ is the last symbol processed. In German and English (among other languages), the probability that ‘u’ is the next incoming letter is quite high, no matter how frequently the symbol ‘u’ has appeared in the preceding string, or whether it has appeared at all. Because of this, the compression ratios reachable by first-order compression algorithms are limited. High-order methods, discussed below, usually obtain higher compression rates. High-order models have been devised for the statistical methods (see, for example, Cleary et al.; “Unbounded Length Contexts for PPM”; Data Compression Conference (DCC95); 1995), but the increase in performance comes at the cost of a substantial decrease in speed and high memory requirements.
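The ‘q’/‘u’ observation can be checked with a very small context model. This is a hedged sketch, not one of the cited high-order methods: the function name `order1_counts` and the sample text are assumptions chosen for illustration, and only a single preceding character is used as context.

```python
from collections import Counter, defaultdict

def order1_counts(text):
    """Count, for each character, which characters follow it in `text`."""
    ctx = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        ctx[prev][nxt] += 1
    return ctx

text = "quick quiet queen"
ctx = order1_counts(text)

# Probability of 'u' ignoring context versus given that the previous symbol was 'q'.
p_u = text.count("u") / len(text)                        # 3 of 17 symbols
p_u_given_q = ctx["q"]["u"] / sum(ctx["q"].values())     # 3 of 3 successors of 'q'
```

In this sample the context-free estimate of ‘u’ is 3/17, whereas after ‘q’ it is 1.0, so a coder that conditions on the previous symbol would spend far fewer bits on that ‘u’. The price is one frequency table per context, which is the memory growth noted above.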
Another principle for data compression was proposed by J. Ziv and A. Lempel in 1977 and later in 1978, and gave rise to another family of compression algorithms, usually classified as substitutional. The paradigm behind their ideas is the following: to create a Copy codeword that expresses, if possible, the previous (or one of the previous) occurrences of the next incoming sub-string; otherwise the next character is expressed at the output in its original form, tagged accordingly. The two papers by these authors generated different approaches to the manner in which the Copy codewords are determined and constructed.
A window approach is disclosed in “A universal algorithm for sequential data compression”; IEEE Transactions on Information Theory; IT-23; No. 3; 1977; pp. 337-343. In the specialized literature this concept is also known as ZL1, or “sliding dictionary”. The identifying characteristic of this algorithm is a window traveling through the original uncompressed data. In each step, the window over the past data is searched for a sub-string that matches the incoming sub-string. The goal is to identify the longest identical match possible. As a result of the search, a pair <Position, Length> is established that uniquely specifies the next incoming sub-string as a match against a sub-string from the past data window. This pair is then encoded and appended to the coded data stream. Finally, the window is moved forward so as to include the newly matched sub-string, discarding, if necessary, the oldest characters in the window so as to keep the window size constant. If the sub-string at the input cannot match any previous sub-string within the window over two or more symbols (the threshold length), the incoming character(s) are reflected in the output in their original form, also known as Literal form. The two coding cases are marked accordingly so as to allow decompression.
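The window search and the two output cases (Copy and Literal) can be sketched as follows. This is an illustrative brute-force version under stated assumptions, not the encoding of the cited paper: the names `lz_compress` and `lz_decompress`, the token tuples, and the parameters `window`, `min_len` and `max_len` are hypothetical, the Position is expressed as a distance back from the current character, and the input is assumed to be a Python string.

```python
def lz_compress(data, window=4096, min_len=2, max_len=255):
    """Emit ("lit", ch) or ("copy", distance, length) tokens for `data`."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Exhaustive search of every start position inside the window.
        for start in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[start + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= min_len:             # threshold length reached: Copy codeword
            out.append(("copy", best_dist, best_len))
            i += best_len
        else:                               # no useful match: Literal
            out.append(("lit", data[i]))
            i += 1
    return out

def lz_decompress(tokens):
    """Rebuild the original string from the tokens of lz_compress."""
    buf = []
    for tok in tokens:
        if tok[0] == "lit":
            buf.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):         # copy one symbol at a time so that
                buf.append(buf[-dist])      # a match may overlap the current position
    return "".join(buf)
```

For the input "abcabcabc" this sketch emits three Literals followed by a single Copy of length 6 at distance 3. The nested loops make the cost of the search proportional to the window size for every input position, which is why practical implementations add an index structure.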
Extensive work has been done to establish better ways of solving the search tasks associated with this algorithm, and practical software implementations abound for several operating systems. Enhancements to this approach (in the method known as LZSS) were made popular by the former Stac, as reflected in the disclosures of U.S. Pat. No. 5,016,009; U.S. Pat. No. 5,126,739; U.S. Pat. No. 5,414,425; U.S. Pat. No. 5,463,390; and U.S. Pat. No. 5,506,580 by D. L. Whiting et al. A discussion of the search strategies and variations on ZL1 can be found in T. Bell; “Modeling for text compression”; ACM Computing Surveys; Vol. 21; No. 4; December 1989; pp. 557-591. A fundamental property of this approach is that the Position pointer indicates the character position in the window at which the matched sub-string starts. In the remainder of this text, we shall denote this representation of the pointer as “External”, for reasons that will become apparent in the disclosure of our invention.
One of the main drawbacks of ZL1 is the slow nature of the search process. In practical implementations, an additional data structure is constructed to speed up the search task. This structure normally contains the last N addresses at which each individual symbol (which at the same time marks the start of a sub-string) has been found within the window, and it is updated for each incoming symbol by inserting the new address (and, where necessary, deleting the oldest address in the sub-list of that symbol). Still, inserting into the structure all the sub-strings that start at every external position is a time-consuming procedure, and it is therefore necessary to limit the maximal Match Length that the search can identify. Besides, for practical reasons, the number N of stored addresses at which the sub-strings start has to be limited (for example, to the last 3 occurrences). Taken together, these limits result in a non-exhaustive search that produces non-optimal results.
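The effect of keeping only the last N addresses per symbol can be shown with a minimal sketch. The names `longest_match_indexed` and `build_and_search`, and the parameters `n_last` and `max_len`, are assumptions for illustration; a bounded `collections.deque` stands in for the per-symbol sub-list, and the discarding of addresses that leave the window is omitted for brevity.

```python
from collections import defaultdict, deque

def longest_match_indexed(data, i, index, max_len=32):
    """Longest match for data[i:], trying only the stored addresses of data[i]."""
    best_len, best_pos = 0, -1
    for start in index[data[i]]:
        length = 0
        while (length < max_len and i + length < len(data)
               and data[start + length] == data[i + length]):
            length += 1
        if length > best_len:
            best_len, best_pos = length, start
    return best_pos, best_len

def build_and_search(data, n_last=3, max_len=32):
    """Return the (position, length) found at every input position."""
    # One bounded sub-list per symbol; appending to a full deque
    # silently drops its oldest address.
    index = defaultdict(lambda: deque(maxlen=n_last))
    matches = []
    for i, sym in enumerate(data):
        matches.append(longest_match_indexed(data, i, index, max_len))
        index[sym].append(i)
    return matches

data = "abcd" + "ax" * 3 + "abcd"
full = build_and_search(data, n_last=4)[10]      # address 0 is still stored
limited = build_and_search(data, n_last=1)[10]   # only the newest 'a' is stored
```

With four stored addresses the second "abcd" is matched in full against position 0; with a single stored address the search only sees the most recent ‘a’ and finds a match of length 1. This is the non-exhaustive behaviour described above.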
To overcome these limitations, a hardware architecture was proposed in U.S. Pat. No. 5,612,693, and later enhanced in U.S. Pat. No. 5,627,534; U.S. Pat. No. 5,874,907; and U.S. Pat. No. 5,877,711 by D. J. Craft, involving the use of a special Content Addressable Memory (CAM) to perform a parallel search in a limited-size window, in a method known as ALDC. The exhaustive search achieved by this method compensates for the reduced window size (a result of the technological complexity of the CAM), and the overall compression rates are reported by the author to be similar to those of LZSS.
Another philosophy within the substitutional methods is set out in J. Ziv, A. Lempel; “Compression of individual sequences via variable-rate coding”; IEEE Transactions on Information Theory; IT-24; No. 5; September 1978; pp. 530-536. It constitutes the basis for the disclosure of W. L. Eastman, J. Ziv, A. Lempel et al.; “Apparatus and methods for compressing data signals and restoring the compressed data”; U.S. Pat. No. 4,464,650, 1984. Further improvements are disclosed in T. Welch, U.S. Pat. No. 4,558,302, 1985. We shall refer to the latter, since it was very popular for some time. For details, see also T. Welch; “A Technique for High Performance Data Compression”; Computer; Vol. 17; No. 6; June 1984; pp. 8-19.
In Ziv-Lempel-Welch (ZLW), a dynamic dictionary is constructed (hence the name Dynamic Dictionary Method, as it is also known) that contains sub-strings of intrinsically variable length. Whenever the incoming sub-string matches a dictionary element, it is replaced at the output (the encoded bit-stream) by the address of the matching dictionary element. The lower addresses of the dictionary also contain all the single symbols of the input alphabet, so single non-matching characters (whose matching length is lower than 2) can likewise be replaced by their corresponding node address and transferred to the output. If a search in ZLW finds a sub-string S in the dictionary (S initially consists of 2 characters, and every new search adds a symbol), the dictionary is examined again with the pair <S, Next symbol> until a non-match is found, i.e. until the new pair is not included in the dictionary. At this point, the last matching address (pointer) of the dictionary is sent out, and the non-matched pair <S, Next symbol> is inserted into the dictionary at a new address. Hence, the ZLW method can extend a dictionary entry by only one symbol per compression step. Finally, the processing of the input string resumes from the last non-matching symbol. As a consequence, the sub-strings contained in a dictionary codeword are not automatically included as dictionary sub-strings themselves (for example, dictionary sub-string x could contain the symbols ‘abcde’, but the sub-strings ‘bcde’, ‘cde’ and ‘de’ will still not be created). Besides, on the first appearance of a sub-string the method cannot encode it to the output stream with the address of the newly created element, but only with that of the previously existing sub-string. The new sub-string becomes effective only on its second appearance.
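The dictionary growth described above can be traced with a short sketch of the compression side. It is illustrative only: the function name `lzw_compress` is an assumption, the alphabet is assumed to be the 256 eight-bit characters, the dictionary is a plain Python dict without any size limit, and the output is a list of integer addresses rather than a packed bit-stream.

```python
def lzw_compress(data):
    """Return (list of dictionary addresses, final dictionary) for `data`.

    Assumes every character of `data` has a code point below 256.
    """
    # Lower addresses hold all single symbols of the input alphabet.
    dictionary = {chr(c): c for c in range(256)}
    s, out = "", []
    for ch in data:
        if s + ch in dictionary:
            s += ch                                  # keep extending the match
        else:
            out.append(dictionary[s])                # address of the longest match
            dictionary[s + ch] = len(dictionary)     # one new entry per step
            s = ch                                   # resume at the non-matching symbol
    if s:
        out.append(dictionary[s])                    # flush the final match
    return out, dictionary

codes, table = lzw_compress("abababab")
# codes == [97, 98, 256, 258, 98]
```

In this trace the entry ‘ab’ (address 256) is created when ‘ab’ first appears but is only emitted on its second appearance, and the dictionary gains exactly one entry per emitted address except the last, which matches the one-symbol-per-step growth noted above.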
Hashing procedures are useful for this type of compression, since they keep the amount of memory used for the dictionary reasonable and increase the speed of the dictionary search. Still, ZLW fails to “learn” the characteristics of the input data quickly, because updating the dictionary is an iterative process; there is therefore a non-zero threshold level after which the method begins to operate efficiently. This implies that on short input files it produces rather disappointing results compared with competing methods, and in general the compression rates reached are slightly below those obtained with LZSS. Other drawbacks appear when the method is applied to a continuous flow of input strings and the dictionary fills up.
In U.S. Pat. No. 4,906,991, “Text substitution data compression with finite length search window”, E. Fiala et al. further develop a so-called Suffix-Trie disclosed in R. Morrison; “Patricia-Practical Algorithm to retrieve information coded in alphanumeric”; Journal of the ACM; Vol. 15; No. 4; 1968; pp. 513-534, and propose a series of compression algorithms that start from the “External” representation of ZL1 and move closer to the “Internal” representation of ZL2. Regrettably, central to their compressor is the fusion of several non-compressible symbols into a single codeword (with a prefixed symbol count). This in turn leads to a high latency at the output of the algorithm. Further, as the approach is mainly a software data compression technique, the Trie it constructs requires substantial memory resources, which has prevented its implementation in hardware solutions.
Later works on the use of Trie structures for creating dynamic dictionaries are described in U.S. Pat. No. 5,406,279 of Anderson; U.S. Pat. No. 5,640,551 of Chou; and U.S. Pat. No. 6,012,061 of Sharma. However, they focus on additional techniques for maintaining and updating the Trie dictionary when it becomes full as a result of processing sufficiently large input files. None of them addresses the above-mentioned limitations, nor provides the key properties needed for fast hardware data compression solutions.