1. Field of The Invention
The present invention generally relates to the field of lossless data compression techniques in processing digital data signals, wherein the digital data signals are compressed and subsequently reconstituted by transforming a body of data to a typically smaller representation from which the original can be reconstructed at a later time. Further, the present invention relates to the field of lossless data compression, wherein the digital data that is compressed and then subsequently decompressed is always kept identical to the original. More particularly, the present invention relates to the field of encoding algorithms for a data compression method which utilizes tokenizing techniques to achieve high compression speed and high compression ratio.
2. Description of The Prior Art
Several methods for performing digital data compression are known in the prior art. Generally, an alphabet is a finite set containing at least one element. The elements of an alphabet are called characters. A string over an alphabet is a sequence of characters, each of which is an element of that alphabet. A common approach to compressing a string of characters is textual substitution. A textual substitution data compression method is any data compression method that compresses text by identifying repeated substrings and replacing some of those substrings with references to other copies. Such a reference is commonly known as a pointer, and the string to which the pointer refers is called a target. Therefore, in general, the input to a data compression algorithm employing textual substitution is a sequence of characters over some alphabet, and the output is a sequence of characters from the alphabet interspersed with pointers.
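The textual substitution idea described above can be illustrated with a minimal sketch. It is not taken from any of the cited patents: the function names `compress` and `decompress`, the (distance, length) pointer layout, and the minimum target length of 3 are all assumptions chosen for illustration, and the exhaustive search is deliberately naive.

```python
from typing import List, Tuple, Union

# A token is either a literal character or a pointer (distance, length)
# referring back to a target in the text already processed.
Token = Union[str, Tuple[int, int]]

def compress(text: str, min_len: int = 3) -> List[Token]:
    """Replace repeated substrings by pointers to an earlier copy (naive search)."""
    out: List[Token] = []
    i = 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # Try every earlier start position j as a possible target.
        for j in range(i):
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            out.append((best_dist, best_len))  # pointer
            i += best_len
        else:
            out.append(text[i])                # literal character
            i += 1
    return out

def decompress(tokens: List[Token]) -> str:
    """Rebuild the original string from literals and pointers."""
    chars: List[str] = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            # Copy one character at a time so a target may overlap the output.
            for _ in range(length):
                chars.append(chars[-dist])
        else:
            chars.append(tok)
    return "".join(chars)

tokens = compress("abcabcabc")
print(tokens)               # ['a', 'b', 'c', (3, 6)]
print(decompress(tokens))   # abcabcabc
```

The output is exactly the mix described above: characters from the alphabet interspersed with pointers. Because the decoder sees only literals and back-references, the reconstruction is lossless.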
The following prior art patents are representative of known prior art data compression methods:
1. U.S. Pat. No. 4,464,650 issued to Eastman et al. on Aug. 7, 1984 for "Apparatus And Method For Compressing Data Signals And Restoring The Compressed Data Signals" (hereafter the "Eastman Patent").
2. U.S. Pat. No. 4,558,302 issued to Welch on Dec. 10, 1985 for "High Speed Data Compression And Decompression Apparatus And Method" (hereafter the "Welch Patent").
3. U.S. Pat. No. 4,586,027 issued to Tsukiyama et al. on Apr. 29, 1986 for "Method And System For Data Compression And Restoration" (hereafter the "Tsukiyama '027 Patent").
4. U.S. Pat. No. 4,560,976 issued to Finn on Dec. 24, 1985 for "Data Compression" (hereafter the "Finn Patent").
5. U.S. Pat. No. 3,914,586 issued to Mcintosh on Oct. 21, 1975 for "Data Compression Method And Apparatus" (hereafter the "Mcintosh Patent").
6. U.S. Pat. No. 4,682,150 issued to Mathes et al. on Jul. 21, 1987 for "Data Compression Method And Apparatus" (hereafter the "Mathes Patent").
7. U.S. Pat. No. 4,872,009 issued to Tsukiyama et al. on Oct. 3, 1989 for "Method And Apparatus For Data Compression And Restoration" (hereafter the "Tsukiyama '009 Patent").
8. U.S. Pat. No. 4,758,899 issued to Tsukiyama on Jul. 19, 1988 for "Data Compression Control Device" (hereafter the "Tsukiyama '899 Patent").
9. U.S. Pat. No. 4,809,350 issued to Shimoni et al. on Feb. 28, 1989 for "Data Compression System" (hereafter the "Shimoni Patent").
10. U.S. Pat. No. 4,087,788 issued to Johannesson on May 2, 1978 for "Data Compression System" (hereafter the "Johannesson Patent").
11. U.S. Pat. No. 4,677,649 issued to Kunishi et al. on Jun. 30, 1987 for "Data Receiving Apparatus" (hereafter the "Kunishi Patent").
12. U.S. Pat. No. 5,016,009 issued to Whiting et al. on May 14, 1991 for "Data Compression Apparatus and Method" (hereafter "the '009 Whiting Patent").
13. U.S. Pat. No. 5,003,307 issued to Whiting et al. on Mar. 26, 1991 for "Data Compression Apparatus with Shift Register Search Means" (hereafter "the '307 Whiting Patent").
14. U.S. Pat. No. 5,049,881 issued to Gibson and Graybill on Sep. 17, 1991 for "Apparatus and Method For Very High Data Rate-Compression Incorporating Lossless Data Compression And Expansion Utilizing A Hashing Technique" (hereafter "the '881 Patent").
In general, as illustrated by the above patents, data compression systems are known in the prior art that encode a stream of digital data signals into compressed digital code signals and decode the compressed digital code signals back into the original data. Various data compression systems are known in the art which utilize special purpose compression methods designed for compressing special classes of data. The major drawback to such systems is that they only work well with the special class of data for which they were designed and are very inefficient when used with other types of data. The following compression systems are considered general purpose.
The best known and most widely used general purpose data compression procedure is the Huffman method. The Huffman method maps fixed length segments of symbols into variable length words. The Huffman method further involves calculating the probabilities of occurrence of the symbols and building a tree whose leaves are the symbols. New nodes are formed by repeatedly combining the lowest probability symbols or nodes, and each new node is in turn placed on the tree.
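A minimal sketch of the tree construction described above is given here for illustration. It uses symbol counts in place of probabilities and repeatedly merges the two least frequent nodes; the function name `huffman_codes` and the tie-breaking order are assumptions, not part of any cited reference.

```python
import heapq
from collections import Counter
from typing import Dict

def huffman_codes(data: str) -> Dict[str, str]:
    """Return a prefix-free bit string for every symbol occurring in data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: code built so far}).
    heap = [(count, idx, {sym: ""})
            for idx, (sym, count) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Combine the two lowest-count nodes into a new node.
        c1, _, codes1 = heapq.heappop(heap)
        c2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (c1 + c2, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
encoded = "".join(codes[ch] for ch in "aaaabbc")
print({s: len(c) for s, c in sorted(codes.items())})  # {'a': 1, 'b': 2, 'c': 2}
print(len(encoded))                                   # 10
```

The sketch computes the symbol counts from the whole input before any code is assigned, which is the prior statistical knowledge the method depends on. A decoder would additionally need the code table itself.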
The Huffman method of data compression has many limitations. The encoding procedure of the Huffman method requires prior knowledge of the statistical characteristics of the source data. This is cumbersome and requires considerable working memory space. In addition, the Huffman method requires intensive calculations for variable bit compression. Moreover, the Huffman method either requires a dictionary in the output stream for reconstruction of the digital signal or requires prior knowledge of the dictionary, which limits its applicability to specific types of data.
A second well known data compression technique is the Tunstall method, which maps variable length segments of symbols into fixed length binary words. The Tunstall method also has many of the disadvantages of the Huffman method and further has the constraint that the output string consists of fixed length binary words.
A third well known data compression technique is the group of Lempel-Ziv ("LZ") methods. A typical LZ method maps variable-length segments of symbols into variable-length binary words. A problem with the LZ methods is that the required memory space grows at a non-linear rate with respect to the input data. An improved variation of the LZ method is disclosed and claimed in the Eastman Patent. The method taught in the Eastman Patent, however, has several major disadvantages: (a) the method requires the creation of a searchtree database and therefore requires storage room for the dictionary; (b) the amount of achievable compression is heavily dependent on the dictionary; (c) management and searching of the dictionary is time consuming, yielding a low data rate-compression factor product; (d) the growth characteristics of the dictionary require N-1 occurrences of a string of length N in the input data in order to establish that string in the dictionary, which results in reduced compression efficiency; and (e) in the worst case, the growth of the output data block is tied directly to the size of the dictionary. Making the dictionary larger can improve overall compression for compressible data, but yields larger percentage growths for incompressible data because more bits are required to represent fixed length dictionary pointers. Finally, the dictionary must be reconstructed during expansion, resulting in a slower reconstitution rate and more required memory space.
The method disclosed in the Welch Patent is very similar to the LZ method described in the Eastman Patent and shares all of the basic problems of the Eastman Patent method. The basic difference is that, instead of storing the dictionary in a tree node type structure, the Welch Patent method compresses an input stream of data character signals by storing in a string table the strings of data character signals encountered in the input stream. This has the additional disadvantage of requiring more storage than the LZ method. While it does provide the advantage of being faster if the number of strings that must be searched is small, it still has the poor dictionary growth characteristics of other LZ methods, such as the one disclosed by the Eastman Patent.
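For illustration, the string table approach can be sketched with the widely published textbook form of the LZW algorithm. This is a simplified sketch and not the claimed apparatus of the Welch Patent: the function names, the unbounded table, and the restriction to characters with code points below 256 are assumptions.

```python
from typing import List

def lzw_compress(data: str) -> List[int]:
    """Emit string table codes; assumes every character has code point < 256."""
    table = {chr(i): i for i in range(256)}  # initial single-character strings
    next_code = 256
    current = ""
    out: List[int] = []
    for ch in data:
        candidate = current + ch
        if candidate in table:
            current = candidate              # keep extending the current string
        else:
            out.append(table[current])       # emit code for longest known string
            table[candidate] = next_code     # table grows by one character
            next_code += 1
            current = ch
    if current:
        out.append(table[current])
    return out

def lzw_decompress(codes: List[int]) -> str:
    """Rebuild the same string table while decoding."""
    if not codes:
        return ""
    table = {i: chr(i) for i in range(256)}
    next_code = 256
    prev = table[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            entry = prev + prev[0]           # code defined by the step being decoded
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = prev + entry[0]
        next_code += 1
        prev = entry
    return "".join(out)

text = "TOBEORNOTTOBEORTOBEORNOT"
codes = lzw_compress(text)
print(len(text), len(codes))
print(lzw_decompress(codes) == text)  # True
```

Each new table entry is only one character longer than an existing entry, so a long repeated string must recur several times before it is represented by a single code. The decoder must also rebuild the entire table during expansion.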
The data compression algorithms disclosed by the two Whiting Patents are very similar. The '009 Whiting Patent discloses a data compression algorithm which maintains an independent "history array means" as a separate dictionary of input data. It also maintains an "offset array means", which is a supportive linking table in addition to a hash table. The '307 Whiting Patent discloses a data compression algorithm which maintains an independent "shift register" as a separate dictionary of input data. It further utilizes a broadcast channel for simultaneously searching the entries of the shift register for matching substrings. However, both Whiting Patents suffer the drawback of having a "history means" which requires additional memory and processing time.
The remaining patents which discuss compression algorithms all require the creation of a dictionary, whether in the form of a tree, a series of strings, or a similar arrangement. Such a dictionary requires substantial memory and storage, and searching it is time consuming, yielding a low data rate-compression factor product. There is a significant need for an improved method for compressing data which eliminates the problems discussed above and provides a faster and more efficient method of compressing the data while at the same time retaining most of the advantages of prior systems.
The '881 Patent discloses a method and apparatus for compressing digital data that is represented as a sequence of characters drawn from an alphabet. An input data block is processed into an output data block composed of sections of variable length. Unlike most other prior art methods, which emphasize the creation of a dictionary comprised of a tree with nodes or a set of strings, the method disclosed in the '881 Patent creates its own pointers from the sequence of characters previously processed and places the highest priority on maximizing the data rate-compression factor product.
One of the many advantages of the '881 Patent is that the compressor can process the input data block very quickly, due to the use of previously input data acting as the dictionary combined with the use of a hashing algorithm to find candidates for string matches and the absence of a traditional string matching table and associated search time. The result of the method disclosed in the '881 Patent is a high data rate-compression factor product achieved due to the absence of any string storage table and matches being tested only against one string.
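The general idea of using previously input data as the dictionary, with a hash of the upcoming characters selecting a single match candidate, can be sketched as follows. This is an illustrative sketch only and does not reproduce the method of the '881 Patent: the hash function, the table size, the 3-byte minimum match length, and the token format are all assumptions.

```python
from typing import List, Tuple, Union

MIN_LEN = 3  # assumed minimum match length, also the number of bytes hashed

Token = Union[int, Tuple[int, int]]  # literal byte, or (distance, length) pointer

def hash_match_compress(data: bytes, table_bits: int = 12) -> List[Token]:
    """Test each position against the one earlier position stored under its hash."""
    mask = (1 << table_bits) - 1
    table = [-1] * (1 << table_bits)  # hash -> most recent position, -1 if none
    out: List[Token] = []
    i, n = 0, len(data)
    while i < n:
        match_len, cand = 0, -1
        if i + MIN_LEN <= n:
            # Hypothetical hash of the next three bytes.
            h = ((data[i] << 8) ^ (data[i + 1] << 4) ^ data[i + 2]) & mask
            cand = table[h]
            table[h] = i
            if cand >= 0:
                # Only this one candidate is compared; no string table is searched.
                while i + match_len < n and data[cand + match_len] == data[i + match_len]:
                    match_len += 1
        if match_len >= MIN_LEN:
            out.append((i - cand, match_len))
            i += match_len
        else:
            out.append(data[i])
            i += 1
    return out

def hash_match_decompress(tokens: List[Token]) -> bytes:
    """Expand literals and pointers; no dictionary has to be rebuilt."""
    buf = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                buf.append(buf[-dist])
        else:
            buf.append(tok)
    return bytes(buf)

tokens = hash_match_compress(b"abcabcabcabc")
print(tokens)                         # [97, 98, 99, (3, 9)]
print(hash_match_decompress(tokens))  # b'abcabcabcabc'
```

A hash collision in this sketch merely produces a short or empty match, in which case a literal is emitted, so correctness does not depend on the quality of the hash. For brevity, positions inside a matched run are not entered into the table.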
A typical data compression method includes two essential algorithms: a matching algorithm and an encoding algorithm. These two essential algorithms are relatively independent of each other. In pursuing a data compression method with a higher compression ratio and higher compression speed, the present invention discloses a new encoding algorithm which utilizes a token stacking technique. This new encoding algorithm, when incorporated with the matching algorithm disclosed by the '881 Patent, can enhance the performance of the data compression process. Moreover, this new encoding algorithm may be combined with the matching algorithm of any other data compression method and enhance its performance by improving the encoding algorithm.