1. Field of the Invention
The invention relates to apparatus for processing data signals, and particularly to apparatus and method for compressing data signals and reconstituting the compressed data signals.
2. Description of the Prior Art
Data compression systems are known in the prior art that encode a stream of digital data signals into compressed digital code signals and decode the compressed digital code signals back into the original data. The objective of data compression systems is to effect a savings in the amount of storage required to hold or the amount of time required to transmit a given body of digital information. The compression ratio is defined as the ratio of the length of the encoded output data to the length of the original input data. The smaller the compression ratio, the greater will be the savings in storage or time. By decreasing the required memory for data storage or the required time for data transmission, compression results in a monetary savings. If tapes or disks are utilized to store data files, then fewer tapes or disks are required for storing compressed files. If telephone lines or satellite links are utilized for transmitting digital information, then lower costs result when the data is compressed before transmission.
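By way of illustration only, the compression ratio defined above may be computed as in the following minimal sketch. The function name compression_ratio and the sample lengths are hypothetical and are chosen solely for the example.

```python
def compression_ratio(encoded_length: int, original_length: int) -> float:
    """Ratio of encoded output length to original input length.

    A smaller value means a greater saving; a value above 1 means expansion.
    """
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    return encoded_length / original_length


# Hypothetical figures: 1000 input symbols encoded into 400 output symbols.
ratio = compression_ratio(400, 1000)
saving = 1 - ratio  # fraction of storage or transmission time saved
```

Under these assumed figures the encoded data occupies four tenths of the original length, so six tenths of the storage or transmission time is saved.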
For example, it may be desired to transmit the contents of a daily newspaper via satellite link to a remote location for printing thereat. Appropriate sensors may convert the contents of the newspaper into a data stream of serially occurring symbol signals for transmission via the communication link. If the millions of symbols comprising the contents of the newspaper were compressed before transmission and reconstituted at the receiver, a significant amount of transmission time would be saved.
As a further example, when an extensive data base such as an airline reservation data base or a banking system data base is stored for archival purposes, a significant amount of storage space would be saved if the totality of symbol signals comprising the data base were compressed prior to storage and reexpanded from the stored compressed files for later use.
To be of practical and general utility a digital data compression system should satisfy four basic criteria. The system should be reversible, universal, asymptotically optimal and linear. In order for a data compression system to possess the property of reversibility it must be possible to reexpand or decode the compressed data back into its original form without any alteration or loss of information. The decoded and the original data must be identical and indistinguishable with respect to each other. The property of reversibility is synonymous with that of strict noiselessness as used in information theory.
For a digital data compression system to be universal one common procedure should be applicable to all data. No fore-knowledge of data source characteristics should be required or assumed and the compression method should be adaptive to any changes in the data source characteristics if they occur. The data must possess the characteristic of redundancy to be compressible.
A digital data compression system is asymptotically optimal if in the limit of increasingly long input data sequences, the system performs as well as or better than any other existing or conceivable data compression system. Special purpose data compression systems are known in the prior art designed for data with specific characteristics. Although a universal or general purpose data compression system will not always perform as well as a special purpose system on a small body of the specific data, a universal system that is asymptotically optimal will perform as well in the long run. As the amount of data to be compressed is increased without limit, the universal asymptotically optimal system should perform as well as any other method including a special purpose method designed for the data in question.
A data compression system is linear when the amount of processing time and the amount of working memory space required for the system increase only proportionately with the amount of data to be processed. The time required to execute the data encoding and decoding procedures should not grow faster than the amount of data to be processed. Thus the property of linearity ensures that the time required to compress large files will not exceed the available time regardless of the device speeds of the data processing components utilized. The working memory space required to execute the procedures also should not grow faster than the data, and it should be possible to limit the memory space to a fixed arbitrary size while continuing to process further input data if desired. Under this last constraint the compression may not be optimal.
Various data compression systems are known in the art, many of the systems utilizing special purpose compression methods designed for compressing special classes of data. Although such systems may be inexpensive to utilize and very effective when applied to the specific type of data for which they were designed, they can fail significantly, even causing expansion, when applied to data types for which they were not designed. In general purpose situations where the type and characteristics of the data to be encountered are not known in advance and may change during processing, such special purpose data compression systems, whatever their merits in special situations, are not practicable. Examples of data processing systems that cannot be utilized for general application because of their lack of the property of universality utilize such methods as run-length encoding, zero-suppression, null-suppression and pattern substitution. One class of commonly utilized data compression methods consists of smoothing and filtering procedures. Smoothing and filtering techniques, which may be considered as special purpose and thus unsuitable in a general context, are also undesirable because of their lack of reversibility. The reexpanded data are only an approximation to the original data, rendering these techniques unsuitable for applications such as banking, intelligence, scientific data, computer programs, command and control words, and the like.
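The expansion that a special purpose method can cause on unsuitable data may be seen in the following minimal sketch of run-length encoding. The function names and the sample strings are hypothetical, and the pair representation is only one of many possible run-length formats.

```python
def run_length_encode(data: str) -> list[tuple[int, str]]:
    """Encode a string as (run length, symbol) pairs."""
    runs: list[tuple[int, str]] = []
    for symbol in data:
        if runs and runs[-1][1] == symbol:
            # Same symbol as the current run: extend the run by one.
            runs[-1] = (runs[-1][0] + 1, symbol)
        else:
            # New symbol: start a new run of length one.
            runs.append((1, symbol))
    return runs


def run_length_decode(runs: list[tuple[int, str]]) -> str:
    """Rebuild the original string from (run length, symbol) pairs."""
    return "".join(symbol * count for count, symbol in runs)


repetitive = "aaaaaaaabbbbbbbb"  # 16 symbols collapse into 2 pairs
varied = "abcdefgh"              # 8 symbols become 8 pairs, i.e. 16 items

repetitive_runs = run_length_encode(repetitive)
varied_runs = run_length_encode(varied)
```

On the repetitive string the output is much shorter than the input, whereas on the varied string every symbol costs a count as well as the symbol itself, so the output is longer than the input.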
The above described special purpose data compression systems suffer from the disadvantage of lacking the property of universality and in some instances of lacking reversibility. The present invention, to be described hereinbelow, provides a system for general use, for example in a computer disk storage system or in a data transmission system, where the type and characteristics of the data to be encountered are not known in advance and may change during processing.
General purpose data compression procedures are also known in the prior art, three relevant procedures being the Huffman method, the Tunstall method and the Lempel-Ziv method. The Huffman method is widely known and used, reference thereto being had in the article of D. A. Huffman entitled "A Method for the Construction of Minimum Redundancy Codes", Proceedings IRE, 40, 10, pages 1098-1100 (September 1952). Reference to the Tunstall algorithm may be had in the Doctoral thesis of B. P. Tunstall entitled "Synthesis of Noiseless Compression Codes", Georgia Institute of Technology (September 1967). Reference may be had to the Lempel-Ziv procedure in a paper authored by J. Ziv and A. Lempel entitled "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, IT-23, 3, pages 337-343 (May 1977). Further reference to the Lempel-Ziv procedure may be found in a paper authored by Messrs. Ziv and Lempel entitled "Compression of Individual Sequences Via Variable-Rate Coding", IEEE Transactions on Information Theory, IT-24, 5, pages 530-537 (September 1978).
The best known and most widely utilized of these general purpose data compression procedures is the Huffman method. The Huffman procedure maps fixed-length segments of symbols into variable-length words. The construction of a Huffman code for a set of n symbols x(1), . . . , x(n) with associated symbol probabilities p(1), . . . , p(n) is effected by constructing a binary tree whose leaves (external nodes) represent the symbols x(1), . . . , x(n). The Huffman procedure involves selecting two leaves x(i), x(j) with lowest associated probabilities and combining these leaves to form a new node, their father, at a higher level in the tree. This new father node then has an associated probability of p(i)+p(j) and branches to x(i) and x(j). Next the two leaves x(i) and x(j) are removed from the set under consideration and the new father node is added to the set. This procedure is repeated until only one node, the root, with associated probability 1 remains. The resulting tree is utilized to assign code words to the symbols of the alphabet in the following manner. For each successive pair of branches emanating from a node, starting at the root, 0 is assigned to one branch and 1 is assigned to the other branch. The code word assigned to symbol x(k) is given by the assignments to the successive branches in the path from the root to the leaf x(k).
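The tree construction and code word assignment described above may be sketched as follows. This is an illustrative sketch only: the function name huffman_code, the use of a priority queue, the tie-breaking counter and the sample probabilities are assumptions made for the example and are not taken from the cited article.

```python
import heapq
from itertools import count


def huffman_code(probabilities: dict[str, float]) -> dict[str, str]:
    """Build code words by repeatedly merging the two least probable nodes."""
    if not probabilities:
        return {}
    if len(probabilities) == 1:
        # A single symbol still needs a one-bit code word.
        return {next(iter(probabilities)): "0"}

    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Each heap entry is (probability, tiebreak, node). A node is either a
    # symbol (leaf) or a (left, right) tuple (father of two nodes).
    heap = [(p, next(tiebreak), symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        p_i, _, node_i = heapq.heappop(heap)  # lowest probability
        p_j, _, node_j = heapq.heappop(heap)  # next lowest probability
        # The father node carries the summed probability p(i) + p(j).
        heapq.heappush(heap, (p_i + p_j, next(tiebreak), (node_i, node_j)))

    _, _, root = heap[0]
    codes: dict[str, str] = {}

    def assign(node, prefix: str) -> None:
        if isinstance(node, tuple):
            assign(node[0], prefix + "0")  # 0 on one branch
            assign(node[1], prefix + "1")  # 1 on the other branch
        else:
            codes[node] = prefix

    assign(root, "")
    return codes


# Hypothetical four-symbol alphabet with assumed probabilities.
codes = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
```

With these assumed probabilities the most probable symbol receives a one-bit code word and the two least probable symbols receive three-bit code words, and no code word is a prefix of another.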
The Huffman data compression procedure suffers from two limitations. Firstly, the Huffman procedure operates under the constraint that the input data to be compressed be parsed into fixed-length segments of symbols. Although the Huffman procedure provides the best compression ratio that can be obtained under this constraint, when the constraint is relaxed it is possible to obtain significantly better compression ratios by utilizing other procedures. The present invention to be described hereinbelow does not operate under this constraint and thus achieves significantly lower compression ratios than does Huffman coding notwithstanding that Huffman coding is widely considered to be optimal. Secondly, Huffman coding requires foreknowledge of the statistical characteristics of the source data. The Huffman procedure operates under the assumption that the probability with which each fixed-length input segment occurs is known. This requirement of the Huffman procedure can, in practice, be obviated by the use of an adaptive version of the procedure which accumulates the necessary statistics during processing of the data. This, however, is cumbersome, requires considerable working memory space and performs suboptimally during adaptation. The present invention to be described hereinbelow does not require any a priori knowledge of the characteristics of the source data.
The Tunstall algorithm, which maps variable-length segments of symbols into fixed-length binary words, is complementary to the Huffman procedure with the fixed-length constraint now applied to the output segments instead of to the input segments. In the Tunstall procedure the output code word length is fixed with the consequence that the number n of code words and the set of code words are known in advance. The objective of the procedure is to make the set of messages (input segments) created by the input parsing as nearly equi-probable as possible. The procedure begins with a basic message set consisting of the, say m, symbols comprising the input alphabet. The most probable message is removed from the set and replaced by m new messages, each of which is the removed message suffixed by one of the m input alphabet symbols. This procedure is continued until the message set contains n messages. The n code words can then be assigned to these n messages in any manner desired.
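The message set construction described above may be sketched as follows. The sketch assumes independent symbol probabilities, so that the probability of an extended message is the product of the probabilities of its symbols, and it stops when a further replacement would exceed n messages, since each replacement enlarges the set by m-1. The function name tunstall_messages and the sample values are hypothetical.

```python
import heapq


def tunstall_messages(probabilities: dict[str, float], n: int) -> list[str]:
    """Grow a message set by repeatedly splitting the most probable message."""
    m = len(probabilities)
    if m < 2:
        raise ValueError("the input alphabet needs at least two symbols")
    if n < m:
        raise ValueError("n must be at least the alphabet size m")

    # heapq is a min-heap, so probabilities are negated to pop the largest.
    heap = [(-p, symbol) for symbol, p in probabilities.items()]
    heapq.heapify(heap)

    # Each split removes one message and adds m, a net growth of m - 1.
    while len(heap) + m - 1 <= n:
        neg_p, message = heapq.heappop(heap)  # most probable message
        for symbol, p in probabilities.items():
            # Assumption: symbols are independent, so probabilities multiply.
            heapq.heappush(heap, (neg_p * p, message + symbol))

    return sorted(message for _, message in heap)


# Hypothetical two-symbol alphabet and n = 4 code words of two bits each.
messages = tunstall_messages({"a": 0.7, "b": 0.3}, 4)
code_words = {msg: format(i, "02b") for i, msg in enumerate(messages)}
```

The assignment of the fixed-length code words to the messages is arbitrary, as noted above; the sketch simply numbers the messages in sorted order.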
The Tunstall procedure constrains the output string thereof to consist of fixed-length binary words. Under this constraint the procedure is asymptotically optimal with respect to other procedures having the same constraint. Since the present invention to be hereinafter described does not have a fixed-length constraint on either the input words or the output words, the present invention generally outperforms the Tunstall procedure. Like the Huffman procedure, the Tunstall procedure requires a foreknowledge, and in the case of the Tunstall procedure a very extensive foreknowledge, of the source data probabilities. Again this foreknowledge requirement can be obviated to some degree by utilizing an adaptive version which accumulates the statistics during processing of the data with the same concomitant disadvantages experienced by the Huffman procedure as discussed above.
Unlike the present invention, neither the Huffman nor the Tunstall codes have the ability to "extend the source", that is, to encode increasingly longer combinations of source symbols. The present invention gradually increases the lengths of words it is encoding, at least until memory is saturated, and thereby is able to compensate for dependencies in the probabilities of occurrence of source symbols. The Huffman and Tunstall codes necessarily treat probabilities of symbol occurrence as though they were independent, and thereby give inferior performance when dependencies exist.
The prior art Lempel-Ziv procedure, which maps variable-length segments of symbols into variable-length binary words, is asymptotically optimal when there are no constraints on the input or output segment lengths. In this procedure the input data string is parsed into adaptively growing segments, each segment consisting of an exact copy of an earlier portion of the input string suffixed by one innovative symbol from the input data. The copy which is to be made is the longest possible and is not constrained to coincide with an earlier parsed segment. The code word which replaces the segment in the output contains information consisting of a pointer to where the earlier copied portion begins, the length of the copy, and the innovative symbol.
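The parsing rule described above may be sketched as follows. This is an illustrative sketch only and not the procedure of the cited papers in every detail: the function names are hypothetical, the search for the longest earlier match is a naive scan over every earlier starting position, and the copy is permitted to run on into the segment being parsed.

```python
def lz_parse(data: str) -> list[tuple[int, int, str]]:
    """Parse data into (pointer, copy length, innovative symbol) triples."""
    triples: list[tuple[int, int, str]] = []
    pos = 0
    while pos < len(data):
        best_start, best_len = 0, 0
        # Naive search: try every earlier starting position (not linear time).
        for start in range(pos):
            length = 0
            # Stop one symbol short of the end so that a symbol always
            # remains to serve as the innovative symbol.
            while (pos + length < len(data) - 1
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        triples.append((best_start, best_len, data[pos + best_len]))
        pos += best_len + 1
    return triples


def lz_unparse(triples: list[tuple[int, int, str]]) -> str:
    """Rebuild the original data from the triples."""
    out: list[str] = []
    for start, length, symbol in triples:
        for k in range(length):
            # Copy one symbol at a time so that a copy which overlaps the
            # segment being rebuilt still works.
            out.append(out[start + k])
        out.append(symbol)
    return "".join(out)


# Hypothetical input string.
triples = lz_parse("abababc")
restored = lz_unparse(triples)
```

For the sample input the third segment copies four symbols starting at the beginning of the string and appends the innovative symbol "c".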
One of the problems with the prior art Lempel-Ziv algorithm was that it did not exhibit the property of linearity since the required memory space grew at a non-linear rate with respect to the input data. Additionally, no means were available for locating the longest earlier match in linear time (time proportional to the length of the new segment). The encoding of the joint information of the pointer to the earlier match, the length of the earlier match and the innovative symbol was inefficient, compounding the impracticability of the algorithm.
The prior art Lempel-Ziv algorithm included a modification wherein the copy which is to be made is the longest possible and is constrained to coincide with an earlier parsed segment. The prior art Lempel-Ziv algorithm, including the modification, remains essentially an unrealized concept for data compression. The prior art algorithm, including the modification, has not been utilized since heretofore there has not been available any practical implementation of the algorithm.
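The modified parsing rule, in which the copy must coincide with an earlier parsed segment, may be sketched as follows. This sketch illustrates the parsing rule only and is not the implementation of the present invention. The function names, the use of a dictionary keyed by segment, the convention that index 0 denotes the empty segment, and the handling of an input that ends in the middle of a known segment are all assumptions made for the example.

```python
def lz78_parse(data: str) -> list[tuple[int, str]]:
    """Parse data into (earlier segment index, innovative symbol) pairs."""
    dictionary = {"": 0}  # index 0 stands for the empty segment
    pairs: list[tuple[int, str]] = []
    current = ""
    for symbol in data:
        if current + symbol in dictionary:
            # Still matching an earlier parsed segment: keep extending.
            current += symbol
        else:
            # Longest earlier segment found; symbol is the innovative symbol.
            pairs.append((dictionary[current], symbol))
            dictionary[current + symbol] = len(dictionary)
            current = ""
    if current:
        # Input ended inside a known segment: emit its prefix and last symbol.
        pairs.append((dictionary[current[:-1]], current[-1]))
    return pairs


def lz78_unparse(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the original data from the pairs."""
    segments = [""]  # segments[0] is the empty segment
    out: list[str] = []
    for index, symbol in pairs:
        segment = segments[index] + symbol
        segments.append(segment)
        out.append(segment)
    return "".join(out)


# Hypothetical input string.
pairs = lz78_parse("abababab")
restored = lz78_unparse(pairs)
```

Because each code word names an earlier parsed segment by its index, no separate copy length need be transmitted, in contrast to the pointer, length and innovative symbol required by the unmodified procedure.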
The present invention to be hereinafter described provides practical implementations of the modified Lempel-Ziv algorithm to provide data compression apparatus and method that is asymptotically optimal without any word length constraints with respect to either the input or output data. The present invention also possesses the properties of linearity, reversibility and universality. It compresses data as well as or better than any heretofore known data compression apparatus or procedure and is faster than prior art procedures.