The present invention relates to data compression techniques, and more particularly to methods and apparatus for deriving a dictionary for use by a data compression system.
Lossless data compression attempts to map the set of strings from a given source into a set of predefined binary code strings so that each source string can be exactly reconstructed from its corresponding code string and some criterion about this mapping is optimized. One approach to studying lossless data compression algorithms assumes that there is a statistical model underlying the generation of source symbols. In this case, the most common primary objective in designing source codes is to minimize the average number of code symbols produced per source symbol. C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana (1949), demonstrated that the expected number of code symbols per source symbol achieved by any lossless data compression technique is bounded from below by the binary entropy of the source. The redundancy of a source code is the amount by which the average number of code symbols per source symbol for that code exceeds the entropy of the source.
Many families of lossless data compression techniques have been devised and investigated. A variable-to-fixed length coder, for example, can be decomposed into a parser and a string encoder. The parser segments the source output into a concatenation of variable-length strings. Each parsed string, with the possible exception of the last one, belongs to a predefined dictionary with M entries; the final parsed string is a non-null prefix of a dictionary entry. The string encoder maps each dictionary entry into a fixed-length codeword.
Variable-to-fixed length codes are considered to be particularly well-suited to compress data with a lot of predictability because dictionaries can be chosen so that there are long entries corresponding to frequently occurring strings. For a discussion of variable-to-fixed length codes and their suitability for compression of predictable data, see, for example, S. A. Savari, Variable-To-Fixed Length Codes for Predictable Sources, Proc. DCC '98, Snowbird, Utah (April 1998) and S. A. Savari, Predictable Sources and Renewal Theory, Proc. ISIT '98, Cambridge, Mass. (August 1998), each incorporated by reference herein.
Run-length codes were the first variable-length codes to be investigated, and these codes have long been recognized to be effective for binary, memoryless sources with small entropies. Tunstall considered the problem of generating an optimal variable-to-fixed length code for any discrete, memoryless source. B. P. Tunstall, Synthesis of Noiseless Compression Codes, Ph.D. Dissertation, Georgia Inst. Technology, Atlanta, Ga. (1967). Generalizations of the Tunstall code to sources with memory have been proposed, for example, in T. J. Tjalkens and F. M. J. Willems, Variable-To-Fixed Length Codes for Markov Sources, I.E.E.E. Trans. Information Theory, IT-33, 246-57 (1987), S. A. Savari and R. G. Gallager, Generalized Tunstall Codes for Sources with Memory, I.E.E.E. Trans. Information Theory, IT-43, 658-68 (1997) and S. A. Savari, Variable-To-Fixed Length Codes and the Conservation of Entropy, Proc. ISIT '95, Whistler, Canada (1995). The Lempel-Ziv codes are universal variable-to-fixed length codes that have become virtually standard in practical lossless data compression.
Run-length codes and the generalized Tunstall codes for sources with memory use dictionaries that have the property that any source sequence has a unique prefix in the dictionary. Under this assumption, Tunstall codes are the optimal variable-to-fixed length codes for discrete memoryless sources. A dictionary is said to be uniquely parsable if every source string, even those of zero probability, can be uniquely parsed into a concatenation of dictionary entries with a final string that is a non-null prefix of a dictionary entry. For example, consider a ternary source having the alphabet {0, 1, 2}. If the dictionary is {00, 1, 2}, the dictionary is not uniquely parsable because any source string beginning with the letters 0 1 cannot be parsed. The dictionary {00, 01, 02, 1, 2}, is uniquely parsable. However, adding the string 000 to the previous dictionary results in a new dictionary that is not uniquely parsable because the string 0 0 0 1 can either be segmented as (000)(1) or as (00)(01). This dictionary is said to be plurally parsable. It is noted that for a plurally parsable dictionary, a parsing rule must be specified to avoid ambiguities during the segmentation of the source string.
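The three example dictionaries above can be checked mechanically. The following is a minimal sketch, not part of the disclosed method: the function name `count_parsings` is hypothetical, and it counts the ways a finite string can be segmented into dictionary entries, allowing only the final segment to be a non-null proper prefix of an entry.

```python
from functools import lru_cache


def count_parsings(source, dictionary):
    """Count segmentations of `source` into dictionary entries.

    Only the last segment may be a non-null proper prefix of an entry
    instead of a full entry (hypothetical helper for illustration).
    """
    entries = set(dictionary)
    # Every non-null proper prefix of every entry.
    prefixes = {e[:k] for e in entries for k in range(1, len(e))}

    @lru_cache(maxsize=None)
    def count(i):
        if i == len(source):
            return 1  # nothing left to parse: one valid way
        total = 0
        for j in range(i + 1, len(source) + 1):
            piece = source[i:j]
            if piece in entries:
                total += count(j)
            elif j == len(source) and piece in prefixes:
                total += 1  # incomplete final segment
        return total

    return count(0)


if __name__ == "__main__":
    print(count_parsings("01", ["00", "1", "2"]))
    print(count_parsings("0001", ["00", "01", "02", "1", "2"]))
    print(count_parsings("0001", ["00", "01", "02", "1", "2", "000"]))
```

A count of zero means the string cannot be parsed, a count of one means the parse is unique, and a count above one shows that the dictionary is plurally parsable for that string.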
For any real number x, let ⌈x⌉ denote the smallest integer greater than or equal to x. As the length of the encoded source string increases, the number of code letters per source letter corresponding to a dictionary with M words and expected number L_M of source letters per dictionary string approaches ⌈log_2 M⌉/L_M with probability one. Thus, an optimal dictionary with M entries maximizes the average number L_M of source letters per dictionary string.
Tunstall found a very simple algorithm to construct the optimal uniquely parsable dictionary with M entries. It is often convenient to picture the entries of a dictionary as the leaves of a rooted tree in which the root node corresponds to the null string, each edge is a source alphabet symbol, and each dictionary entry corresponds to the path from the root to a leaf. The tree corresponding to a uniquely parsable dictionary is complete in the sense that every intermediate node in the tree has a full set of edges coming out of it. For a dictionary with M entries, L_M can be interpreted as the expected length of the dictionary tree. It is known that for a uniquely parsable dictionary, L_M is the sum of the probabilities associated with each intermediate node in the tree, including the root. Therefore, an optimal uniquely parsable dictionary will correspond to a set of intermediate nodes with maximal probabilities. Unique parsability implies that for a discrete, memoryless source with an alphabet of size K, M = α(K−1)+1 for some integer α. Here, α is the number of intermediate nodes in the dictionary tree, including the root.
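As a quick check of the node-sum identity, consider the uniquely parsable binary dictionary {00, 01, 1} under an assumed symbol probability p_0 = 0.9 (an illustrative value, not one taken from the text). The expected entry length computed over the leaves agrees with the sum over the two intermediate nodes, the root and the node 0:

```latex
\begin{align*}
L_M &= 2\,p_0^2 + 2\,p_0 p_1 + 1\cdot p_1
     = 2(0.81) + 2(0.09) + 0.10 = 1.90, \\
L_M &= \underbrace{1}_{\text{root}} + \underbrace{p_0}_{\text{node } 0}
     = 1 + 0.9 = 1.90, \\
M   &= \alpha(K-1) + 1 = 2(2-1) + 1 = 3 .
\end{align*}
```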
The Tunstall algorithm given below finds the optimal uniquely parsable dictionary for a discrete, memoryless source with an alphabet of size K:
1. Start with each source symbol as a dictionary entry.
2. If the total number of entries is less than M, then go to step 3, else stop.
3. Take the most probable entry σ and replace it with the K strings that are single letter extensions of σ. Do not alter the other entries. Go to step 2.
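The three steps above can be sketched in Python as follows. This is an illustrative sketch only: the function names `tunstall` and `expected_length` are hypothetical, and the loop stops as soon as one more expansion would push the dictionary past M entries, which is an assumption made here for values of M that are not of the form α(K−1)+1.

```python
import heapq
from math import ceil, log2


def tunstall(probs, M):
    """Build a Tunstall dictionary for a memoryless source.

    probs: mapping from one-character symbol to its probability.
    Returns a mapping from dictionary entry to its probability.
    """
    K = len(probs)
    # Step 1: every source symbol is an entry (max-heap via negated probability).
    heap = [(-p, s) for s, p in probs.items()]
    heapq.heapify(heap)
    # Step 2: each expansion adds K - 1 entries; stop before exceeding M (assumption).
    while len(heap) + K - 1 <= M:
        # Step 3: replace the most probable entry by its K one-letter extensions.
        neg_p, entry = heapq.heappop(heap)
        for s, p in probs.items():
            heapq.heappush(heap, (neg_p * p, entry + s))
    return {entry: -neg_p for neg_p, entry in heap}


def expected_length(dictionary):
    """Expected number of source letters per dictionary entry (L_M)."""
    return sum(len(e) * p for e, p in dictionary.items())


if __name__ == "__main__":
    d = tunstall({"0": 0.9, "1": 0.1}, 4)
    L = expected_length(d)
    print(sorted(d), L, ceil(log2(len(d))) / L)
```

For the assumed binary source with p_0 = 0.9 and M = 4, the sketch expands the entries 0 and then 00, so every entry except the all-zero one ends with a one.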
In practice, many variable length codes such as the Lempel-Ziv-Welch code use dictionaries that are plurally parsable. In other words, each source sequence can be segmented into a concatenation of dictionary entries in at least one way, and there exist source sequences that can be parsed into a concatenation of dictionary entries in two or more ways. At a parsing point, the most common rule for designating the next parsed phrase from a plurally parsable dictionary is to select the longest dictionary entry that is a prefix of the unparsed source output.
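The longest-match rule described above can be sketched as a greedy parser. This is a minimal illustration, not the claimed apparatus: the name `greedy_parse` is hypothetical, and raising an error for unparsable input is a design choice made here.

```python
def greedy_parse(source, dictionary):
    """Segment `source` by repeatedly taking the longest matching entry.

    If no entry matches at the end of the input, the remaining tail is
    emitted as a final segment provided it is a prefix of some entry.
    """
    entries = set(dictionary)
    max_len = max(len(e) for e in entries)
    phrases = []
    i = 0
    while i < len(source):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(source) - i), 0, -1):
            if source[i:i + n] in entries:
                phrases.append(source[i:i + n])
                i += n
                break
        else:
            tail = source[i:]
            if any(e.startswith(tail) for e in entries):
                phrases.append(tail)  # incomplete final phrase
                break
            raise ValueError(f"cannot parse input at position {i}")
    return phrases


if __name__ == "__main__":
    plural = ["00", "01", "02", "1", "2", "000"]
    print(greedy_parse("0001", plural))
```

With the plurally parsable dictionary {00, 01, 02, 1, 2, 000}, the string 0 0 0 1 is segmented as (000)(1) rather than (00)(01), because the longest matching prefix is taken at each parsing point.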
A need exists for a plurally parsable dictionary that yields a significantly larger average length of a parsed string than that of the Tunstall dictionary of the same size.
Generally, a combinatorial approach is disclosed for analyzing a class of plurally parsable dictionaries for predictable, discrete, memoryless sources. A class of plurally parsable dictionaries is disclosed for a binary, memoryless source that outperforms the Tunstall code when the probability of one of the two symbols is sufficiently close to one. For sources with a very small probability, p1, of a symbol having a value equal to one, the Tunstall dictionary is inefficient, since all but one of the Tunstall dictionary entries end with a one and each such entry is rarely used.
According to a feature of the invention, a dictionary derivation process derives a dictionary that provides better compression than the Tunstall dictionary, if such a better dictionary exists, given the probability of a symbol having a value equal to zero, p0, and the desired size of the dictionary, M. Generally, the dictionary derivation process selects a Tunstall dictionary having a size M−n, where n≥1. Thereafter, n all-zero entries are added to the Tunstall dictionary. For the case where n equals one, the entry is a string of l zeroes. The value of l is obtained from the following relation, where l = i × (M−2):
$$r(i-1) \le p_0^{M-2} \le r(i),$$
where r(k) is the positive real root of the equation:
$$\sum_{j=0}^{k} x^{j} = k$$
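One way to evaluate this selection rule numerically is sketched below. It rests on several assumptions that are not stated in the text: the bound is read as r(i−1) ≤ p0^(M−2) ≤ r(i), the search starts at i = 2 (for k = 1 the equation gives the root 0, and i = 1 would duplicate the all-zero entry of the Tunstall dictionary), M is at least 3, and r(k) is found by bisection on the interval [0, 1]. The function names are hypothetical.

```python
def r(k, tol=1e-12):
    """Root in [0, 1] of sum_{j=0}^{k} x**j = k, found by bisection (k >= 1)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    lo, hi = 0.0, 1.0
    # The left-hand side increases with x: it is 1 at x = 0 and k + 1 at x = 1.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(mid ** j for j in range(k + 1)) < k:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def all_zero_entry_length(p0, M, max_i=10_000):
    """Return l = i * (M - 2) for the first i >= 2 with r(i-1) <= p0**(M-2) <= r(i)."""
    if M < 3 or not 0.0 < p0 < 1.0:
        raise ValueError("sketch assumes M >= 3 and 0 < p0 < 1")
    q = p0 ** (M - 2)
    for i in range(2, max_i + 1):
        if r(i - 1) <= q <= r(i):
            return i * (M - 2)
    raise ValueError("no i found below max_i")


if __name__ == "__main__":
    print(r(2))                           # positive root of 1 + x + x**2 = 2
    print(all_zero_entry_length(0.7, 3))  # length of the added all-zero entry
```

Under these assumptions r(2) is the positive root of x² + x − 1 = 0, that is (√5 − 1)/2, and the returned length grows as p0 approaches one.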