With the continued growth in demand for data transmission and data storage capacities, improved lossless data compression techniques are continually sought. As described in coassigned U.S. Pat. No. 5,652,878, of the many classes of lossless data compression, one of the most useful is the class of dictionary-based compression techniques. Among these, the most useful today are the so-called Ziv-Lempel variable-length encoding procedures ascribed to J. Ziv and A. Lempel, who suggested the "LZ1" length-offset encoding scheme. The LZ1 process uses a fixed-size sliding "history" window into the past source data string as the dictionary. Matches are encoded as a "match length" and an "offset" from an agreed position.
Because LZ1 scrolls the source string over a fixed-size sliding history window to create an adaptive dictionary, identification of duplicate "matching" strings in the source data is difficult at first, but becomes very efficient. Once a matching string is encoded as a "length" and an "offset", the necessary decoding process is rapid and efficient, and requires no dictionary preload. The '878 patent illustrates an LZ1 compression technique which has been denominated the "adaptive lossless data compression" technique, or "ALDC".
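The length and offset encoding described above may be illustrated by a minimal sketch. It is not the ALDC bit format of the '878 patent. The function names, the tuple tokens, the window size, the minimum match length, and the convention that the offset is counted backward from the current position are all assumptions made for illustration only.

```python
def lz1_encode(data: bytes, window: int = 512, min_match: int = 2):
    """Greedy sliding-window encoder (illustrative, not the ALDC format).

    Emits ("lit", byte) for an unmatched symbol and ("copy", offset, length)
    for a match, with the offset counted backward from the current position.
    """
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        # Only the last `window` symbols serve as the dictionary.
        for j in range(max(0, i - window), i):
            k = 0
            # A match may run past position i and overlap itself.
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz1_decode(tokens) -> bytes:
    """Rebuild the source from the tokens; no dictionary preload is needed."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])  # copy from the already decoded history
    return bytes(out)
```

In this sketch the decoder needs only its own previously decoded output, which is why decoding is simple, while the encoder must search the window for each position.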
All sliding window data compression processes suffer from what may be called "start-up losses" and "non-redundancy losses" in compression efficiency. Because each source string or block begins with an empty "dictionary", the first source symbols must be transmitted as raw words without compression. Similarly, a string of input data which has already been encrypted or compressed, and which therefore lacks substantial redundancy, lacks the matches required for compression, and its source symbols must also be transmitted as raw words without compression. In ALDC, the raw words must be identified as such by adding a bit, the resultant characters being called "literals", which leads to an expansion of the data.
Only after accumulating a substantial dictionary, by having the sliding window fill up with input data having substantial redundancy, are matches found for increasing numbers of substrings which allow encoding efficiency to build up.
In the original LZ1 arrangement, called "LZ77", all source input is output in the form of a three-part token having the length and offset together with a flag, which is the first character of the compressed substring. Techniques such as ALDC overcome the problem when a non-redundant character is encountered by not sending the three-part token, but instead providing the character unchanged, together with a designation to indicate that it is not compressed. The unchanged raw character together with the designation is called a "literal". A typical designation is an added "zero" bit for each word of the source string. Thus, when a string of non-redundant input data is encountered, the data is expanded by a much smaller amount than is likely with the original LZ1 technique. However, LZ1 techniques such as ALDC must still expand the data by one bit for every word, typically a 9/8 expansion, to output the words as literals.
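The 9/8 figure follows directly from adding one designation bit to each 8-bit word. A short sketch of the arithmetic is given below; the function names and the default word and flag widths are assumptions chosen to match the typical case described above.

```python
from fractions import Fraction


def literal_expansion(word_bits: int = 8, flag_bits: int = 1) -> Fraction:
    """Ratio of output size to input size when every word is a literal."""
    return Fraction(word_bits + flag_bits, word_bits)


def all_literal_output_bits(n_words: int, word_bits: int = 8,
                            flag_bits: int = 1) -> int:
    """Total output bits when n_words are all emitted as literals."""
    return n_words * (word_bits + flag_bits)
```

For example, under these assumptions 1024 non-redundant bytes (8192 bits) would be output as 9216 bits.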
Because of this problem, alternative dictionary based compression techniques have been designed to offer special advantages in particular circumstances. An example is LZ2 compression (also known as LZ78 or the related version known as LZW) which captures redundancies and maintains them in a dictionary for, e.g., an entire record, as described in "The Data Compression Book", M. Nelson, M & T Publishing, 1991, pp. 277-311. Thus, the opportunity for having redundancies is expanded, albeit at the cost of an expanded dictionary buffer. In LZ2, the expansion for literals may be more than one bit.
Another alternative is to not compress the data where expansion is a significant risk, which may be called "passthrough" mode.
In the situation where a string of non-redundant input data is encountered, it would be useful to switch to a second compression technique which may handle the strings of non-redundant input data more efficiently than the 9/8 expansion required to output them as literals.
Multibit control codes may be provided in the output data to indicate a special situation in data handling techniques, and such a special situation may include switching between compression modes. If such a control code is used, it will degrade the efficiency of the compression by the length of the control code character.
The determination that it would be advantageous to make the switch is difficult. Coassigned U.S. Pat. No. 5,561,824 applies a total length data record concurrently to a compressor and to a buffer. If the compressed record is larger than the uncompressed record, the uncompressed record, and the entire following string of records, are selected for recording. The use of such a coarse technique requires large buffering and lacks efficiency if the input data has any intermix of non-redundant input data and redundant data.
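The record-level selection described above may be sketched roughly as follows. This is only an illustration of the general idea, not the method of the '824 patent. The function name is hypothetical, and zlib (DEFLATE) is used purely as a stand-in compressor because it is readily available; it is not ALDC.

```python
import zlib


def select_records(records):
    """Pick compressed or raw form per record (illustrative sketch).

    Once one record expands under compression, that record and every
    following record are passed through uncompressed.
    """
    selected = []
    passthrough = False
    for rec in records:
        if not passthrough:
            # The whole record must be held until its compressed size is known.
            packed = zlib.compress(rec)
            if len(packed) > len(rec):
                passthrough = True
        if passthrough:
            selected.append(("raw", rec))
        else:
            selected.append(("compressed", packed))
    return selected
```

With a redundant record, then a non-redundant record, then another redundant record, this sketch stores the third record raw even though it would have compressed well, which illustrates the inefficiency noted above for intermixed data.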
An alternative approach may be to examine the compressed data for a predetermined length of data and, if no compression has occurred, for example because a long string of literals has been output, to then switch compression techniques. The difficulty with such an approach is that it would be very easy to get out of step with the input data and to employ each technique at the wrong time.
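The lag inherent in such an approach can be seen in a minimal sketch. The function name, the representation of the output as a sequence of per-token literal flags, and the threshold value are assumptions made for illustration.

```python
def first_switch_index(is_literal, threshold: int = 8):
    """Return the index of the token that completes the first run of
    `threshold` consecutive literals, or None if no such run occurs."""
    run = 0
    for idx, lit in enumerate(is_literal):
        run = run + 1 if lit else 0  # any match token resets the run
        if run >= threshold:
            return idx
    return None
```

By the time the function reports a switch point, `threshold` literals have already been output in expanded form, and nothing guarantees that the data following the switch point is still non-redundant.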