This invention relates to data compression, and more particularly, to a system and a method for compressing a data sequence using a variant of a Lempel-Ziv technique.
Data compression techniques are used to reduce the amount of data to be stored or transmitted, thereby reducing the required storage capacity or the transmission time, respectively. In either case it is necessary to provide a corresponding decompression technique to enable the original data to be reconstructed.
Many data compression and decompression techniques are known, with the Lempel-Ziv (LZ) technique and its variants proving to be very popular. U.S. Pat. No. 4,558,302, Welch, entitled “High Speed Data Compression and Decompression Apparatus and Method;” U.S. Pat. No. 4,701,745, Waterworth, entitled “Data Compression System;” and U.S. Pat. No. 4,814,746, Miller et al., entitled “Data Compression Method” are patents that disclose some of these LZ techniques. One of the LZ variants is known as the LZ Oberhumer (LZO) technique. FIG. 1 shows an output codestream obtained by compressing an input character sequence using the LZO technique. The output codestream includes codewords interspersed with non-matchable sequences of characters from the input character sequence. Each codeword references a sequence of characters that has previously appeared during decompression of the output codestream, allowing the original input character sequence to be rebuilt from the codestream.
There is room for improvement in the LZO technique. Because the codewords are interspersed among the non-matchable character sequences, the codewords need to be byte-aligned. If the codewords are of a variable length, delimiters are also required. The need for byte-alignment of the codewords and for the delimiters results in extraneous bits being included in the LZO output codestream. If the codewords are of a fixed length, the delimiters are not necessary. However, such fixed-length codewords can only represent character-sequence lengths up to the maximum value representable by the data items in the codewords.
According to an aspect of the present invention, there is provided a method for compressing an input sequence of data portions. The input sequence is sequentially traversed, portion by portion, to determine if a first sequence of portions starting with each portion is matchable with a second sequence of previously traversed portions. If a first sequence is matchable with a second sequence, a length of the first sequence and an offset between the first sequence and the matching second sequence are noted. A stream of at least one non-matchable sequence of data portions in the input sequence is recorded in sequential order. An ordered sequence of codewords separate from the stream is generated. Each codeword includes three data items denoting a length of a non-matchable sequence preceding a matchable first sequence, the offset associated therewith and the length of the matchable first sequence.
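By way of illustration only, the method summarized above may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the names `compress`, `decompress`, `find_longest_match` and `MIN_MATCH` are hypothetical, data portions are taken to be bytes, the match search is a brute-force scan restricted to already-traversed bytes, a minimum match length of three is assumed, and a trailing non-matchable run is flagged by a codeword with a zero offset and zero match length.

```python
from typing import List, Tuple

# (non-matchable run length, offset, match length)
Codeword = Tuple[int, int, int]

MIN_MATCH = 3  # assumed threshold; not specified in the text above


def find_longest_match(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (offset, length) of the longest match for data[pos:] found
    entirely within the previously traversed bytes data[:pos]."""
    best_offset, best_length = 0, 0
    for start in range(pos):
        length = 0
        while (pos + length < len(data)
               and start + length < pos  # stay inside traversed data
               and data[start + length] == data[pos + length]):
            length += 1
        if length > best_length:
            best_offset, best_length = pos - start, length
    return best_offset, best_length


def compress(data: bytes) -> Tuple[bytes, List[Codeword]]:
    """Produce a literal stream and a separate ordered list of codewords."""
    literals = bytearray()
    codewords: List[Codeword] = []
    pos = 0
    run = 0  # length of the current non-matchable run
    while pos < len(data):
        offset, length = find_longest_match(data, pos)
        if length >= MIN_MATCH:
            # One codeword: preceding literal run, offset, match length.
            codewords.append((run, offset, length))
            run = 0
            pos += length
        else:
            literals.append(data[pos])
            run += 1
            pos += 1
    if run:
        # Assumed convention: trailing literals use offset 0, length 0.
        codewords.append((run, 0, 0))
    return bytes(literals), codewords


def decompress(literals: bytes, codewords: List[Codeword]) -> bytes:
    """Rebuild the input from the literal stream and the codewords."""
    out = bytearray()
    lit_pos = 0
    for run, offset, length in codewords:
        out += literals[lit_pos:lit_pos + run]
        lit_pos += run
        start = len(out) - offset
        for i in range(length):
            out.append(out[start + i])
    return bytes(out)
```

For the input `b"abcabcabcxyz"`, this sketch yields the literal stream `b"abcxyz"` and the codewords `[(3, 3, 3), (0, 6, 3), (3, 0, 0)]`. Because the literal stream and the codewords are kept apart, the codewords need not be byte-aligned against interleaved literals; how each stream is then packed into bits is outside the scope of the sketch.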
According to another aspect of the present invention, there is provided a compressing system having means for compressing an input sequence of data portions as described above.
According to another aspect of the present invention, there is provided a program storage device readable by a computing device, tangibly embodying a program of instructions, executable by the computing device to perform the above method for compressing an input sequence of data portions.