The present invention relates to a method and apparatus for reversible compression of digital data.
For lossless (reversible) data compression methods, there are methods adopting LZ77/LZ78 invented by Ziv and Lempel as basics.
According to LZ77, a window buffer storing previous input data is looked up to find a longest-match string with current input data, and the position and length of the longest-match string found is outputted as coded data, thereby realizing data compression. This method is disclosed in U.S. Pat. No. 4,054,951 (Jackson, et al.) and xe2x80x9cA universal algorithm for sequential data compressionxe2x80x9d by Ziv, J. and Lempel, A., IEEE Transaction on Information Theory, Vol. 23, No. 3, pp. 337-343, May 1977.
Furthermore, according to LZ78, a dictionary generated based on previous input data is looked up to find a longest-match string with current input data, and a code stored in correspondence with the data string found is outputted. Furthermore, a new data string, generated by linking the longest-match string found with the next character, is additionally registered in the dictionary to facilitate a longer match in the next search. This method is disclosed in U.S. Pat. Nos. 4,464,650, 4,558,302, and xe2x80x9cCompression of individual sequences via variable-rate codingxe2x80x9d by Ziv, J. and Lempel, A., IEEE Transaction on Information Theory, Vol. 24, No. 5, pp. 530-536, May 1978.
A 2-32 KB buffer is appropriate for the window buffer employed in LZ77. Since input data is compared with data stored in the window buffer, the range of the longest-match search is limited to 32 KB of the previous input data at most. On the contrary, in LZ78, the range of search is not limited to the nearest data as in LZ77. The range of the longest-match search in LZ78 can be expanded as far back the previous input data as the size of the dictionary, regardless of the size of the window buffer. Therefore, while LZ77 makes use of a near correlation, LZ78 makes use of a far correlation, thus has versatility.
Moreover, in LZ78, compression is realized by mere comparison between the current input data and dictionary. The construction of the dictionary can be made so as to better be adapted to the data search. Accordingly, the longest-match search can be performed at high speed.
However, in LZ78, data decompression must be performed while generating and updating the dictionary. On the contrary, in LZ77, since the position and offset of the data string in the window buffer are provided as code data, decompression can be performed without generating a dictionary. Therefore, decompression can be performed at higher speed than LZ78.
In other words, LZ78 has versatility with respect to data and is capable of high-speed data compression. LZ77 has poor data versatility, but is capable of high-speed decompression processing.
When compression is performed using LZ77, it is necessary to look up the window buffer storing previous input data to find a longest match with current input data. In order to increase the speed of this processing, there are hash search methods disclosed in U.S. Pat. No. 4,701,745 (J. R. Waterworth), U.S. Pat. No. 5,049,881 (D. K. Gibson), U.S. Pat. No. 5,051,745 (P. W. Katz), and RFC-1951 (xe2x80x9cDeflate Compressed Data Format Specification version 1.3xe2x80x9d by P. Deutsch). The hash search proposed by these known documents is described below.
FIG. 1 is an explanatory view of the hash search. Reference numeral 10 denotes a window buffer. The area on the left hand of the line P stores already-compressed previous input data before compression. The area on the right hand of the line P stores input data subjected to compression. Assume that the size of the left area of the line P is 32 KB. An offset, indicative of a position in the left area of the line P, increments as it goes toward the left, with the line P as an origin.
Reference numeral 11 denotes a hash array H[i], which stores an offset of the window buffer, and the number of elements is 215=32768 entries. The length of the offset is 2 bytes.
FIG. 3 shows steps of compression processing. Description is provided according to this flowchart.
In step 301, an initial value (head address of input data) is given to a pointer C indicative of a current input data string. In step 302, H[i] is initialized to 0. Since an offset being 0 is improbable, this indicates that no data is stored in the offset. In step 303, it is determined whether or not there is more input data to be compressed. If not, the control ends. If yes, the control proceeds to step 304.
Reference numeral 12 in FIG. 1 denotes a character string of current input data. Provided that the first three characters are expressed by an array C[0] to C[2], the hash value h is calculated by the method shown in FIG. 2 (step 304 in FIG. 3).
Note in FIG. 2, the reference letter {circumflex over ( )} indicates an exclusive OR. Other reference letters comply with the C language. xe2x80x9cx less than  less than yxe2x80x9d indicates that x is shifted by y bit in the direction of higher bits. xe2x80x9cxandyxe2x80x9d indicates to AND x and y in units of bit. After calculating the hash value h, H[h] is compared with 0 in step 305. If H[h] is 0, it indicates that a three-character string having the hash value h has not yet occurred. Then, in step 306, a current offset of the current input data (offset of the first character) is stored in H[h]. Next in step 307, data C[0] having 1 byte is outputted. In step 308, the pointer C is incremented by 1 to enable processing of the next input data, and the control returns to step 303.
If H[h] is not 0 in step 305, it indicates that a three-character string having the same value as the calculated hash value h has occurred in the previous input data. The position m of the H[h] where the character string is located is extracted (step 309). Then in step 310, the current input data is compared with the previous input data located in the position m to obtain a longest-match length L. In step 311, m and L are subjected to Huffman coding, and the coded data is outputted. After obtaining the longest-match length L, the pointer C is incremented by L in step 312, and the control returns to step 303.
In the compression processing of LZ77, the processing speed can be increased by employing the above-described hash search. However, the comparison between the current input data and previous input data is not expanded to the data inputted far back in the past beyond the window buffer, as in LZ78. Therefore, for instance, with regard to data repeated in a cycle of 32 KB that is the size of the window buffer, there is no effect of compression. Meanwhile according to the compression method of LZ78, since generating and updating the dictionary is necessary at the time of decompression, an overhead is generated. Thus, the decompression processing speed is slower than LZ77.
To increase the speed of decompression processing, it is preferable to employ code data, indicative of the offset and length, to extract a corresponding data string from the window buffer at the time of decompression. However, coding the offset and length of the data limits the target of longest-match search to the previous input data stored in the window buffer in compression processing. Therefore, it is difficult to have both ways: increasing decompression speed, and expanding the data search target in compression.
The present invention has been proposed in view of the conventional problems, and has as its object to provide a data compression method, apparatus, computer program, and storage medium, which can realize lossless data compression at high speed while taking advantage of high-speed decompression.
According to the present invention, the foregoing object is attained by providing a data compression method of reading input data from a predetermined input storage area, searching previous input data that matches the input data, generating coded data based on code generation information corresponding to the previous input data, and outputting the coded data to a predetermined output storage area, comprising the steps of; reserving a dictionary for storing input data related information related to the previous input data, and code generation information of the input data in association with each other, thedictionary comprising an element area which includes link information for linking the element area to another element area; searching the dictionary reserved; registering input data related information, related to the previous input data, and code generation information of the input data in the element area; and tracking the element area according to the link information in the dictionary searching step to search input data related information in the dictionary, which corresponds to the input data related information related to current input data, then obtaining code generation information related to the input data related information, and generating coded data based on the code generation information.
Other features and advantages of the present invention will be apparent from the following description taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar parts throughout the figures thereof.