1. Field of the Invention
The present invention relates generally to a data compression system and a data compression device which can restore compressed data without distortion. More specifically, the invention relates to a data compression system and a data compression device which can improve coding speed by registering character strings in a retrieving means used to search a historical data array.
2. Description of the Related Art
As a data compression method for storing large data in a storage means at a reduced size, the text-substitution type distortionless data compression system known as the Ziv and Lempel system has been widely used since the late 1980s. In particular, the data compression algorithm referred to as the LZ77 type is frequently employed in file compression tools for its high compression rate.
The LZ77 type data compression method is characterized by a historical data array which stores input data for which coding has been completed. The historical data array is frequently used as a circular (ring) buffer to facilitate updating. Namely, a region of predetermined length, called a lookahead region, is provided in the historical data array to store the data to be compressed. Then, as compression progresses, the lookahead region is shifted around the ring.
In the LZ77 type data compression method, the data to be compressed (the data in the lookahead region) is first compared with character strings beginning outside the lookahead region in the historical data array, namely character strings beginning from entries storing input data for which compression has already been completed, to retrieve a matching character string. If a sufficiently long matched character string is found, the position of its leading end in the historical data array and its length are output as the codeword for the matched character string in the lookahead region.
Here, by applying variable length coding to values such as the length of the matched character string, compression efficiency can be further improved. By always taking the most recent occurrence as the leading position of the matched character string and expressing that position as a distance relative to the leading end of the lookahead region, relatively small values can be used, which enhances the improvement in compression rate obtained by variable length coding. The historical data array is updated so that it holds the most recently coded data.
Namely, the LZ77 type data compression system uses the most recently coded data as a "dictionary", and the character strings registered in the dictionary are all character strings of a predetermined length or shorter that start from entries storing already compressed data in the historical data array. This means that the number of character strings registered in the dictionary is greater than in the LZ78 compression method, which is another form of the Ziv and Lempel system. Accordingly, longer matched character strings are easier to find, and a higher compression rate is achieved.
As a measure for the case where a sufficiently long matched character string cannot be found, there is an algorithm called the LZSS type method. The LZSS type data compression system employs two coding modes. One is a copy mode, used when a sufficiently long matched character string (normally 3 symbols or more when one symbol is 1 byte) is found, and the other is a literal mode, used when no sufficiently long matched character string can be found.
In order to discriminate between the copy mode and the literal mode, mutually distinct codewords are used for the respective modes. In the copy mode, the codeword consists of a bit string indicating the matching length and a bit string indicating the matching position. In the literal mode, the leading character of the character string to be compressed is used as the codeword as is. To discriminate between the literal mode and the copy mode, either a one-bit flag is provided or a larger alphabet is newly formed by combining the matching lengths with the alphabet of the input data.
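The retrieval step described above can be illustrated by a minimal sketch in which no special retrieving means is provided. The function name `longest_match` and the parameters `window` and `max_len` are hypothetical names chosen for illustration, and a flat byte string stands in for the circular historical data array.

```python
def longest_match(data: bytes, pos: int, window: int, max_len: int) -> tuple[int, int]:
    """Return (distance, length) of the longest match for data[pos:] that
    begins within the previous `window` bytes, or (0, 0) if none is found.

    Illustrative assumption: `data` is a flat buffer, not a ring buffer.
    """
    best_dist, best_len = 0, 0
    start = max(0, pos - window)
    limit = min(max_len, len(data) - pos)
    # Scan candidates from the most recent to the oldest, so that among
    # equally long matches the nearest one (smallest distance) is kept.
    for cand in range(pos - 1, start - 1, -1):
        length = 0
        while length < limit and data[cand + length] == data[pos + length]:
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return best_dist, best_len
```

For example, `longest_match(b"abcabcabcx", 3, 8, 16)` returns a distance of 3 and a length of 6. The match may run past `pos` into the lookahead region itself, which is permissible because the decoder reproduces those symbols before it needs them.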
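The two coding modes may be sketched as follows. This is a simplified illustration and not the coding format of any particular tool: codewords are kept as Python tuples rather than packed bit strings, the tags `"literal"` and `"copy"` play the role of the one-bit flag, and the names `lzss_encode`, `lzss_decode`, `window` and `max_len` are assumptions introduced here.

```python
MIN_MATCH = 3  # copy-mode threshold from the text (3 one-byte symbols)

def lzss_encode(data: bytes, window: int = 4096, max_len: int = 18) -> list:
    """Encode `data` into a list of ("literal", byte) / ("copy", dist, len) tokens."""
    tokens = []
    pos = 0
    while pos < len(data):
        best_dist = best_len = 0
        limit = min(max_len, len(data) - pos)
        # Brute-force search; later (more recent) candidates win ties.
        for cand in range(max(0, pos - window), pos):
            length = 0
            while length < limit and data[cand + length] == data[pos + length]:
                length += 1
            if length > 0 and length >= best_len:
                best_dist, best_len = pos - cand, length
        if best_len >= MIN_MATCH:
            tokens.append(("copy", best_dist, best_len))   # copy mode
            pos += best_len
        else:
            tokens.append(("literal", data[pos]))          # literal mode
            pos += 1
    return tokens

def lzss_decode(tokens: list) -> bytes:
    """Rebuild the original data from the token list."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "literal":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte so overlapping copies work
    return bytes(out)
```

Encoding `b"abcabcabcabcx"` with the defaults yields three literals, one copy token with distance 3 and length 9, and a final literal, and decoding the tokens restores the input without distortion.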
When LZ77 type data compression is performed, it is important to compare the input character string to be compressed with the character strings in the historical data array at high speed. As shown in FIG. 6, a retrieving portion 62 is provided for retrieving character strings, by means of a certain retrieving means, from the historical data array stored in a historical data array storage portion 61, and all of the character strings in the historical data array have to be registered in the retrieving portion 62. Currently, apart from parallel retrieval of the historical data array by dedicated hardware, and in view of the fact that the historical data array is sequentially updated, a retrieving means utilizing a tree structure, such as a binary tree or a trie, or a retrieving means using a hash table with a linear list for collisions may be employed. These retrieving means perform retrieval using pointers indicating the position of the leading end of each character string in the historical data array. Namely, these retrieving means read the character string from the historical data array on the basis of the pointer indicating its leading position and compare it with the target character string.
One example of the data compression technology having the retrieving means employing the binary tree has been disclosed in "IEEE Transactions on Communications" (Vol. 34, No. 12, 1986, P. 1176-1182). Another example, having the retrieving means employing the trie structure, has been disclosed in U.S. Pat. No. 4,906,991, for "Textual Substitution Data Compression with Finite Length Search Windows". A further example, having the retrieving means employing the hash table and the linear list, has been disclosed in Japanese Unexamined Patent Publication (Kokai) No. Heisei 3-68219, for "Data Compression Apparatus and Method".
FIG. 7 is an illustration showing a condition of the historical data array stored in the historical data array storage portion 61. FIGS. 8, 9 and 10 show the construction of the respective retrieving means for the condition of the historical data array illustrated in FIG. 7. FIG. 8 shows the retrieving means employing the binary tree, FIG. 9 shows the retrieving means employing the trie, and FIG. 10 shows the retrieving means having the linear list for collisions in the hash table. Since the character strings which overflow from the historical data array must be deleted from the retrieving means, the retrieving method has to have a low cost for deletion.
In the LZ77 type data compression system, a higher compression rate can be achieved when a long matched character string is found. Accordingly, if matching can be attempted against a greater number of character strings, the likelihood of finding a longer matched character string increases, and the compression rate improves. For this purpose, the size N of the historical data array and the maximum length L of the matched character string are made greater. However, making the size N of the historical data array and the maximum length L of the matched character string greater gives rise to three problems.
First of all, increasing the size N of the historical data array and the maximum length L of the matched character string inherently increases the number of bits required for expressing the position and length of the character string, which may lower the compression rate. However, this problem can be solved to a certain extent by variable length coding of the position and length of the character string. Assuming one symbol is 1 byte (8 bits), and under the premise that the position and the length of the character string are coded by fixed length coding, it is said to be optimal to have a historical data array length N of approximately 8192 and a maximum length L of the matched character string of approximately 64. However, by employing variable length coding, the best compression rate is attained at a historical data array length N of 32768 to 65536 and a maximum length L of the matched character string of 256 to 1024. Therefore, much greater values can be accommodated.
Secondly, increasing the size N of the historical data array and the maximum length L of the matched character string results in greater memory consumption upon coding. However, if the size N of the historical data array and the maximum length L of the matched character string are in the range of N=65536 and L=2048, only about 2 Mbytes of memory is required, including the retrieving means. Therefore, coding can be executed without any problem on currently available workstations or high performance personal computers.
Thirdly, increasing the size N of the historical data array and the maximum length L of the matched character string significantly lowers the coding speed. One index for estimating the coding speed in LZ77 type coding is the number of character comparisons. Assuming the data length is M, the number of character comparisons when no special retrieving means is provided and when the binary tree, the trie, or the hash table with the linear list is employed may be approximated as follows:
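The growth in codeword size under fixed length coding can be checked with a short calculation. The sketch below assumes, for illustration only, that a position is one of N distinct values and a length is one of L distinct values, and it ignores the bit or alphabet extension used to discriminate the copy mode from the literal mode; `fixed_codeword_bits` is a hypothetical helper name.

```python
def fixed_codeword_bits(N: int, L: int) -> int:
    """Bits needed for one fixed-length (position, length) pair.

    Assumes N possible positions and L possible lengths (an illustrative
    simplification; real formats may offset or bias these ranges).
    """
    position_bits = (N - 1).bit_length()  # ceil(log2(N)) for N >= 2
    length_bits = (L - 1).bit_length()    # ceil(log2(L)) for L >= 2
    return position_bits + length_bits
```

Under these assumptions, N=8192 and L=64 require 13 + 6 = 19 bits per copy codeword, whereas N=65536 and L=1024 require 16 + 10 = 26 bits, which is why the larger parameters pay off only when variable length coding keeps the typical codeword short.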
No special retrieving means is provided . . . M'LN
Retrieving means of binary tree is employed . . . ML log_2 N
Retrieving means of trie is employed . . . ML
Retrieving means of hash table is employed . . . M'LN'

historical data array storage means having a plurality of entries for storing input data and for storing coded input data;
retrieving and recording means for performing retrieval of matched character strings through comparison of the input data to be compressed and the coded input data stored in said historical data array storage means and recording character strings in said historical data array storage means;
recording control means for selecting said character strings in said historical data array storage means to be recorded in said retrieving and recording means; and
coding means for performing coding of said matched character strings in the input data to be compressed by using the length of said matched character strings and the position of said matched character strings included in said recorded character strings in said retrieving and recording means.
historical data array storage means having a plurality of entries for storing input data and for storing coded input data;
retrieving and recording means for performing retrieval of matched character strings through comparison of the input data to be compressed and the coded input data stored in said historical data array storage means and recording character strings in said historical data array storage means;
recording control means for selectively recording character strings beginning from positions in a range between the leading end of said matched character strings stored in said historical data array storage means and a given digit represented by a threshold value from the leading end of said matched character strings in said retrieving and recording means; and
coding means for performing coding of said matched character strings in the input data to be compressed by using the length of said matched character strings and the position of said matched character strings included in said recorded character strings in said retrieving and recording means.
In the LZ77 type data compression, the data is divided into a plurality of partial character strings, and a codeword is assigned to each partial character string. In the foregoing formulae, M' represents the number of the partial character strings. On the other hand, N' represents the average length of the linear list.
In the case of retrieval employing the binary tree or the trie structure, a search is required in order to register each character string at the appropriate position in the retrieving means. Therefore, the retrieving means must search for the character strings of the predetermined length or shorter beginning at every position in the data.
In contrast to this, in the case of the retrieving means employing the hash table, retrieval is required only for the character strings beginning from the leading end of the character string to be coded. Registration of a character string beginning from an intermediate portion of the matched character string can be performed by calculating a hash function and rewriting one record in the hash table, so that no retrieval of the character string is required. Thus, in the method employing the hash table, the number of character comparisons is expressed not in terms of the data length M but as a multiple of the number M' of the partial character strings. The average linear list length N' may, at the maximum, become equal to the length N of the historical data array, and it cannot be made smaller unless a sufficiently large number of hash table entries is provided. On the other hand, if a slight lowering of the compression rate is allowed, traversal of the linear list can be cut off after an appropriate number of entries, which permits this constant to be made smaller.
Thus, when the difference between the number M' of the partial character strings and the data length M becomes large, a higher processing speed can be expected from the retrieving means employing the hash table, in which the number of character comparisons is proportional to the number M' of the partial character strings, than from the retrieving means employing the binary tree or the trie. Particularly, in the case of data having large redundancy, since the difference between the data length M and the number M' of the partial character strings becomes large, a significant difference in execution time may arise between the retrieving means employing the hash table and the retrieving means employing the binary tree or the trie. When the length N of the historical data array is set at 65536 and the maximum length L of the matched character string at 1024, substantial degradation of the coding speed is inherently caused in the retrieving means employing the binary tree or the trie. Therefore, in such a case, data compression is not practically possible unless the hash table is employed.
However, even with the retrieving means employing the hash table, when the longest matched character string is to be obtained, the average linear list length N' becomes large. Therefore, the number of character string comparisons required for one retrieval cycle may become unacceptably large. In that case, the efficiency of the data compression processing becomes comparable to the case where no special retrieving means is provided, and the execution speed may thus be lowered.
One method for speeding up the retrieving process for the character string in the case where the tree structure is employed in the retrieving means has been disclosed in the above-identified U.S. Pat. No. 4,906,991. That patent discloses a technology for registering, in a retrieving means which employs the trie structure, only the character strings beginning from the leading end of the character string to be coded. However, if only the character strings beginning from the leading end of the character string to be coded are registered, a sufficient compression rate cannot be achieved.
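To give a feel for the relative magnitudes, the four approximations can be evaluated directly. The sketch below simply codes the formulae given above; the function name `comparison_estimates` and any numbers supplied for M, M' and N' are illustrative assumptions and not measurements.

```python
import math

def comparison_estimates(M: int, M_prime: int, L: int, N: int, N_prime: int) -> dict:
    """Approximate character-comparison counts for each retrieving means.

    M       : data length
    M_prime : number of partial character strings (M')
    L       : maximum length of the matched character string
    N       : size of the historical data array
    N_prime : average length of the linear list (N')
    """
    return {
        "none": M_prime * L * N,
        "binary_tree": M * L * math.log2(N),
        "trie": M * L,
        "hash_table": M_prime * L * N_prime,
    }
```

For instance, with the hypothetical values M = 1,000,000, M' = 10,000, L = 1024, N = 65536 and N' = 32, the estimate for the hash table (about 3.3 x 10^8) is smaller than that for the trie (about 1.0 x 10^9) and for the binary tree (about 1.6 x 10^10), because it scales with M' rather than with M.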
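The hash table behaviour described above may be sketched as follows. This is a simplified illustration under stated assumptions, not the structure of any cited publication: a Python `dict` keyed by the first `key_len` bytes stands in for the hash table, a `deque` of positions stands in for the linear list, and `max_chain` is a hypothetical parameter for cutting the list off. Deletion of positions that have left the historical data array is omitted.

```python
from collections import defaultdict, deque

class HashChainIndex:
    """Hash table whose records are linear lists of string start positions."""

    def __init__(self, key_len: int = 3, max_chain: int = 32):
        self.key_len = key_len
        self.max_chain = max_chain
        self.table = defaultdict(deque)

    def register(self, data: bytes, pos: int) -> None:
        """Register the string starting at `pos`: one hash computation and
        one record update, with no character comparison at all."""
        if pos + self.key_len <= len(data):
            chain = self.table[data[pos:pos + self.key_len]]
            chain.appendleft(pos)            # most recent position first
            if len(chain) > self.max_chain:
                chain.pop()                  # cut off the oldest entry

    def find(self, data: bytes, pos: int, max_len: int) -> tuple[int, int]:
        """Return (distance, length) of the longest registered match, or (0, 0)."""
        best_dist = best_len = 0
        limit = min(max_len, len(data) - pos)
        for cand in self.table.get(data[pos:pos + self.key_len], ()):
            length = 0
            while length < limit and data[cand + length] == data[pos + length]:
                length += 1
            if length > best_len:
                best_dist, best_len = pos - cand, length
        return best_dist, best_len
```

In an encoder built on this sketch, `find` would be called once per partial character string, while `register` would be called for every position, including positions in the intermediate portion of a matched character string. A smaller `max_chain` bounds the work per retrieval at the price of possibly missing a longer match.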