The present application relates to a coding apparatus and a decoding apparatus based on LZSS (Lempel-Ziv-Storer-Szymanski) codes. More particularly, the present application relates to a coding apparatus and a decoding apparatus that dynamically change the relation between a matching length in a coding process and its code, to a coding method adopted by the coding apparatus, to a decoding method adopted by the decoding apparatus, and to a program that makes a computer execute the coding method and the decoding method.
The LZSS code is one of the codes used in dictionary-based coding processes. The coding process based on LZSS codes includes a reversible data compression process as disclosed in “Introduction to Algorithms for Compressing Text Data” authored by Tomohiko Uematsu and published by CQ Publishing on Oct. 15, 1994, pp. 131 to 138. In a coding process based on LZSS codes, input data to be coded is delimited into symbols each having a fixed length of M bits. An example of such a symbol is a character. In the following description, such a symbol is uniformly referred to as a character. Thus, a character is taken as the smallest unit, and data is perceived as a long string of characters. A sequence of characters included in the string of characters is handled as a sub-string of characters. The entire string of characters is disassembled into a plurality of character sub-strings and a CODE code is assigned to each of the character sub-strings. There are two types of CODE code assigned to a sub-string of characters. One type is a PTR code obtained as a result of a coding process carried out by referencing a matching sub-string of characters in already coded data Qenc. The other type is a RAW code, which is the original character itself. In addition, a FLG sub-code having a length of 1 bit is provided as a flag indicating whether the code type is PTR or RAW. The FLG sub-code and the CODE code form a pair of codes. This pair of codes is the code obtained as a result of a process to code a sub-string of characters.
The lengths of character sub-strings are confirmed sequentially starting with the first character of the input data in order to gradually carry forward a disassembly process to separate the sub-strings of characters from each other. The character sub-strings obtained as a result of a disassembly process do not include overlapping portions. The disassembly process to separate the sub-strings of characters from each other is carried out in such a way that, when one sub-string of characters is separated from the remaining sub-strings of characters, a character following immediately the tail character of the separated sub-string of characters becomes the head character of the character sub-string following the separated sub-string of characters. The sub-strings of characters separated from each other in a disassembly process are then coded sequentially. Prior to the disassembly process, only the head character H of a character sub-string is confirmed but the length is indeterminate. After the input data is disassembled into sub-strings of characters in accordance with the following procedure, however, the length of each character sub-string is determined.
First of all, already coded data Qenc is searched for a character matching the head character H of a character sub-string s to be separated in a disassembly process. The already coded data Qenc is a string of characters equal to input data starting from the head character of the input data and ending at the tail character of an already coded sub-string of characters. A range Qewin determined in advance has been set in the already coded data Qenc at a position relative to the character sub-string s to be separated in the disassembly process. The already coded data Qenc can be compared with only characters of the range Qewin. The range Qewin is also referred to as a slide window, a slide dictionary or another name.
When the range Qewin of the already coded data Qenc is searched for characters each matching the head character H of a character sub-string s to be separated in the disassembly process and at least one character is found in the search process, all character sub-strings each having the found character as its head character are each taken as an object of comparison with the character sub-string s. The comparison is carried out by gradually increasing the length of the character sub-string s to search for matching ones with a maximum length. Then, the matching character sub-string with the maximum length in the range Qewin is referred to as the longest matching character sub-string mstr. The length of the longest matching character sub-string mstr is referred to as mlen. The head character of the matching character sub-string mstr in the range Qewin is identified by its position relative to the head character H. If this position is represented by NP bits, the range Qewin can be used for storing up to N = 2^NP characters, where notation 2^NP denotes a value equal to the NPth power of 2. That is to say, N is the upper limit of the number of characters that can be stored in the range Qewin.
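The search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular apparatus: the function name `longest_match`, the brute-force scan, and the convention that the position is counted as a distance backward from the head character H (1 means the character immediately before H) are all hypothetical choices made for the example. The match is allowed to run past the head position, which corresponds to the overlapping case.

```python
def longest_match(data, head, window_size):
    """Return (mpos, mlen) for the longest match of data[head:] in the window.

    Assumption for this sketch: mpos is the distance counted backward from
    the head character (1 = the character just before it). (0, 0) means
    that no matching character was found in the window.
    """
    best_pos, best_len = 0, 0
    start = max(0, head - window_size)
    # Scan candidates from the nearest to the farthest window position.
    for cand in range(head - 1, start - 1, -1):
        k = 0
        # Extend the comparison one character at a time.
        while head + k < len(data) and data[cand + k] == data[head + k]:
            k += 1
        if k > best_len:  # a strictly longer match replaces the current best
            best_pos, best_len = head - cand, k
    return best_pos, best_len


if __name__ == "__main__":
    text = "abcfghxyzfghk"
    print(longest_match(text, 9, 8))   # second 'f' is the head character
    print(longest_match(text, 12, 8))  # 'k' has no match in the window
```

Because ties are broken in favor of the nearest candidate, the returned position is the smallest distance that achieves the maximum length; a real coder might use a hash table or a tree instead of a linear scan.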
In a process to code LZSS codes, the maximum matching length mlen is compared with a predetermined threshold value PTH. First of all, let us consider a case in which the maximum matching length mlen is greater than the predetermined threshold value PTH. In this case, if the maximum matching length mlen is not greater than a maximum length lmax that can be expressed by a matching-length code, the maximum matching length mlen is set in the matching length len. If the maximum matching length mlen is greater than the maximum length lmax, on the other hand, the maximum length lmax is set in the matching length len. If the matching length len is set in this way, the character sub-string s having the matching length len and the head character H as its head character is separated in a disassembly process and coded to generate (NP+NC) bits as a PTR code, which is a combination of a code p and a code c. To put it in detail, the code p consisting of NP bits is a code representing a number showing the position mpos of the head character nH of the longest matching character sub-string mstr in the range Qewin. On the other hand, the code c consisting of NC bits is a code representing the matching length len. In this case, the value of the FLG sub-code for the PTR code is 0.
Let us assume that the threshold value PTH is 2 and a sub-string of three characters ‘fgh’ in a slide window 111 of a data buffer 110 is detected in a search process as a character sub-string matching a character sub-string immediately following the slide window 111 as shown in FIG. 43A. In this case, a matching length defined as the length of the character sub-string matching a character sub-string immediately following the slide window 111 is three characters. The head character of the sub-string of three characters ‘fgh’ in a slide window 111 is the character ‘f’ and the relative position of the head character in the slide window 111 is a position of four. Thus, the PTR code is (4, 3). The FLG sub-code is set at 0 indicating that the code obtained as a result of the coding process is a PTR code.
If the maximum matching length mlen is not greater than the threshold value PTH or there is no character sub-string matching already coded data, on the other hand, only the head character H of the character sub-string is subjected to a disassembly process and the head character H is used as a RAW code having a length of M bits as it is. In this case, the FLG sub-code is set at 1 indicating that the code obtained as a result of the coding process is a RAW code.
For example, as shown in FIG. 43B, the slide window 111 contains no character sub-string matching the head character ‘k’ immediately following the slide window 111. In this case, the head character ‘k’ is output as a RAW code and the FLG sub-code is set at 1 indicating that the code obtained as a result of the coding process is a RAW code.
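The choice between a PTR code and a RAW code described above can be illustrated with the following Python sketch. It is written under several assumptions that are not taken from the description: codes are kept as tuples rather than packed bits, the position is a backward distance from the head character, the maximum length lmax is taken as 2^NC + PTH, and the names `lzss_encode` and `coded_bits` are hypothetical.

```python
def lzss_encode(data, NP=12, NC=4, PTH=2):
    """Disassemble data into (0, mpos, len) PTR codes and (1, char) RAW codes."""
    window = 1 << NP             # at most 2^NP characters can be referenced
    lmax = (1 << NC) + PTH       # assumed longest length the code c can express
    tokens = []
    head = 0
    while head < len(data):
        mpos, mlen = 0, 0
        for cand in range(head - 1, max(0, head - window) - 1, -1):
            k = 0
            # Capping k at lmax has the same effect as len = min(mlen, lmax).
            while (k < lmax and head + k < len(data)
                   and data[cand + k] == data[head + k]):
                k += 1
            if k > mlen:
                mpos, mlen = head - cand, k
        if mlen > PTH:
            tokens.append((0, mpos, mlen))   # FLG = 0: PTR code
            head += mlen
        else:
            tokens.append((1, data[head]))   # FLG = 1: RAW code
            head += 1
    return tokens


def coded_bits(tokens, M=8, NP=12, NC=4):
    """Total size in bits: 1 + NP + NC per PTR code, 1 + M per RAW code."""
    return sum(1 + NP + NC if t[0] == 0 else 1 + M for t in tokens)


if __name__ == "__main__":
    toks = lzss_encode("abcfghxyzfghk")
    print(toks)
    print(coded_bits(toks), "bits")
```

In this sketch a match of length PTH or less is always emitted as a single RAW code for the head character only, so the next search starts one character later.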
In an LZSS-code decoding process, on the other hand, the character sub-strings corresponding to the input codes are decoded one after another, from the first code to the last, in the same order as the codes were generated in the coding process. A character sub-string obtained as a result of the decoding process is concatenated to the tail of the already decoded data Qdec, extending Qdec. In this way, the original data generated by the decoding process is obtained as a character string that becomes longer gradually. Much like the data Qenc obtained as a result of a coding process, the data Qdec obtained as a result of a decoding process is referenced by using a number indicating a position relative to a character sub-string s serving as a decoding object of the decoding process. A FLG sub-code of 0 in the input code indicates that the CODE code of the input code is a PTR code. On the other hand, a FLG sub-code of 1 in the input code indicates that the CODE code of the input code is a RAW code. In the case of a RAW code, the CODE code itself, which is a character string consisting of only one character, is concatenated to the tail of the already decoded data Qdec. In the case of a PTR code, on the other hand, a code p is decoded to generate the position of the head character of a matching sub-string of characters and a code c is decoded to generate the matching length of the sub-string of characters. The position and the matching length are used to determine the sub-string of characters from the already decoded data Qdec. Then, the determined sub-string of characters is copied character by character starting with the head character and a result of the copy process is concatenated to the already decoded data Qdec. In this way, a sub-string of characters is obtained as a result of a process to decode CODE codes.
By copying the determined sub-string of characters one character at a time starting with the head character and concatenating a result of the copy process to the already decoded data Qdec as described above, the copy process can be carried out correctly even if the determined sub-string of characters partially or wholly overlaps the character string being decoded. The matching length of the matching sub-string of characters ranges from the value of the expression (PTH+1) to the value of the expression (2^NC+PTH), where notation 2^NC denotes the NCth power of 2 and notation NC denotes the number of bits representing the matching-length code c.
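The character-by-character copy described above can be sketched in Python as follows. The tuple format of the codes, (0, mpos, len) for a PTR code and (1, char) for a RAW code, the backward-distance convention for mpos, and the name `lzss_decode` are assumptions made for this illustration only.

```python
def lzss_decode(tokens):
    """Rebuild the original string from (0, mpos, len) and (1, char) codes."""
    out = []                       # already decoded data Qdec
    for tok in tokens:
        if tok[0] == 1:            # FLG = 1: RAW code, append the character
            out.append(tok[1])
        else:                      # FLG = 0: PTR code
            _, mpos, length = tok
            src = len(out) - mpos  # head character of the matching sub-string
            # Copy one character at a time so that a source region that
            # overlaps the characters being produced is still read correctly.
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)


if __name__ == "__main__":
    # The PTR code refers back by 1 but copies 9 characters (overlapping copy).
    print(lzss_decode([(1, "a"), (0, 1, 9)]))
```

Copying the whole source region in a single slice operation would fail in the overlapping case, because part of the region does not exist yet when the copy starts; the per-character loop avoids this.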
As described above, a PTR code for an LZSS code is a code including a number representing the position mpos of the head character of a matching character sub-string in a data buffer and the length len of the matching character sub-string. Let us consider a case in which the length len is associated with a code having a fixed bit count NC on a 1-to-1 basis. In this case, if the bit count NC is small, only a limited number of lengths len can be associated with a code having the fixed bit count NC. If the bit count NC is large, on the other hand, a large number of lengths len can be associated with a code having the fixed bit count NC. However, the use of a code having a small bit count NC to represent information provides a higher compression efficiency than the use of a code having a large bit count NC to represent the same information.
As is generally known, it is desirable to provide a search range Qewin with a size of about 8,000 characters as a search range of already coded data. For more information on the search range, the reader is referred to a document such as Non-Patent Document 1 described earlier. The bit count NP of the aforementioned position mpos is determined from the size of the search range Qewin. If the size of the search range Qewin is 4,096 characters, for example, a bit count NP of 12 bits can be used for expressing the aforementioned position mpos. However, character sub-strings separated with a large length such as 1,000 characters do not appear frequently. Rather, character sub-strings are separated with small lengths far more often. Therefore, the bit count of the length len is set at a value smaller than the bit count NP of the position mpos in many cases. Thus, in the case of a search range Qewin with a size of 4,096 characters, let us assume that a matching character sub-string with a length of 1,000 characters is found. Even in this case, a character sub-string can be separated only up to the maximum matching length permitted by the bit count of the length len. Let us assume for example that the bit count NC of the matching-length code is 4 and the threshold value PTH is 2. In this case, 16 different lengths, i.e., the lengths of 3 to 18, can be expressed by the matching-length code. Thus, even if the maximum value of the matching length is 1,000 characters, in the end, the string of characters is coded by disassembling it into character sub-strings each having a length not exceeding 18 characters.
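The arithmetic of this example can be checked with a short Python sketch. The helper names `length_range` and `split_long_match` are hypothetical, and the greedy splitting (always take the longest expressible length, and leave a remainder of PTH or fewer characters to RAW codes) is an assumption about how a coder would proceed.

```python
def length_range(NC, PTH):
    """Smallest and largest matching length expressible with an NC-bit code."""
    return PTH + 1, (1 << NC) + PTH


def split_long_match(total, NC, PTH):
    """Count the PTR and RAW codes needed to cover a match of `total` characters."""
    _, lmax = length_range(NC, PTH)
    n_ptr = 0
    remaining = total
    while remaining > PTH:          # a PTR code needs a length above PTH
        remaining -= min(remaining, lmax)
        n_ptr += 1
    return n_ptr, remaining         # leftover characters become RAW codes


if __name__ == "__main__":
    print(length_range(4, 2))            # lengths 3 to 18
    print(split_long_match(1000, 4, 2))  # 55 codes of 18 and one code of 10
```

With NP = 12 and NC = 4, each PTR code occupies 1 + 12 + 4 = 17 bits, so under these assumptions the 1,000-character match costs 56 × 17 = 952 bits even though a single position and a single length would describe it.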
As a method to eliminate this waste, an escape code indicating an extension of the length is assigned to one of the 16 matching-length codes and, after a process to decode this escape code, another fixed bit count is further fetched. In this way, a code having a variable bit count that is increased in stages can conceivably be used. Even with this method, in order to represent a long character string such as a string having a length of 1,000 characters, the code must be extended at several stages using several escape codes. Thus, this method raises the problems that a short code cannot be assigned to such lengths either and that the processing becomes complicated.
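One possible form of such a staged escape code is sketched below in Python, purely to make the drawback concrete. The details are assumptions and are not taken from the description: the all-ones value of an NC-bit field is used as the escape code, every extension stage fetches another field of the same NC bits, and the names `encode_length` and `decode_length` are hypothetical.

```python
def encode_length(length, NC=4, PTH=2):
    """Encode a matching length as a list of NC-bit fields with escape codes."""
    esc = (1 << NC) - 1              # assumed escape value: all bits set
    offset = length - (PTH + 1)      # the shortest codable length maps to 0
    if offset < 0:
        raise ValueError("length must be greater than PTH")
    fields = []
    while offset >= esc:             # each escape extends the length by esc
        fields.append(esc)
        offset -= esc
    fields.append(offset)            # final non-escape field ends the code
    return fields


def decode_length(fields, NC=4, PTH=2):
    """Inverse of encode_length: sum the fields until a non-escape value."""
    esc = (1 << NC) - 1
    total = PTH + 1
    for f in fields:
        total += f
        if f != esc:
            break
    return total


if __name__ == "__main__":
    for n in (3, 17, 18, 1000):
        fields = encode_length(n)
        print(n, len(fields), "fields,", 4 * len(fields), "bits")
```

Under these assumptions a length of 1,000 needs 67 fields (268 bits) and passes through 66 escape stages, while a length of 18, which fits in 4 bits without the escape code, now needs 8 bits.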