In recent years, as computers have come to handle various types of data, such as character codes and image data, the amount of data to be handled has been increasing. When such a large amount of data is handled, the data is compressed by omitting redundant portions, thereby reducing the required storage capacity and allowing high-speed transmission to a remote place.
Here, the present invention can be applied not only to compression of character codes but also to compression of various other data. In the following description, in accordance with the terminology of information theory, one word of data obtained by dividing a data string in units of words is referred to as a character, and a data string having an arbitrary number of words is referred to as a character string.
Conventional data compression technologies include dictionary coding using similarity of data series and statistical coding using a frequency of appearance of a data string. Of these, as typical schemes of the former dictionary coding, LZ77 coding and LZ78 coding have been known (Tomohiko Uematsu, “An introduction to document data compression algorithm”, CQ Publishing, pp. 131-208, 1995).
By comparison between LZ77 coding and LZ78 coding, LZ77 coding can achieve a sufficient compression ratio with a simple process, and therefore has become mainstream in practical use.
In LZ77 coding, as shown in FIG. 1, a slide buffer 200 having a certain size is provided. In this buffer 200, a character string having the longest match with an input character string is searched for, and the input character string is coded by using the position and length of that character string. Since the buffer 200 is slid as coding proceeds, this coding scheme is also called a slide dictionary scheme. In FIG. 1, when an input character string “abcdaaaq” at the right of the buffer 200 is coded, the longest character string in the buffer 200 that matches it is “abcd”. Thus, with a relative address “5 (bytes)” between the head position of the longest-match character string and the head position of the input character string being taken as a match position and the length of the longest-match character string “4 (bytes)” being taken as a match length, a code, such as (match position, match length)=(5, 4), is generated. With this, the head “abcd” of the input character string is replaced by (5, 4). Similarly, the next character string “aaa” is replaced by (13, 3). However, the slide buffer in practical use is much longer, and when the character strings in the buffer are searched sequentially in order to find the character string having the longest match, an enormous amount of time is required. Therefore, in practice, instead of checking all character strings in the buffer, the position at which a prefix unit (on the order of two to four characters) of a character string appears is registered in a table as required, and only the character strings whose positions are retained in the table are checked. Examples of the table for use in such a search include a Look Up Table (LUT) and a Hash Table.
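The sequential longest-match search described above can be sketched as follows. This is a minimal illustration rather than the scheme of FIG. 1 itself: the function names, the window size of 32, the minimum match length of 3, and the sample data are all assumptions introduced here, and positions are counted from 0.

```python
def longest_match(data, pos, window=32):
    """Return (distance, length) of the longest match for data[pos:] found
    in the `window` characters before pos, or None if nothing matches."""
    best_dist, best_len = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        # Compare only against characters already in the buffer (before pos).
        while (cand + length < pos and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        # ">=" prefers the nearest candidate among equally long matches.
        if length > 0 and length >= best_len:
            best_dist, best_len = pos - cand, length
    return (best_dist, best_len) if best_len else None


def lz77_encode(data, window=32, min_len=3):
    """Replace repeated strings by (match position, match length) pairs."""
    out, pos = [], 0
    while pos < len(data):
        match = longest_match(data, pos, window)
        if match and match[1] >= min_len:
            out.append(match)          # coded as (distance, length)
            pos += match[1]
        else:
            out.append(data[pos])      # too short to be worth coding
            pos += 1
    return out


# Hypothetical sample: "abcdaaaq" is coded against the nine characters before it.
sample = "abcdeaaaf" + "abcdaaaq"
tokens = lz77_encode(sample)
```

With this sample, the last three tokens are (9, 4) for “abcd”, (8, 3) for “aaa”, and the literal “q”. Every call to `longest_match` scans the whole window, which is the cost that the table-based searches are meant to avoid.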
FIG. 2 shows a character string search using a LUT. A LUT 202 stores the position (address or pointer) at which a character string appears in the buffer 200, with the prefix unit of that character string being taken as an address. At the time of a search, an area of the LUT 202 is accessed with the prefix unit of the input character string being taken as an address, thereby acquiring the position of the corresponding character string. If plural character strings having the same prefix unit are present in the buffer 200, plural positions of appearance are retained in the form of a linked list 204, as shown in FIG. 3. Thus, by accessing the LUT 202 only once, the positions of all corresponding character strings in the buffer 200 can be acquired. Here, a prefix unit of two characters is used, and the area of the LUT 202 corresponding to the prefix unit “ab” of the input character string retains two positions of appearance by using the linked list 204.
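A LUT with a two-character prefix unit and a linked list of positions might be sketched as below. The names `build_lut`, `positions`, `head`, and `link` and the sample buffer are assumptions made for illustration; positions are counted from 0 and the value -1 marks an empty area or the end of a list.

```python
def build_lut(buffer):
    """Build a direct-addressed table over all two-byte prefix units."""
    head = [-1] * (256 * 256)   # one area per possible two-byte prefix unit
    link = [-1] * len(buffer)   # link[p]: earlier position with the same prefix
    for pos in range(len(buffer) - 1):
        slot = buffer[pos] * 256 + buffer[pos + 1]
        link[pos] = head[slot]  # chain the previous occurrence behind this one
        head[slot] = pos
    return head, link


def positions(head, link, prefix):
    """Return every position of the two-byte prefix, most recent first."""
    found = []
    pos = head[prefix[0] * 256 + prefix[1]]   # single table access
    while pos != -1:
        found.append(pos)
        pos = link[pos]                        # follow the linked list
    return found


# Hypothetical buffer in which the prefix unit "ab" appears twice.
head, link = build_lut(b"xabcdeabq")
ab_positions = positions(head, link, b"ab")
```

Here `ab_positions` is `[6, 1]`: one table access yields the head of the list, and the remaining positions are reached by following the links.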
As such, in the LUT, each prefix unit to be searched for has a one-to-one correspondence with an area of the table, and the necessary information is acquired by referring to the table only once, thereby allowing an extremely high-speed search. However, when a long prefix unit is to be searched for, the number of areas required in the table grows as a power of the number of characters that can appear, so that an enormous number of areas is required. For example, when 256 different 8-bit characters can appear, the number of areas required for prefix units of n characters is the n-th power of 256. Moreover, as the prefix unit to be searched for is made longer, only a part of the areas provided is actually used (registered), and the inside of the table becomes sparse. Thus, the longer the prefix unit, the lower the efficiency in memory use. To get around this, in a hash table, when a search character string is used as an address, the character string is reduced to a numerical value not exceeding a certain limit, so that a plurality of character strings share one area. Thus, after a table search, a check is required as to whether the acquired character string is really the character string being searched for. In comparison with the LUT, however, a longer character string can be searched for with a table area of the same size.
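The growth of the LUT and its sparseness can be checked with a little arithmetic. In the sketch below, the function names and the buffer size of 4096 characters are assumptions chosen for illustration; the only fact taken from the text is that prefix units of n characters over 256 possible characters need 256 to the n-th power areas.

```python
def lut_slots(n, alphabet=256):
    """Number of table areas needed for prefix units of n characters."""
    return alphabet ** n


def max_fill_ratio(buffer_len, n, alphabet=256):
    """Upper bound on the fraction of areas a buffer can register.

    A buffer of buffer_len characters contains at most buffer_len - n + 1
    prefix units of n characters, even if all of them are different.
    """
    used = max(0, buffer_len - n + 1)
    return min(used, lut_slots(n, alphabet)) / lut_slots(n, alphabet)


for n in (1, 2, 3, 4):
    print(n, lut_slots(n), max_fill_ratio(4096, n))
```

For n = 3 the table needs 16,777,216 areas, of which a 4096-character buffer can fill at most 4094, or about 0.02 percent.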
FIG. 4 shows a character string search using a hash table. A hash code generating unit 206 generates a hash code 208 from the prefix unit “abc” of the input character string, and uses the hash code as an address to access a hash table 210. In the hash table 210, a position in the buffer 200 corresponding to the hash code 208 is stored. The character string “abcde” at that position is compared with the input character string to check whether their prefix units match. If they match, it is determined that a character string matching the input character string is present in the buffer 200. As with the LUT, in the hash table, for plural character strings having the same prefix unit in the buffer, plural positions of appearance are retained in the form of a linked list. In either case, the linked list is used for searching for the longest-match character string.
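The hash-table search, including the check that is needed because several prefix units share one area, might look like the following sketch. The hash function, the table size, and all names here are assumptions made for illustration and are not taken from FIG. 4; positions are counted from 0.

```python
def hash_prefix(prefix, table_bits=12):
    """Reduce a prefix unit to a value below 2**table_bits (hypothetical hash)."""
    mask = (1 << table_bits) - 1
    code = 0
    for byte in prefix:
        code = (code * 31 + byte) & mask
    return code


def build_hash_table(buffer, prefix_len=3, table_bits=12):
    head = [-1] * (1 << table_bits)   # far fewer areas than 256 ** prefix_len
    link = [-1] * len(buffer)         # linked list of earlier positions
    for pos in range(len(buffer) - prefix_len + 1):
        slot = hash_prefix(buffer[pos:pos + prefix_len], table_bits)
        link[pos] = head[slot]
        head[slot] = pos
    return head, link


def find_longest(buffer, head, link, target, prefix_len=3, table_bits=12):
    """Return (position, length) of the longest match for target, or (-1, 0)."""
    best_pos, best_len = -1, 0
    prefix = target[:prefix_len]
    pos = head[hash_prefix(prefix, table_bits)]
    while pos != -1:
        # Different prefix units may share an area, so verify before use.
        if buffer[pos:pos + prefix_len] == prefix:
            n = prefix_len
            while (n < len(target) and pos + n < len(buffer)
                   and buffer[pos + n] == target[n]):
                n += 1
            if n > best_len:
                best_pos, best_len = pos, n
        pos = link[pos]               # trace the linked list one entry at a time
    return best_pos, best_len


buffer = b"xxabcdezzabcqq"            # hypothetical buffer contents
head, link = build_hash_table(buffer)
result = find_longest(buffer, head, link, b"abcdaaaq")
```

With this buffer, `result` is `(2, 4)`: both occurrences of “abc” are visited through the linked list, and the one followed by “d” gives the longer match. Passing the same small `table_bits` value to both functions forces many prefix units into one area and exercises the verification step.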
However, such conventional data compression technologies have the following problems. First, when a LUT is used to search for a long character string, even if a table having an enormous area is used, only a part thereof is used, thereby causing the inside of the table to be in a sparse state. Although the hash table is smaller than the LUT, the inside of the table is similarly in a sparse state if the amount of input data is small. This poses a problem in which the memory is not necessarily used effectively. Moreover, when the longest-match character string is searched for, the plural positions of appearance retained in the linked list have to be traced one by one. This poses another problem in which, as the number of character strings having the same prefix unit increases, the search process takes more time.
To solve these problems, the inventors of the present invention have suggested a data compression method capable of performing a search with a smaller amount of memory that is proportional to the amount of input data (Japanese Patent Application No. 2000-98834). In this method, an input buffer is provided and a search table for the input buffer is created at one time, instead of the conventional scheme of sequentially registering entries in a search table while coding proceeds. For a search, a rank list is used in which the character strings starting at the respective addresses in the input buffer are sorted according to their contents. Among the possible schemes, the one that can be implemented with the least amount of memory generates a recent match position list from the rank list and finds a match by detecting, in the recent match position list, a portion where the same numbers are successively present.
FIGS. 5A-5D show specific examples of the input buffer, the rank list, and the recent match position list for use in the method suggested by the inventors of the present invention. This method is carried out by the following procedure.
(Data Input and List Generation)
In an input buffer 212 of FIG. 5A, an amount of data corresponding to the buffer size is input, a coding-target position address t is initialized as t=1, and then a rank list 214 of FIG. 5B and a recent match position list 216 of FIG. 5C are created. Here, the rank list 214 is created by sorting the three-character strings starting at each address in the input buffer 212 in numerical order. The recent match position list 216 stores, for each address, the relative position of the address at which the same character string most recently appeared. For example, the character string “com” starting at address 15 most recently appeared at address 1, that is, at a relative position of 14. Therefore, the relative position 14 is stored at address 15 in the recent match position list 216. Note that, in Japanese Patent Application No. 2000-98834, the address itself is stored in the recent match position list. In that case, the address 1 is stored at address 15 in the recent match position list 216.
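The generation of the two lists can be sketched as follows. The buffer contents, the function name `build_lists`, and the use of 0 to mean "no earlier appearance" are assumptions made for illustration, since the data of FIG. 5A is not reproduced here. Addresses are counted from 1, as in the description.

```python
def build_lists(buf, k=3):
    """Return (rank list, recent match position list) for k-character strings."""
    count = len(buf) - k + 1                 # number of k-character strings

    def gram(addr):
        return buf[addr - 1:addr - 1 + k]

    # Rank list: start addresses sorted by the content of their string.
    # Equal strings end up adjacent, in order of appearance.
    rank = sorted(range(1, count + 1), key=lambda addr: (gram(addr), addr))

    # Recent match position list, indexed by address (index 0 is unused).
    recent = [0] * (len(buf) + 1)
    for prev, cur in zip(rank, rank[1:]):
        if gram(prev) == gram(cur):
            recent[cur] = cur - prev         # relative position of the most
                                             # recent earlier appearance
    return rank, recent


# Hypothetical input buffer: "compress" appears at addresses 1 and 13.
buf = "compression compresses"
rank, recent = build_lists(buf)
```

For this buffer, `recent[13]` through `recent[18]` all hold 12, because each of the six three-character strings inside the second “compress” most recently appeared 12 addresses earlier. Storing `prev` instead of `cur - prev` would give the address-based variant.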
(Detection and Coding of a Matching Character String)
A matching character string is detected from a portion where the same numbers are successively present in the recent match position list 216. Referring to the recent match position list 216 in FIG. 5D, numbers 14 are successively present at addresses 15 to 20, numbers 9 are successively present at addresses 24 to 29, and numbers 23 are successively present at addresses 30 to 32. First, the numbers 14 successively present at the addresses 15 to 20 indicate a match with the character string starting at address 15−14=1. Since each entry stands for a three-character string, the match length is the six successive entries plus the two remaining characters of the last string, that is, 6+2=8, and the match position is 14. Thus, (match length, position)=(8, 14) is generated as a code. Also, the numbers 9 and 23 successively present at the addresses 24 to 29 and 30 to 32, respectively, indicate a match with the character string starting at address 24−23=1; the match length is 9+2=11, and the match position is 23. Thus, (match length, position)=(11, 23) is generated as a code.
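The simplest case of this detection, a single run of identical numbers, can be sketched as follows. The buffer and its recent match position list are hypothetical values chosen for illustration, and the sketch does not attempt the case in which two different runs are combined into one code. Addresses are counted from 1, and 0 means that no earlier appearance exists.

```python
def encode_runs(buf, recent, k=3):
    """Emit literals and (match length, position) codes from runs in `recent`."""
    out = []
    t = 1                                    # coding-target position address
    while t <= len(buf):
        dist = recent[t]
        if dist:
            run = 1
            while t + run < len(recent) and recent[t + run] == dist:
                run += 1
            length = run + (k - 1)           # entries in the run, plus the rest
            out.append((length, dist))       # of the last k-character string
            t += length
        else:
            out.append(buf[t - 1])
            t += 1
    return out


def decode(tokens):
    """Inverse of encode_runs, used here only to check the result."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            length, dist = tok
            for _ in range(length):
                out.append(out[-dist])       # copy from `dist` characters back
        else:
            out.append(tok)
    return "".join(out)


buf = "compression compresses"               # hypothetical input buffer
# Recent match position list for buf: entries 13 to 18 hold 12.
recent = [0] * 13 + [12] * 6 + [0] * 4
codes = encode_runs(buf, recent)
```

Here the six successive entries of 12 yield the code (8, 12) for the second “compress”, which corresponds to the 6+2=8 calculation above, and `decode(codes)` restores the original buffer.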
However, in the data compression method shown in FIGS. 5A-5D, in which portions where the same numbers are successively present are detected from the recent match position list, the longest match cannot be detected for data, such as that shown in an input buffer 212 of FIG. 6A, in which a repetition of a long character string contains repetitions of short character strings forming the long character string. That is, in the input buffer 212, between the long character strings “abcdef”, the short character strings “abc” and “cde” are repeated from addresses 7, 10, and 13. In a recent match position list 216 of FIG. 6B generated from the data of the input buffer 212, there is no portion where the same numbers are successively present, thereby posing a problem in which the repetition of the character string “abcdef” cannot be detected.
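This failure can be reproduced with a small experiment. The buffer below is a hypothetical reconstruction consistent with the description (the string “abcdef”, then “abc” from address 7 and “cde” from address 10, then “abcdef” again from address 13) and is not necessarily the data of FIG. 6A. For brevity, the list is built with a dictionary of last-seen addresses rather than from a rank list; the function names are assumptions.

```python
def recent_match_positions(buf, k=3):
    """Relative position of the most recent earlier appearance of each
    k-character string (0 if none). Addresses are counted from 1."""
    last_seen = {}
    recent = [0] * (len(buf) + 1)            # index 0 is unused
    for addr in range(1, len(buf) - k + 2):
        gram = buf[addr - 1:addr - 1 + k]
        if gram in last_seen:
            recent[addr] = addr - last_seen[gram]
        last_seen[gram] = addr
    return recent


def has_run(recent):
    """True if two neighbouring entries hold the same non-zero number."""
    return any(a and a == b for a, b in zip(recent, recent[1:]))


buf = "abcdef" + "abc" + "cde" + "abcdef"    # hypothetical input buffer
recent = recent_match_positions(buf)
print(recent[13:17], has_run(recent))
```

The entries at addresses 13 to 16 are 6, 12, 5, and 12: the intervening “abc” and “cde” become the most recent appearances of the first and third strings, so no run forms even though “abcdef” at address 13 repeats the string at address 1.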