The present invention generally relates to dictionary searching systems, and more particularly to a dictionary searching system which is used for a universal coding such as an incremental parsing type Ziv-Lempel coding.
Recently, various kinds of data including character codes, vector information and image information have been processed in a computer, and the quantity of the data which is processed is rapidly increasing.
When storing and transmitting such a large quantity of data, it is desirable to compress the data quantity by omitting redundant parts of the data. For this reason, there is a demand to realize a method of efficiently compressing the data regardless of the kind of data.
Universal coding does not require a predetermined code table to be prepared in advance. Hence, universal coding may be applied to the compression of the various kinds of data described above.
In this specification, the data amounting to one word will be referred to as a "character", and the data corresponding to a plurality of consecutive words will be referred to as a "character string".
The Ziv-Lempel coding is a typical universal coding and is described, for example, in Seiji Munakata, "Ziv-Lempel Data Compression Algorithms", Information Processing, Vol. 26, No. 1, 1985. Universal type and incremental parsing type algorithms have been proposed. In addition, an LZSS coding has been proposed as a modification of the universal type algorithm and is described in Timothy C. Bell, "Better OPM/L Text Compression", IEEE Transactions on Communications, Vol. COM-34, No. 12, Dec. 1986. Furthermore, an LZW coding has been proposed as a modification of the incremental parsing type algorithm and is described in Terry A. Welch, "A Technique for High-Performance Data Compression", Computer, Jun. 1984.
Out of the coding techniques described above, the LZW coding is employed in file compression for storage devices and the like because it permits high-speed processing and its algorithm is simple.
According to the incremental parsing type algorithm, an input character string is decomposed into string components which are respectively formed by adding an extension character to a partial string which is already registered in a dictionary. The input character string is coded by describing the string components by a reference number corresponding to one partial string which is registered in the dictionary and an extension character. The string component described above is registered in the dictionary as a new partial string and used for the coding process which is carried out thereafter.
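The decomposition described above can be illustrated by the following minimal sketch, which is not taken from the cited references. The function name `incremental_parse` and the use of reference number 0 for the empty partial string are assumptions made only for this illustration.

```python
def incremental_parse(text):
    """Decompose text into (reference number, extension character) pairs.

    Assumption: reference number 0 denotes the empty partial string, so the
    first occurrence of any character is described as (0, character).
    """
    dictionary = {"": 0}      # partial string -> reference number
    pairs = []
    current = ""              # longest registered partial string matched so far
    for ch in text:
        if current + ch in dictionary:
            current += ch     # keep extending the match
        else:
            # emit the string component and register it as a new partial string
            pairs.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:
        # input ended inside a registered partial string: no extension character
        pairs.append((dictionary[current], ""))
    return pairs
```

For example, `incremental_parse("aaa")` gives `[(0, 'a'), (1, 'a')]`: the component "a" is registered as reference number 1, and the remaining "aa" is described as partial string 1 extended by "a".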
According to the LZW coding, the extension character is added to the following partial strings. For the sake of convenience, it will be assumed that the input character string is made up of three kinds of characters "a", "b" and "c" and is "ababcbababaaaaa . . . " as shown in FIG. 1A. In this case, reference numbers "1", "2" and "3" are respectively assigned to the three kinds of characters "a", "b" and "c" and registered in the dictionary before starting the coding process.
First, the first character "a" of the input character string is read and this character "a" is retrieved from the dictionary. The reference number "1" which corresponds to this read character "a" is then regarded as a code ω corresponding to the observed (or uncoded) character string.
Thereafter, the second and subsequent characters of the input character string are read successively, and each character is regarded as an extension character K. A partial string (ωK) which is described as a combination (ωK) of the above described code ω and the extension character K is then retrieved from the dictionary. When the partial string (ωK) is retrieved from the dictionary, the reference number corresponding to the partial string (ωK) is regarded as a new code ω, and the above described process is repeated by reading the next character of the input character string. Hereafter, the combination (ωK) will be referred to as a description of the partial string (ωK).
The character string which is to be coded is successively extended one character at a time in the above described manner, and the extended character string is successively retrieved from the dictionary. As a result, the longest partial string which matches the observed character string out of the input character string is retrieved from the dictionary by successively retrieving the registered character strings. The reference number which corresponds to the longest character string which matches the observed character string is output as the code ω. In addition, the partial string in which the extension character K is added to the partial string (ω) corresponding to the reference number ω is described by the combination (ωK) of the reference number ω and the extension character K, and this partial string is assigned a reference number and registered in the dictionary as a new partial string.
Therefore, the input character string shown in FIG. 1A is decomposed into the partial strings indicated by the underlines and coded into codes "1", "2", "4", . . . shown in FIG. 1B which correspond to the partial strings. FIG. 1C shows the corresponding relationship of the input character string and the partial strings registered in the dictionary. In addition, the following Table 1 shows an example of the dictionary which is formed.
TABLE 1
______________________________________
Partial String    Code
______________________________________
a                 1
b                 2
c                 3
1b                4
2a                5
4c                6
3b                7
5b                8
8a                9
1a                10
______________________________________
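The LZW coding of the example input can be traced with the following minimal sketch. It uses an ordinary Python mapping in place of any particular dictionary search structure, and the names `lzw_encode`, `single` and `table` are assumptions introduced only for this illustration. The input is assumed to be non-empty and to contain only characters of the given alphabet.

```python
def lzw_encode(text, alphabet="abc"):
    """Return (codes, table) for a non-empty text over the given alphabet."""
    # reference numbers 1, 2, 3, ... are assigned to the single characters
    single = {ch: ref for ref, ch in enumerate(alphabet, start=1)}
    table = {}                      # description (omega, K) -> reference number
    next_ref = len(single) + 1
    codes = []
    omega = single[text[0]]         # code of the first character
    for k in text[1:]:              # each further character is an extension K
        if (omega, k) in table:
            omega = table[(omega, k)]       # longer match found, keep extending
        else:
            codes.append(omega)             # output the longest match
            table[(omega, k)] = next_ref    # register the new partial string
            next_ref += 1
            omega = single[k]               # restart from the extension character
    codes.append(omega)             # flush the pending match at end of input
    return codes, table
```

For the fifteen characters "ababcbababaaaaa" this sketch yields the codes 1, 2, 4, 3, 5, 8, 1, 10, 10, and its first seven registrations are the descriptions 1b, 2a, 4c, 3b, 5b, 8a and 1a with reference numbers 4 through 10, in agreement with Table 1.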
The dictionary which is formed during the LZW coding has a tree structure as shown in FIG. 2, and the elements of the dictionary correspond to nodes of the tree. In FIG. 2, the number in brackets shown at each node indicates the reference number of the corresponding element of the dictionary.
If the registered elements of the dictionary are successively retrieved when retrieving the partial string during the coding process, the process would require too long a time. For this reason, the hashing method is used to retrieve the elements from the dictionary at a high speed.
According to the hashing method, a hashing function is defined using an element x of a set S which is made up of character strings. The hashing function is used to obtain an address of a storage location of the element x, and the element x is stored at the address which is obtained using the hashing function. The address which is obtained from the hashing function will be referred to as a hashing address.
For example, the reference number ω and the extension character K described above are described in binary numbers, and the combination (ωK) of these binary numbers is regarded as the hashing address. However, the dictionary in this case requires an extremely large memory capacity.
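The memory requirement can be estimated with a short calculation. The bit widths used below (12-bit reference numbers and 8-bit characters) are assumptions chosen only for illustration and are not stated in this specification, and the function names are likewise hypothetical.

```python
def direct_address(omega, k, char_bits=8):
    """Concatenate the binary reference number and the binary character."""
    return (omega << char_bits) | k

def direct_table_entries(ref_bits=12, char_bits=8):
    """Number of storage locations needed when every address must exist."""
    return 1 << (ref_bits + char_bits)
```

Under these assumptions the dictionary needs 1,048,576 storage locations even though at most 4,096 partial strings can ever be registered, so the great majority of the locations remain unused.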
For this reason, an open hashing (or chaining) method was proposed. The open hashing method uses a part of the hashing address and forms lists of elements having a value which is identical to the part of the hashing address. According to this open hashing method, a retrieval part is retrieved using the hashing address as shown in FIG. 3 so as to indicate the corresponding list. In addition, each list stores identification information corresponding to each element and a pointer which indicates a storage location of a next element, thereby making it possible to successively retrieve the elements from the dictionary.
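A minimal sketch of the open hashing (chaining) idea follows. The class name `ChainedDictionary`, the choice of the low-order 10 bits as the "part of the hashing address", and the use of Python lists in place of pointer-linked lists are all assumptions made for illustration; characters are assumed to have code values below 256.

```python
class ChainedDictionary:
    """Open hashing: a small retrieval part whose slots each head a list."""

    def __init__(self, bucket_bits=10, char_bits=8):
        self.char_bits = char_bits
        self.mask = (1 << bucket_bits) - 1
        # retrieval part: one (initially empty) list per partial address
        self.buckets = [[] for _ in range(1 << bucket_bits)]

    def _partial_address(self, omega, k):
        full = (omega << self.char_bits) | ord(k)   # full hashing address
        return full & self.mask                     # only a part of it is used

    def register(self, omega, k, ref):
        # store identification information together with the reference number
        self.buckets[self._partial_address(omega, k)].append(((omega, k), ref))

    def retrieve(self, omega, k):
        # walk the list successively until the identification matches
        for ident, ref in self.buckets[self._partial_address(omega, k)]:
            if ident == (omega, k):
                return ref
        return None
```

With the default widths the retrieval part has 1,024 slots regardless of how large the full address range is, at the cost of a successive search within each list.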
For example, if the reference number ω is regarded as the hashing address, a starting address of a list, which stores partial strings added with one extension character to the partial string which corresponds to the reference number ω, is stored at this hashing address, and partial strings corresponding to nodes which are "children (or sons)" of the node corresponding to the reference number ω are successively stored in the corresponding list. In this case, the extension character K of each element is stored in the list as the corresponding identification information.
FIG. 4 shows a flow chart for explaining a coding operation using the open hashing method for retrieving the elements from the dictionary.
As described above, the dictionary is initialized to include at least the first character of the input character string, and a variable n is set to the reference number which is assigned to the partial string which is to be registered next. For example, the reference numbers "1", "2" and "3" respectively assigned to the characters "a", "b" and "c" are stored in the dictionary as hashing addresses, and a numerical value "4" is set for the variable n.
Next, a maximum number of partial strings which can be registered in the dictionary is denoted by N_max, and sequences "Index", "List" and "Ext" respectively made up of N_max components are defined. An initial value "0" is set to all the components of these sequences "Index", "List" and "Ext". The sequence "Index" corresponds to the retrieval part shown in FIG. 3, and the sequences "List" and "Ext" correspond to the lists. Accordingly, a number which indicates the component of the sequence "List" which becomes the head of the list corresponding to a node of the reference number i is set in the ith component "Index[i]" of the sequence "Index". In addition, the extension character K of the element of the dictionary indicated by the reference number i is set in the ith component "Ext[i]" of the sequence "Ext". A pointer which indicates the element corresponding to the "brother" of the element of the reference number i is set in the ith component "List[i]" of the sequence "List".
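The roles of the three sequences may be easier to see when they are filled in for the dictionary of Table 1. The following is a minimal sketch; the helper names `build_lists` and `children`, the tuple layout of `TABLE_1`, and the allocation of one extra component so that reference numbers can be used directly as subscripts (with "0" meaning "none") are assumptions of this illustration.

```python
# (reference number, parent reference number, extension character) per Table 1
TABLE_1 = [(4, 1, "b"), (5, 2, "a"), (6, 4, "c"), (7, 3, "b"),
           (8, 5, "b"), (9, 8, "a"), (10, 1, "a")]

def build_lists(entries, n_max):
    """Fill Index, List and Ext from registered partial strings."""
    index = [0] * (n_max + 1)   # Index[i]: head of the child list of node i
    lst = [0] * (n_max + 1)     # List[i]: next sibling of element i
    ext = [0] * (n_max + 1)     # Ext[i]: extension character of element i
    for ref, parent, k in entries:
        ext[ref] = k
        if index[parent] == 0:
            index[parent] = ref         # first child defines the list
        else:
            j = index[parent]
            while lst[j] != 0:          # walk to the last sibling
                j = lst[j]
            lst[j] = ref                # append the new sibling
    return index, lst, ext

def children(index, lst, ext, ref):
    """Return [(reference number, extension character)] for the children of ref."""
    found = []
    i = index[ref]
    while i != 0:
        found.append((i, ext[i]))
        i = lst[i]
    return found
```

For Table 1, node 1 ("a") has the children 4 ("ab") and 10 ("aa"), so `Index[1]` is 4, `List[4]` is 10 and `List[10]` remains 0.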
Next, the first character K is read, and the coding process is started by setting the reference number corresponding to this character K as the variable i.
First, a step 701 reads the next character K of the input character string, and a step 702 decides whether or not a next character to be read exists. When the decision result in the step 702 is YES, a retrieval process with respect to the dictionary starts.
In this case, a step 703 saves the variable i in another variable ω and sets a variable j to an initial value "0". A step 704 sets, as the new variable i, the number of the component of the sequence "List" which is indicated by the value of the component "Index[i]" corresponding to the variable i.
A step 705 decides whether or not the variable i is equal to "0". When the numerical value of the variable i is not "0" and the decision result in the step 705 is NO, a search process with respect to the corresponding list is started using the element included in this list as a candidate.
In this case, a step 706 decides whether or not the component "Ext[i]" which indicates the extension character of the corresponding element is equal to the extension character K. When the decision result in the step 706 is NO, a step 707 sets the pointer of the next element set in the component "List[i]" as the new variable i, and the process returns to the step 705. The search with respect to the corresponding list is made by repeating the steps 705 through 707 in the above described manner.
On the other hand, when the decision result in the step 706 is YES, it is decided that a partial string which matches the input character string is registered in the dictionary. Hence, in this case, the process returns to the step 701 to read the next character, and the character string added with this next character is coded.
When the value of the component "List[i]" or "Index[i]" corresponding to the variable i is "0", the decision result in the step 705 is YES.
When the value of the component "Index[i]" is "0", the element corresponding to the "child" of the node of the variable i is not yet registered, and it is indicated that the corresponding list is undefined. On the other hand, when the value of the component "List[i]" is "0", it is indicated that the desired partial string is not stored in the corresponding list.
In either case, the reference number which is saved in the variable ω in the step 703 indicates the partial string which is registered in the dictionary and makes a longest match with the input character string. A step 708 outputs a code corresponding to this reference number ω, and carries out a registration process with respect to a new partial string.
First, a step 709 sets the value of the variable n in the variable i and increments the variable n. In addition, the step 709 sets the extension character K in the component "Ext[i]" corresponding to the variable i.
Next, a step 710 decides whether or not the value of the variable j is "0". When the decision result in the step 710 is YES, a step 711 sets the variable i in the component "Index[ω]" and defines the list which corresponds to the reference number ω. On the other hand, when the decision result in the step 710 is NO, a step 712 sets the variable i in the component "List[j]" and adds a new "sibling (or brother)" to the corresponding list.
When the registration process described above ends, a step 713 sets the reference number corresponding to the extension character K in the variable i, and the process returns to the step 701 to repeat the process described above. The decision result in the step 702 becomes NO when there are no more characters to be read. When the decision result in the step 702 is NO, a step 714 outputs a code corresponding to the variable ω at this time, and the process ends.
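The coding operation of steps 701 through 714 can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of FIG. 4: the function name `encode_with_lists` is hypothetical; the variable j is assumed to be updated to the last element visited before the step 707 advances the pointer; registration is assumed to stop once N_max elements exist; and at the end of the input the sketch outputs the reference number of the pending match held in the variable i, which is what a decoder needs.

```python
def encode_with_lists(text, alphabet="abc", n_max=4096):
    """Code a non-empty text using the sequences Index, List and Ext."""
    index = [0] * (n_max + 1)
    lst = [0] * (n_max + 1)
    ext = [0] * (n_max + 1)
    root = {ch: ref for ref, ch in enumerate(alphabet, start=1)}
    n = len(alphabet) + 1            # reference number to be registered next
    codes = []
    i = root[text[0]]
    for k in text[1:]:               # steps 701 and 702: read next character K
        omega, j = i, 0              # step 703
        i = index[i]                 # step 704: head of the child list
        while i != 0 and ext[i] != k:    # steps 705 and 706
            j = i                    # assumption: remember last element visited
            i = lst[i]               # step 707: follow the sibling pointer
        if i != 0:
            continue                 # match found: extend with the next character
        codes.append(omega)          # step 708: output the longest match
        if n <= n_max:               # assumption: stop registering when full
            i = n                    # step 709
            n += 1
            ext[i] = k
            if j == 0:
                index[omega] = i     # step 711: define the list of omega
            else:
                lst[j] = i           # step 712: add a new sibling
        i = root[k]                  # step 713
    codes.append(i)                  # step 714 (pending match, see lead-in)
    return codes, (index, lst, ext)
```

For the fifteen characters "ababcbababaaaaa" the sketch produces the codes 1, 2, 4, 3, 5, 8, 1, 10, 10, and afterwards `Index[1]` is 4 and `List[4]` is 10, reflecting that "ab" and "aa" are siblings under "a".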
During the list retrieval process of the conventional method described above, three processes are carried out successively: the connection deciding process, which decides whether or not a corresponding list exists and whether or not a next element exists in the list; the match detecting process, which detects a candidate character which matches the input extension character; and the reading process, which sets the next pointer and reads information from the dictionary. However, when the successive processing of the list is carried out by software, it takes time to retrieve the partial strings and the speed of the coding process is on the order of several tens of kb/s. For this reason, there is a problem in that the coding process cannot be carried out in real time to suit the data transfer speed to a magnetic tape unit or a magnetic disk unit, because the data transfer speed is on the order of several hundred kb/s to several Mb/s.
On the other hand, if the data compression system is formed by use of independent elements for each step of the coding process described above, it becomes possible to carry out the coding process at a high speed but there are problems in that the scale of the circuit becomes large and the system becomes expensive.
For the sake of convenience, the conventional method was described above for the case where the character string to be coded is made up of three kinds of characters. However, in actual practice, the character string to be coded is made up of a large number of kinds of characters. Accordingly, during the normal dictionary retrieval process, the longest time is taken in finding the list which corresponds to a certain reference number and successively retrieving the "sibling" elements of that list until it is detected that no matching or connecting element exists.