1. Field of the Invention
The present invention relates to a character string retrieval apparatus and method for registering a plurality of character strings, such as strings of Chinese characters, in an array in advance, and judging whether or not a given character string is registered.
The present invention also relates to a character code registration retrieval apparatus and method in the field of key retrieval technology, and in particular to registering character strings to be retrieved using keys, such as Kanji codes, in a double-array structure, which is a data structure built from one-dimensional arrays.
2. Description of the Related Art
Recently, as computer networks, electronic mail, etc. have become widespread, the amount of electronic documents (digital documents) possessed by individuals has rapidly increased. For example, many people receive and process several hundred to one thousand electronic mails a day. It is not rare for 1 megabyte (MB) of document data to be stored in a day, and several hundred megabytes to one gigabyte (GB) in a year.
To handle such a large amount of data, it is necessary to reduce the required memory capacity and to speed up data transmission by removing redundancy from the data and compressing it. Data compression technology has become indispensable due to the recent trends described above, and universal encoding, for example, has been proposed for compressing a variety of data by one method.
However, when digitized document data in Japanese, Chinese, etc. are compressed in units of words, it is first necessary to judge at high speed whether or not a character string inputted from a document is a word registered in a dictionary in advance. Furthermore, since these languages have many words to be registered in a dictionary, the dictionary has to be organized in such a way that as little memory area as possible is wasted. In the well-known TRIE method, a plurality of words serving as keys are stored in a TRIE dictionary having a tree structure, and a word included in an input character string is retrieved by collating the character string with each node of the tree structure, character by character.
In the following description, the terminology of information theory is used as is. That is, data in one word unit are called a symbol or character, and an arbitrary number of connected data are called a string or character string. Furthermore, a sequence consisting of several leading symbols or characters in a code string or character string is called a prefix, and a sequence consisting of several ending symbols or characters is called a suffix. For example, the prefixes of a character string abc are ε (empty), a, ab and abc, and the suffixes are ε, c, bc and abc.
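The definitions of prefix and suffix can be checked with a minimal Python sketch. The function names `prefixes` and `suffixes` are illustrative only, and the empty string `''` stands for ε.

```python
def prefixes(s: str) -> list:
    """Return every prefix of s, shortest first ('' is the empty string)."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s: str) -> list:
    """Return every suffix of s, shortest first ('' is the empty string)."""
    return [s[i:] for i in range(len(s), -1, -1)]


print(prefixes("abc"))  # ['', 'a', 'ab', 'abc']
print(suffixes("abc"))  # ['', 'c', 'bc', 'abc']
```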
In the compression of language codes it is important to store a string, such as a word, etc. in a data structure with a memory capacity as small as possible, and develop an algorithm to retrieve the string at high speed. In particular, in the case of a dictionary storing words, key aggregates to be registered are known in advance, and the dictionary is often expanded by suitably adding keys later. Therefore, it is also important that keys can be easily added. Such a data structure is called a quasi-static data structure.
Aoe has proposed the double-array as a data structure for pattern-matching a plurality of keys at high speed (Junichi Aoe: “A High-speed Digital Retrieval Algorithm by Double-array”, in Proceedings of Papers D of The Electronics Information and Communications Institute, Vol. J71-D, No. 9, pp. 1592-1600, 1988).
FIG. 1A shows an example of a double-array. This double-array comprises two one-dimensional arrays, BASE and CHECK, and the data stored in these arrays correspond to the TRIE structure shown in FIG. 1B. The TRIE of FIG. 1B represents the five English words baby #, bachelor #, badger #, badge # and jar #, and the index of each node corresponds to a subscript of the BASE and CHECK arrays shown in FIG. 1A. A position where the registration values of BASE and CHECK are both 0 corresponds to an empty position where no node is registered yet.
This TRIE includes a repeat of the parental relation of nodes shown in FIG. 1C, and the index n of a parent node and the index m of a child node correspond to the subscripts of a BASE and a CHECK, respectively. In other words, this parental relation indicates a kind of state transition, and when a character a is inputted in the state of a parent node n, the transition from the state of a parent node n to the state of child node m is made.
When the index of a child node corresponding to the character a following the parent node n is retrieved using a double-array, first, as shown in FIG. 1D, a position corresponding to the subscript n on a BASE is referred to and the content d is obtained. This value d indicates a kind of origin shift amount (displacement amount) for the subscript of the CHECK.
Then, the subscript of a position shifted by the internal representation value of the character a, with the subscript d on the CHECK as a start point, is assumed to be m (=d+the internal representation value of character a). If the content of a position corresponding to the subscript m on the CHECK coincides with the index n of the parent node, the character a is stored below the node n, and it is found that the subscript of a corresponding child node is m. At this time, the index m of the child node is expressed as m=g(n,a) using a goto function g specifying a state transition for a key on a TRIE.
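The lookup of FIGS. 1C and 1D can be sketched in Python as follows. This is a minimal illustration and not the arrays of FIG. 1A: the toy `BASE` and `CHECK` contents, the root index 1, and the internal representation values returned by the hypothetical helper `code` (a=1, b=2, ...) are all assumptions made for the example.

```python
from typing import Optional


def code(ch: str) -> int:
    """Hypothetical internal representation value: 'a' -> 1, 'b' -> 2, ..."""
    return ord(ch) - ord("a") + 1


def goto(base: list, check: list, n: int, ch: str) -> Optional[int]:
    """Return m = g(n, ch), or None when ch is not stored below node n."""
    d = base[n]           # displacement registered for the parent node n
    m = d + code(ch)      # candidate subscript of the child node
    if 0 < m < len(check) and check[m] == n:
        return m          # CHECK[m] names n as the parent, so the move is valid
    return None


# Hand-built toy arrays (index 0 unused, root node at index 1).
# index: 0  1  2  3  4
BASE = [0, 1, 0, 3, 0]
CHECK = [0, 0, 1, 1, 3]

print(goto(BASE, CHECK, 1, "b"))  # 3
print(goto(BASE, CHECK, 3, "a"))  # 4
print(goto(BASE, CHECK, 1, "c"))  # None: CHECK[4] is 3, not 1
```

Each transition costs one read of BASE, one addition and one read of CHECK, whatever the number of sibling nodes.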
Generally speaking, one or more child nodes follow one parent node, and in a normal TRIE structure the retrieval speed of a child node decreases as the number of sibling nodes following the same parent node increases. In the double-array TRIE structure, on the other hand, high-speed retrieval is available regardless of the number of sibling nodes.
However, the conventional character string retrieval described above has the following problems.
When a double-array is used for a Kanji dictionary of Japanese, Chinese, etc., the number of child nodes following one parent node tends to increase compared with an alphabetical dictionary of English, etc. due to the variety of Kanji idioms.
FIG. 1E shows a case where five Kanji idioms starting with the Kanji “電” (electricity), that is, “電圧” (voltage), “電気” (electricity), “電車” (electric train), “電脳” (computer) and “電話” (telephone), are registered in a double-array. In this case, a Kanji code value corresponds to each of the characters following “電”, that is, “圧” (pressure), “気” (atmosphere), “車” (train), “脳” (brain) and “話” (speech), and their relative positional relation on the CHECK is kept constant according to the internal representation values. On the other hand, positions marked with O on the CHECK are already occupied by other Kanji characters, and the Kanji following “電” cannot necessarily all be matched to empty positions simultaneously.
Therefore, in order to register these Kanji characters on the CHECK with the relative positional relation maintained, as shown in FIG. 1F, it is necessary to expand both the BASE and CHECK arrays. In this case, the minimum displacement amount (parallel shift amount) d which can accommodate all these Kanji characters is calculated, and this value d is written at the position of the code value n of “電” on the BASE. Here, the values obtained by adding the internal representation value of each of the Kanji characters following “電” to this displacement amount d are designated as new subscripts of the array, p, q, r, s and t. Then, the index n of the parent node of “電” is written at the positions p, q, r, s and t on the CHECK.
FIG. 1G shows this TRIE tree structure. In FIG. 1G, “電” is registered below the root node, and “圧”, “気”, “車”, “脳” and “話” are registered below the node n, corresponding to nodes p, q, r, s and t, respectively. Here, n=g(root, 電), p=g(n, 圧), q=g(n, 気), r=g(n, 車), s=g(n, 脳) and t=g(n, 話).
Here, the problem is that unlike in the case of letters of the alphabet, in the case of Kanji characters a lot of characters follow one character, and if these characters are registered in the CHECK with the relative positional relation maintained, an array often has to be expanded. If the array is expanded, spaces between characters already registered are left unoccupied and empty. If this expansion of the array is repeated, the number of such spaces remarkably increases. Therefore, it is very difficult to store a lot of Kanji idioms in a small memory capacity.
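The registration step and the array expansion it causes can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names `find_displacement` and `register_children`, the small integer code values 2, 3 and 5, and the occupancy pattern of the toy CHECK array are all hypothetical, and 0 marks an empty position.

```python
def find_displacement(check: list, codes: list) -> int:
    """Smallest d >= 1 such that CHECK[d + c] is empty for every code c.

    The array is extended with empty positions whenever d + c runs past its end.
    """
    d = 1
    while True:
        need = d + max(codes) + 1
        if need > len(check):
            check.extend([0] * (need - len(check)))
        if all(check[d + c] == 0 for c in codes):
            return d
        d += 1


def register_children(base: list, check: list, n: int, codes: list) -> list:
    """Register all characters following node n with one shared displacement."""
    d = find_displacement(check, codes)
    if len(base) < len(check):
        base.extend([0] * (len(check) - len(base)))
    base[n] = d
    for c in codes:
        check[d + c] = n  # each child records its parent n
    return [d + c for c in codes]


# Toy arrays: positions 1, 3, 4, 6 and 8 are already occupied.
check = [0, 1, 0, 9, 9, 0, 9, 0, 9, 0, 0, 0]
base = [0] * len(check)
children = register_children(base, check, 1, [2, 3, 5])

print(base[1])     # 7
print(children)    # [9, 10, 12]
print(len(check))  # 13: the array grew although positions 2, 5, 7, 11 stay empty
```

In this toy run the three sibling codes fit only beyond the original end of the array, while four empty positions inside it remain unused, which is the sparseness problem described above.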
It is an object of the present invention to provide a string retrieval apparatus and method for reducing the number of idle spaces of a dictionary without losing the high-speed of retrieval, and retrieving a string using a compressed dictionary.
It is another object of the present invention to provide a character code registration retrieval apparatus and method for registering a lot of character codes with the expansion of an array suppressed as much as possible by proposing a new data structure obtained by further developing a double-array structure being a conventional high-speed low-capacity dictionary data structure, and introducing a new data structure different from the conventional double-array regarding frequently-appearing character codes.
In the first aspect of the present invention, the string retrieval apparatus comprises a first array unit, a second array unit, a third array unit and a retrieval unit, and retrieves a given string out of registration strings.
The first array unit registers number information corresponding to a prefix, at the position of a subscript which is identical to the index of the prefix followed by a plurality of characters. The second array unit registers a displacement amount corresponding to each of a plurality of groups obtained by classifying the plurality of characters following the prefix, at the position based both on a subscript identical to the number information corresponding to the prefix and on another subscript relating to a character code. The third array unit registers the index of the prefix, at the position of a subscript identical to the sum of the displacement amount and the internal representation value of a character following the prefix. The retrieval unit retrieves a given string using the first, second and third array units.
By adopting such a string retrieval apparatus, characters following a prefix are classified into a plurality of groups, and a displacement amount is assigned to each group. Since the number of characters included in each group is less than the total number of characters following the prefix, the empty positions in the array unit can be easily utilized as compared with a case where all the characters are registered at one time. Thus, characters can be registered with a smaller displacement amount, and the expansion of both the first and the third array unit can be suppressed with the high-speed of retrieval maintained.
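The effect of grouping can be illustrated with a minimal Python sketch of the three arrays. Several points are assumptions of the sketch rather than statements of the text: the names `base1`, `base2`, `group_of` and `first_fit`, the classification by the lowest bit of the code value, and the restriction of displacements to multiples of the group count, which is added here so that a code from one group can never land on a position registered by another group of the same prefix.

```python
from typing import Optional

N_GROUPS = 2  # assumption: classify by the lowest bit of the code value


def group_of(c: int) -> int:
    return c % N_GROUPS


def first_fit(check: list, codes: list, step: int = 1) -> int:
    """Smallest multiple d of step with CHECK[d + c] empty for all codes."""
    d = step
    while True:
        need = d + max(codes) + 1
        if need > len(check):
            check.extend([0] * (need - len(check)))
        if all(check[d + c] == 0 for c in codes):
            return d
        d += step


def register_grouped(base1, base2, check, n, codes):
    """Give each group of characters following prefix n its own displacement."""
    groups = {}
    for c in codes:
        groups.setdefault(group_of(c), []).append(c)
    base2.append([0] * N_GROUPS)
    base1[n] = len(base2) - 1          # number information of prefix n
    for g, members in groups.items():
        d = first_fit(check, members, step=N_GROUPS)
        base2[base1[n]][g] = d         # displacement of this group
        for c in members:
            check[d + c] = n           # third array: index of the prefix


def goto_grouped(base1, base2, check, n, c) -> Optional[int]:
    d = base2[base1[n]][group_of(c)]
    m = d + c
    return m if d > 0 and m < len(check) and check[m] == n else None


occupied = [0, 1, 0, 9, 9, 0, 9, 0, 9, 0, 0, 0]   # toy CHECK, 0 = empty
codes = [2, 3, 5]

plain = list(occupied)
plain_d = first_fit(plain, codes)                  # one displacement for all
print(plain_d, len(plain))                         # 7 13

check, base1, base2 = list(occupied), [0] * 12, [[0] * N_GROUPS]
register_grouped(base1, base2, check, 1, codes)
print(base2[1], len(check))                        # [8, 2] 12
print([goto_grouped(base1, base2, check, 1, c) for c in codes])  # [10, 5, 7]
```

With the same toy occupancy, a single shared displacement forces the array to grow, whereas the two groups fit into positions that were already available.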
In the second aspect of the present invention, the string retrieval apparatus comprises a register unit and a retrieval unit, and retrieves a given string out of the registration strings. The register unit classifies and registers a plurality of characters following a prefix, and the retrieval unit retrieves the given string using the register unit.
By adopting such a string retrieval apparatus, like the first aspect, empty areas in the register unit can be efficiently utilized, and the data structure of the registered string can be compressed with the high-speed of retrieval maintained.
An apparatus in the third aspect of the present invention is a character code registration retrieval apparatus for registering character code strings to be retrieved using keys in a double-array structure, which is a data structure built from one-dimensional arrays, and for retrieving a string. The apparatus comprises a parallel shift amount calculator unit for calculating a parallel shift amount needed to register the characters of each string to be retrieved using keys; a first array unit having the index of a prefix of each character string to be retrieved using keys as a subscript; an identifying unit for judging a registration value in the first array unit; a second array unit registering information on a specific character following the prefix of a string indicated in the first array unit; a key candidate point calculator unit for calculating the sum of the parallel shift amount registered in the first and second array units and an internal representation value corresponding to a character following the prefix of the string; and a third array unit registering the index of the prefix of the string, with the sum obtained by the key candidate point calculator unit as a subscript.
This apparatus introduces a new data structure obtained by further developing the double-array structure, a conventional high-speed, low-capacity dictionary data structure built from one-dimensional arrays. The new data structure has a first array with the index of the prefix of each string to be retrieved using keys as a subscript; a second array registering information on specific characters following the prefix of the string shown in the first array; and a third array registering the index of the prefix of the string, using as a subscript the sum of a parallel shift amount registered in the first and second arrays, which is calculated by the parallel shift amount calculator unit as the amount needed to register the characters of each string to be retrieved using keys, and an internal representation value corresponding to a character following the prefix of the string. With this structure, the apparatus can provide each character code with a registration position in such a way that character codes may overlap with each other on the CHECK array corresponding to the third array. As a result, all the character codes serving as keys can be registered in empty spaces on the CHECK array at one time, with the relative positional relation between the character codes following a certain character code maintained and with the expansion of the CHECK array suppressed as much as possible, and the occurrence of idle spaces (sparse areas) can be reduced to the lowest possible level. Thus, a dictionary storing a quasi-static key aggregate, that is, an aggregate of predetermined keys, can be generated as a retrieval target, and the memory capacity of the TRIE array structure, which can be expanded by adding and registering keys later as appropriate, can be minimized.
An apparatus in the fourth aspect of the present invention is the character code registration retrieval apparatus of the third aspect, and comprises: a list unit for generating a list of character codes frequently used in idioms and outputting a character code selected from the list; a frequently-appearing character code selector unit for outputting a frequency threshold indicating down to what frequency rank character codes should be selected; a frequently-appearing character code storage unit for storing a frequently-appearing character code selected from the list unit and outputting the selected frequently-appearing character code and its index; a dictionary unit, being a character code dictionary registering idioms composed of character codes, for classifying the processing according to whether or not a focused character is the prefix of idioms, based on the frequently-appearing character, and outputting each of the groups obtained by classifying the characters following the frequently-appearing character of the prefix; a group storage unit for storing each of the groups inputted from the dictionary unit; a first BASE array unit, as the first array unit, for calculating number information of the frequently-appearing character and storing the number information at the position of the index of its internal representation value on the first BASE array; a code classification unit for classifying the characters following the frequently-appearing character of the prefix, by classifying the second character of the idiom using several bits of the second character code; a parallel shift amount calculator unit for calculating the minimum parallel shift amount such that every value obtained by adding the parallel shift amount to the internal representation value of each character in a group indicates an empty position on a CHECK array; a parallel shift amount storage unit for storing the parallel shift amount inputted from the parallel shift amount calculator unit and outputting the parallel shift amount to a second BASE array unit; a key candidate point calculator unit for registering the index of the prefix being the parent of the characters at the position of the subscript in the CHECK array that is identical to the sum of the internal representation value of each character of the group and the parallel shift amount, and designating the value of the sum as the index of the next prefix consisting of (prefix+current character); the second BASE array unit, as the second array unit, for storing the parallel shift amount for each group outputted by the parallel shift amount storage unit, based on both the code value inputted from the code classification unit and the number information inputted from the list unit; and a CHECK array unit, as the third array unit, for registering the index of the prefix at the position corresponding to the value of the sum.
In such a character code registration retrieval apparatus, each character code can be provided with a registration position in such a way that character codes may overlap with each other on a CHECK array. This is achieved by introducing a new data structure obtained by further developing the double-array structure, a conventional high-speed, low-capacity dictionary data structure built from one-dimensional arrays. The new data structure has a CHECK array unit, as the third array unit, for registering the index of a prefix at the subscript corresponding to the sum of a parallel shift amount and the internal representation value of a character code; a first BASE array unit for calculating the number information of a selected character and storing the number information at the position of the index of the character on the first BASE array; and a second BASE array for storing the parallel shift amount of each group inputted from a parallel shift amount storage unit, based on both the code value outputted from a code classification unit and the number information outputted by a list unit. Two kinds of values are registered in the first BASE array: a conventional parallel shift amount for characters that are not frequently used, and one of the subscripts of the second BASE array for frequently-appearing characters. The subscripts of the second BASE array are further classified into three groups according to the code values of the characters following the frequently-appearing character code, and each group is provided with a unique parallel shift amount.
As a result, all the character codes serving as keys can be registered in empty spaces on the CHECK array at one time, with the relative positional relation between the characters following a certain character maintained and with the expansion of the CHECK array suppressed as much as possible, and the occurrence of idle spaces can be reduced to the lowest possible level. Thus, a dictionary storing a quasi-static key aggregate, that is, an aggregate of predetermined keys, can be generated as a retrieval target, and the memory capacity of the TRIE array structure, which can be expanded by adding and registering keys later as appropriate, can be minimized.
An apparatus in the fifth aspect of the present invention is the character code registration retrieval apparatus of the third aspect, and comprises: a document input unit for first designating the root of a TRIE structure as a prefix and simultaneously setting an end mark in the prefix, then instructing the input of a character code of a character to be retrieved and detecting the prefix of the input character code; a first BASE array unit for outputting a numeric value from the place corresponding to the index of the prefix or the character code; a registration value judgement unit for judging whether the numeric value inputted from the first BASE array unit is the number information of the prefix character or a parallel shift amount, outputting the numeric value as the number information of the prefix character code when the numeric value is outside the range of indexes composing the TRIE, and outputting the numeric value as a parallel shift amount when the numeric value is within the range of the indexes; a code classification unit for classifying the input character code using several bits of the character code when the numeric value inputted from the first BASE array unit is the number information of a frequently-appearing prefix character code; a second BASE array unit for outputting a parallel shift amount from the place corresponding to both the number information of the prefix outputted from the registration value judgement unit and the classification of the character code; a parallel shift amount storage unit for storing the parallel shift amount when the numeric value inputted from the first BASE array unit is a parallel shift amount; a key candidate point calculator unit for calculating the sum of the parallel shift amount and the internal representation value of the input character; a CHECK array unit for outputting a key from the place corresponding to the sum calculated by the key candidate point calculator unit; and a key/prefix collation unit for judging whether or not the key inputted from the CHECK array unit coincides with the index of the prefix character code or the index of the prefix, and, when it does, judging that the idiom is registered in the dictionary.
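The retrieval flow of the fifth aspect can be sketched in Python as follows. The arrays are hand-built toy data holding the two keys ab and ac, and several conventions are assumptions of the sketch only: the `CODE` table of internal representation values, the end mark `#`, the classification by the lowest bit of the code, and the use of a negative value in `BASE1` as the out-of-range value that marks number information.

```python
from typing import Optional

CODE = {"a": 1, "b": 2, "c": 3, "#": 4}  # hypothetical internal values
ROOT = 1
N_GROUPS = 2                             # assumption: classify by lowest bit

# Toy dictionary holding "ab#" and "ac#" (index 0 unused).
# index: 0  1   2  3  4  5  6  7  8
BASE1 = [0, 1, -1, 0, 1, 0, 0, 4, 0]     # -1: number information 1
BASE2 = [[0, 0], [2, 4]]                 # row 1: shift amounts of two groups
CHECK = [0, 0, 1, 0, 2, 4, 0, 2, 7]


def shift_amount(n: int, c: int) -> int:
    """Registration value judgement followed by the optional BASE2 lookup."""
    v = BASE1[n]
    if v < 0:                            # outside the index range: number information
        return BASE2[-v][c % N_GROUPS]
    return v                             # within the range: a plain shift amount


def step(n: int, c: int) -> Optional[int]:
    d = shift_amount(n, c)
    m = d + c                            # key candidate point
    if d > 0 and m < len(CHECK) and CHECK[m] == n:
        return m                         # key/prefix collation succeeded
    return None


def is_registered(word: str) -> bool:
    n = ROOT
    for ch in word + "#":
        c = CODE.get(ch)
        n = step(n, c) if c is not None else None
        if n is None:
            return False
    return True


print(is_registered("ab"), is_registered("ac"))  # True True
print(is_registered("a"), is_registered("abc"))  # False False
```

Only the frequently-appearing prefix (node 2 here) pays for the extra read of the second BASE array; every other node is resolved exactly as in a conventional double-array.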
In such a character code registration retrieval apparatus, a dictionary storing a quasi-static key aggregate, that is, an aggregate of predetermined keys, can be generated as a retrieval target, and the memory capacity of the TRIE array structure, which can be expanded by adding and registering keys later as appropriate, can be minimized. This is achieved by introducing a new data structure obtained by further developing the double-array structure, a conventional high-speed, low-capacity dictionary data structure built from one-dimensional arrays. The new data structure has a CHECK array unit for outputting a key from the place corresponding to the sum inputted from the key candidate point calculator unit; a first BASE array for outputting a numeric value from the place corresponding to the index of a prefix or character code; and a second BASE array for outputting a parallel shift amount from the place corresponding to both the number information of the prefix character code outputted from the registration value judgement unit and the classification of the character code. As a result, high-speed pattern matching can be implemented by storing data in a double-array structure (that is, a TRIE array structure) built from one-dimensional arrays, with the memory capacity reduced to the lowest possible level, and using this TRIE array structure for key retrieval.