1. Field of the Invention
The present invention relates to an information retrieval technology for retrieving variable-length character string data. More particularly, the present invention relates to a technology for enhancing the efficiency of retrieving the longest prefix match or longest suffix match of a variable-length character string.
2. Description of the Related Art
First, a general outline will be given of the retrieval of the longest prefix match, which is the primary application target of the present invention. In the retrieval of the longest prefix match, the retrieval result is the longest pattern in a pattern list among those patterns that match the leading characters of a retrieval key (the character string to be retrieved). For example, when three patterns “ABCD”, “ABCDEFGH” and “ABCDE” match the leading characters of a retrieval key (e.g., “ABCDEFGHIJ”), the longest matching pattern “ABCDEFGH” is outputted as the retrieval result. In this case, when the retrieval key is longer than a pattern, it does not matter what characters the retrieval key has in the part extending beyond the pattern. On the other hand, a pattern that does not match the leading characters of the retrieval key, such as a pattern “BCDE”, does not meet the condition of prefix matching, even if the pattern is a partial character string of the retrieval key.
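The definition above can be illustrated with a minimal sketch. It is a naive linear scan written only to make the matching condition concrete; it is not the method of the present invention, and the function name longest_prefix_match is a hypothetical name chosen for illustration.

```python
def longest_prefix_match(patterns, key):
    """Return the longest pattern matching the leading characters of key, or None."""
    best = None
    for pattern in patterns:
        # A pattern qualifies only if the key begins with the whole pattern.
        if key.startswith(pattern) and (best is None or len(pattern) > len(best)):
            best = pattern
    return best


# Patterns and retrieval key taken from the example in the text.
patterns = ["ABCD", "ABCDEFGH", "ABCDE", "BCDE"]
result = longest_prefix_match(patterns, "ABCDEFGHIJ")
print(result)  # ABCDEFGH
```

The pattern “BCDE” is a partial character string of the key but is never selected, because the key does not begin with it. The trailing “IJ” of the key plays no part in the comparison.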
In a system performing retrieval of information concerning variable-length character string data, in particular, retrieval of a prefix match or suffix match of a variable-length character string, a technology as described below is conventionally known as a method for fast retrieval of a pattern that matches a retrieval key among a large number of patterns.
Japanese Unexamined Patent Application Publication No. H04-209069, for example, describes prior art concerning the retrieval of a prefix match, in which index creating means and data retrieving means are provided. The index creating means creates an index table based on the first n characters (n: natural number) of character string data. The data retrieving means searches the index table to extract character strings whose prefixes match a retrieval condition. When the character string designated as the retrieval condition is longer than the character strings of the index data, the data retrieving means compares each character in the remaining parts of the extracted character string data with the retrieval condition, thereby retrieving a character string that matches the retrieval condition.
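The two means described above can be sketched roughly as follows. This is only an illustrative reading of the description, not the implementation disclosed in the cited publication; the names build_index and search, and the use of a dictionary as the index table, are assumptions.

```python
from collections import defaultdict


def build_index(strings, n):
    """Index creating step (assumed form): group strings by their first n characters."""
    index = defaultdict(list)
    for s in strings:
        index[s[:n]].append(s)
    return index


def search(index, n, condition):
    """Data retrieving step (assumed form): return strings whose prefix equals condition."""
    if len(condition) <= n:
        # The condition fits inside the index key, so the index alone decides.
        return [s for k, bucket in index.items()
                if k.startswith(condition) for s in bucket]
    # The condition is longer than the index key: narrow by the index first,
    # then compare the remaining characters of each extracted string.
    bucket = index.get(condition[:n], [])
    return [s for s in bucket if s[n:].startswith(condition[n:])]


# Hypothetical data, chosen only for illustration.
index = build_index(["ABCDE", "ABCDEFGH", "ABCXY", "BCDE"], 3)
short_hits = search(index, 3, "AB")
long_hits = search(index, 3, "ABCDEF")
print(short_hits)  # ['ABCDE', 'ABCDEFGH', 'ABCXY']
print(long_hits)   # ['ABCDEFGH']
```

In the second query, every string in the bucket for “ABC” must be compared character by character beyond the index, so the cost of this step grows with the size of the bucket.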
Such prior art has a disadvantage, which will be described with reference to FIG. 1. In the prior art, the index data are created from the first n characters of the character string data. In the case of searching a list where only patterns 1301 to 1304 are registered, the five leading characters (to the left of a separation 1351) of each of the patterns 1301 to 1304 are set as an index because the four leading characters thereof are common. Thus, the registered patterns can be efficiently narrowed based on the indexes. Similarly, in the case of searching a list consisting only of patterns 1305 to 1310, the ten leading characters (to the left of a separation 1352) of each pattern are set as an index because the patterns 1305 to 1310 have nine common leading characters.
A problem arises, however, in the case of a search target including both the patterns 1301 to 1304 and the patterns 1305 to 1310. Specifically, to efficiently narrow the patterns 1301 to 1304, it is desirable to set the five leading characters as an index, in which case, however, the patterns 1305 to 1310 will all have the same index. Therefore, if a character string starting with “PQRSPQRSP” is inputted as a retrieval key, the narrowing of the patterns 1305 to 1310 based on the indexes is insufficient, which increases the cost of the suffix-part comparison to be performed thereafter and lowers the retrieval efficiency. As described above, the prior art has a problem that, when the overlapping parts of the registration patterns vary in length, the registration patterns are not narrowed sufficiently based on the indexes, so that the cost of comparing the remaining parts of the character strings becomes large.
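The imbalance described above can be made concrete with a small sketch. The pattern strings below are hypothetical stand-ins, since the actual contents of the patterns 1301 to 1310 in FIG. 1 are not reproduced here; only the common leading characters “PQRSPQRSP” and the lengths of the common parts (four and nine characters) are taken from the text.

```python
from collections import Counter

# Hypothetical stand-ins for patterns 1301 to 1304: four common leading characters.
group_a = ["ABCDE1", "ABCDF2", "ABCDG3", "ABCDH4"]
# Hypothetical stand-ins for patterns 1305 to 1310: nine common leading characters.
group_b = ["PQRSPQRSP" + c + "xx" for c in "TUVWXY"]


def bucket_sizes(patterns, n):
    """Count how many patterns share each n-character index."""
    return Counter(p[:n] for p in patterns)


sizes = bucket_sizes(group_a + group_b, 5)
for key, count in sizes.items():
    print(key, count)
```

With a five-character index, each stand-in for the patterns 1301 to 1304 falls into its own bucket, while all six stand-ins for the patterns 1305 to 1310 share the single index “PQRSP” and must be distinguished by comparing their remaining parts.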