The present invention relates to an information search technique, and more specifically, it relates to a search technique that executes multiple character string analyses in parallel.
Along with recent developments in high-speed, large-capacity communication infrastructures, such as those related to computers and the Internet, an enormous amount of information is generated and registered in a form accessible through a network. Accordingly, there is an increasing demand for a search system that allows a user not only to access such information through the network but also to search it for target information, including a document, an image, a music file, and the like.
In most search systems, target information is divided into unit segments (hereinafter referred to as “tokens”) such as characters, words, or sentences and then indexed. Further, an input search word or search string is also divided into predetermined unit segments (hereinafter referred to as “search tokens”) such as characters, words, or sentences. Whether to extract targeted information as a search result is determined based on whether tokens registered for the targeted information match the search tokens. For this purpose, it is necessary to generate tokens from a character string. Up to now, the token generating processing has been performed mainly using the following two methods.
The first is a morphological analysis method, in which a character string is segmented into unit words having a significant meaning, and the segmented words are registered as tokens. The second is a so-called N-gram method, which divides a character string into groups of N characters such that adjacent groups overlap, and registers the N-character groups as tokens.
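The overlapping division performed by the N-gram method can be sketched as follows. This is a minimal illustration only: the function name `ngram_tokens`, the default N = 2, and the use of romanized units in place of individual characters are assumptions made for the example and are not taken from any cited document.

```python
def ngram_tokens(units, n=2):
    """Return (position, token) pairs for every overlapping group of n units."""
    # Each group starts one unit after the previous one, so neighbours overlap
    # by n - 1 units. A string shorter than n yields no tokens.
    return [(i, tuple(units[i:i + n])) for i in range(len(units) - n + 1)]


# Hypothetical input: romanized units standing in for the original characters.
units = ["to", "kyo", "to", "cho"]
for position, token in ngram_tokens(units):
    print(position, token)
```

Because every group is produced mechanically from positions alone, no dictionary is consulted at any point.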
According to the morphological analysis method, tokens are segmented or generated in units of words having a significant meaning by use of a dictionary. Therefore, the morphological analysis method enables high-quality search that takes the conjugation of each word into account with reference to the dictionary. On the other hand, the morphological analysis method is disadvantageous in that (i) any word not listed in the dictionary cannot be segmented, (ii) if erroneous word segmentation is carried out, even information including exactly the same word as in a character string cannot be extracted as a search result, and (iii) maintenance of the dictionary is required.
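Disadvantage (i) can be illustrated with a deliberately simplified stand-in for morphological analysis. The sketch below uses greedy longest-match lookup against a small word set. Practical morphological analyzers are considerably more elaborate, and the function `segment`, the dictionary contents, and the romanized units are all hypothetical.

```python
def segment(units, dictionary):
    """Greedy longest-match segmentation against a set of known words."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens, i = [], 0
    while i < len(units):
        # Try the longest dictionary word that could start at position i.
        for length in range(min(max_len, len(units) - i), 0, -1):
            candidate = tuple(units[i:i + length])
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            # No listed word starts here: emit the single unit as a fallback.
            tokens.append((units[i],))
            i += 1
    return tokens


# Hypothetical dictionary that lacks the word ("to", "cho").
dictionary = {("to", "kyo", "to"), ("cho",), ("ji", "kan")}
print(segment(["to", "kyo", "to", "cho"], dictionary))
print(segment(["to", "cho"], dictionary))
```

In this toy setting the unlisted word is never produced as a single token, so a search for it cannot match a token of the same word in the index unless the dictionary is updated.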
In contrast, the N-gram method generates tokens by segmenting a character string in a mechanical manner. Therefore, the N-gram method can extract, as a search result, information including a completely matched character string. On the other hand, the N-gram method is disadvantageous in that (i) noise is easily generated if a character string partially matches a search token, for example, if a word [“to” “kyo” “to”] is determined to match a search token [“kyo” “to”], and (ii) this method cannot cover variations of a word, such as synonyms or the conjugation of a word registered as a token.
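Both disadvantages can be reproduced with a small sketch, assuming N = 2 and romanized units in place of characters. The helper names `ngrams` and `ngram_match` are hypothetical, and the matching rule (every search token must be indexed) is one simple choice among several possible ones.

```python
def ngrams(units, n=2):
    """Return all overlapping groups of n units."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def ngram_match(doc_units, query_units, n=2):
    """True if every N-gram of the query is also an N-gram of the document."""
    indexed = set(ngrams(doc_units, n))
    query_tokens = ngrams(query_units, n)
    # A query shorter than n produces no tokens and is treated as no match.
    return bool(query_tokens) and all(tok in indexed for tok in query_tokens)


# (i) Noise: the word "to-kyo-to" is reported as matching the query "kyo-to".
print(ngram_match(["to", "kyo", "to"], ["kyo", "to"]))

# (ii) A variant spelled with different units is not matched.
print(ngram_match(["tori", "atsuka", "i"], ["to", "ri", "atsuka", "i"]))
```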
FIG. 12 (Prior Art) shows processing of referencing search information based on the conventional morphological analysis method and N-gram method. FIG. 12(a) shows referencing processing based on the morphological analysis method, and FIG. 12(b) shows referencing processing based on the N-gram method. It is assumed that a user operates a client computer to enter search words [“to” “cho”] and [“to” “ri” “atsuka” “i” “ji” “kan”] and sends a search request to a search engine through a network. The search engine includes, for example, a relational database, and queries the information managed by the relational database to search for the received search words using an SQL statement or the like.
It is assumed here that document data in the search-target information in this example includes metadata, a title, or headline information like [“to” “kyo” “to” “cho” “no” “go” “an” “nai” “tori” “atsuka” “i” “ji” “kan”]. The morphological analysis method segments a character string of the information into tokens using a dictionary, associates tokens different in notation such as synonymous words, conjugational words, or words having different declensional kana endings, with a corresponding token in the document data, and registers tokens inclusive of notational variations in an index list together with their positions or token numbers.
In the conventional example of FIG. 12(a), document data in the information is segmented into tokens of [“to” “kyo” “to”], [“cho”], [“no”], [“go”], [“an” “nai”], [“tori” “atsuka” “i”], and [“ji” “kan”]. As for the token “toriatsuka-i”, a token [“to” “ri” “atsuka” “i”] that is different in declensional kana ending is indexed in association with the original token [“tori” “atsuka” “i”]. Under the above condition, if the referencing processing of FIG. 12(a) is performed, the search token [“to” “cho”] is not registered in the index list of the information, so the search engine reports a mishit. On the other hand, as for the search tokens [“to” “ri” “atsuka” “i”] and [“ji” “kan”], a corresponding token is registered in the index list, so the search engine reports hits.
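The referencing of FIG. 12(a) can be approximated by the following sketch. The segmentation result and the single notational variant are copied from the example above, whereas the data layout (a mapping from token to token numbers) and the names `build_index` and `lookup` are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_index(tokens, variants):
    """Register each token and its notational variants with the token number."""
    index = defaultdict(list)
    for number, token in enumerate(tokens):
        index[token].append(number)
        for alternative in variants.get(token, ()):
            index[alternative].append(number)
    return index


def lookup(index, search_token):
    """Return the token numbers registered for a search token (empty = mishit)."""
    return index.get(search_token, [])


doc_tokens = [("to", "kyo", "to"), ("cho",), ("no",), ("go",),
              ("an", "nai"), ("tori", "atsuka", "i"), ("ji", "kan")]
variants = {("tori", "atsuka", "i"): [("to", "ri", "atsuka", "i")]}
index = build_index(doc_tokens, variants)

for search_token in [("to", "cho"), ("to", "ri", "atsuka", "i"), ("ji", "kan")]:
    print(search_token, lookup(index, search_token))
```

The variant token resolves to the same token number as the original token, while the search token that was never produced by segmentation returns an empty list.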
In the illustrated example of FIG. 12(a), if the search word [“to” “cho”] is not found, the search result that is sent back depends on how the search system is implemented: either “mishit” is sent back as the search result, or a reliability (probability) is assigned and a search result that ranks the targeted information behind the other information is sent back.
On the other hand, in the illustrated example of FIG. 12(b), the referencing processing is performed based on the N-gram method. As for the search words [“to” “cho”] and [“to” “ri” “atsuka” “i” “ji” “kan”], tokens [“to” “cho”], [“atsuka” “i”], [“i” “ji”], and [“ji” “kan”] are hit. These tokens are indexed in relation to the information through the N-gram method. However, as for the search tokens [“to” “ri”] and [“ri” “atsuka”] derived from the search words, no corresponding tokens are indexed, so the search system sends back “mishit”. In this case as well, the search result that is sent back depends on how the search system is implemented: either “mishit” is sent back as the total search result, or a reliability (probability) is assigned and a search result that ranks the targeted information behind unintended information is sent back to the client computer.
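The hits and mishits of FIG. 12(b) can be reproduced with the following sketch, assuming N = 2 and treating each romanized unit as one character. The function names `bigrams` and `classify` are hypothetical; the document string and the search words are those of the example above.

```python
def bigrams(units):
    """Return all overlapping two-unit tokens of a unit sequence."""
    return [tuple(units[i:i + 2]) for i in range(len(units) - 1)]


def classify(search_words, indexed):
    """Split the search tokens of all search words into hits and mishits."""
    hits, misses = [], []
    for word in search_words:
        for token in bigrams(word):
            (hits if token in indexed else misses).append(token)
    return hits, misses


doc_units = ["to", "kyo", "to", "cho", "no", "go", "an", "nai",
             "tori", "atsuka", "i", "ji", "kan"]
indexed = set(bigrams(doc_units))

search_words = [["to", "cho"], ["to", "ri", "atsuka", "i", "ji", "kan"]]
hits, misses = classify(search_words, indexed)
print("hits:  ", hits)
print("misses:", misses)
```

How the two mishits affect the final result (a total mishit or a lowered rank) is left open here, since that depends on the implementation of the search system.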
These failures are mainly due to insufficient maintenance of the dictionary in the case of FIG. 12(a), and to the N-gram method not taking conjugational or notational variations into consideration in the case of FIG. 12(b).
Document search techniques using the morphological analysis method and the N-gram method are disclosed in, for example, Japanese Unexamined Patent Application Publication Nos. 2001-34623 (Patent Document 1), 2006-99427 (Patent Document 2), and 2006-106907 (Patent Document 3).
Patent Document 1 discloses an information search technique that segments a search-target text into unit words, generates a word-information-added character string index having N characters and including word information representing the separation between words, and searches for a search word through the word-information-added character string index based on one or both of character string search and word search. Patent Document 1 further discloses a technique of recording information about the boundary between morpheme words in association with information obtained through the N-gram method, thereby improving the accuracy of ranking. However, this technique cannot be directly applied to a search operation that reflects conjugational or notational variations that feature the morpheme in terms of using the word on the word boundary.
Further, Patent Document 2 discloses a full-text search technique including approximation degree determination means for determining the degree of approximation between hit counts upon primary search with an N-gram index and hit counts upon morpheme search with a morpheme index, and full-text search control means for controlling, if the approximation degree determination means determines the hit counts upon primary search with an N-gram index to approximate to the hit counts upon morpheme search with a morpheme index, first search means so as to skip secondary search with the N-gram index to use a result of the primary search or a result of the morpheme search as a search result.
Further, Patent Document 3 discloses a structured document management technique including index type determination means for determining an index type appropriate for each of a plurality of elements to be indexed in a structured document on an element basis, and index generating means for generating an index of the determined index type corresponding to the element and storing the index in index storage means.
In Patent Documents 1 to 3, one of the two methods, the morphological analysis method and the N-gram method, is chosen to search the information indexed in different manners using different character string analysis methods. Alternatively, the indexes generated using the different character string analysis methods are searched independently, and the search results are combined so that the final result includes a result of the morphological analysis method and a result of the N-gram method.
However, in order to combine the results, it is necessary to perform complicated search processing and to prepare both a search engine for the morphological analysis method and a search engine for the N-gram method. This processing is costly. Further, even if a search result of the morphological analysis method and a search result of the N-gram method are generated and combined, each result carries the advantages and disadvantages of its character string analysis method. Therefore, even if search operations are simply performed independently and the search results are combined, as in the conventional examples of FIG. 12, the search result reflects the problems of each method, so satisfactory accuracy and search quality cannot be realized. Here, “search quality” means such a quality that omission is minimized, noise is sufficiently removed, and a search result highly faithful to a search word (string) input by a user can be obtained.
Further, in the case of independently performing the search processing based on the two methods, two types of search engines must be prepared, as many search operations as the number of character string analysis methods must be performed, the search results still carry the advantages and disadvantages inherent in the individual character string analysis methods, and the search quality cannot be improved merely by combining the search results. In view of the above drawbacks, such search processing is not preferable in terms of human and hardware resources, quality, and cost.