Systems have been realized that search large document sets to extract and present documents containing a character string entered as a query. In general, an information search apparatus performs processing (called lookup) for identifying, from among document information stored in a word index DB of the information search apparatus, the documents that include the character string entered as a query, and performs processing (called ranking) for calculating the degree of similarity (also called score) between the character string and each document including the character string. Then, the information search apparatus displays the documents (or their document IDs) in descending order of the degree of similarity as search results. In the present specification, a “document” is a set of sentences treated as one unit, and a “sentence” is a unit of character string delimited by periods. For example, a document file can be considered as a list of sentences. In the following, an example of a conventional technique is described in more detail.
FIG. 1 shows a configuration example of a conventional information search apparatus 10. In FIG. 1, the information search apparatus 10, a client 20 and a network 30 form an information search system. In the figure, the word index DB 3 stores information of document sets (in the present specification, “document” may be used to mean “document set”), which are subjects of search, based on a data structure that makes it easier to search the DB, and stores inverted indexes in this example.
Conventional inverted indexes include, for each word, the document IDs of the documents in which the word occurs and the occurrence positions of the word in each document. In addition, the occurrence frequency of the word may be included in the inverted indexes.
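The structure described above can be sketched as follows. This is only an illustrative sketch, not the layout of the word index DB 3 itself: the function names `build_inverted_index` and `occurrence_frequency`, the whitespace-based word division, and the two sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to {document ID: [occurrence positions]}.

    Positions are 1-based word offsets. Words are split on whitespace,
    which is an assumption made only for this sketch.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)


def occurrence_frequency(index, word, doc_id):
    """Occurrence frequency is the number of recorded positions."""
    return len(index.get(word, {}).get(doc_id, []))


# Hypothetical sample documents, keyed by document ID.
documents = {
    1: "Tokyo To is the capital",
    2: "To visit Tokyo take a train to Tokyo",
}
index = build_inverted_index(documents)
print(index["Tokyo"])
print(occurrence_frequency(index, "Tokyo", 2))
```

Because the occurrence frequency can be derived from the length of the position list, storing it separately is a trade-off between index size and lookup speed.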
In the information search apparatus 10 shown in the figure, the character string transferred from the client 20 is received by a client input reception unit 1. The input character string is divided into individual words by the character string information search unit 2. Then, for each word obtained by the division, the character string information search unit 2 obtains, from the word index DB 3, the document IDs of documents in which the word is included and the occurrence positions of the word in each document.
There is a case in which the words forming the query include words divided from a compound word such as “Tokyo-To”, which can be divided into “Tokyo” and “To”. Since a compound word takes its meaning from the two adjacent words, it is common, in the lookup processing, to check whether the two words are adjacent to each other in the search subject document. This is called adjacency processing. A concrete example of the adjacency processing in the information search apparatus 10 is as follows.
When the input character string is “Tokyo-To”, the character string information search unit 2 of the information search apparatus 10 divides “Tokyo-To” into “Tokyo” and “To”. Then, the character string information search unit 2 obtains, from the word index DB 3, the document IDs of the documents in which each word obtained by the division occurs and the occurrence positions of the word. FIG. 2 shows an example of the obtained information. FIG. 2 shows that “Tokyo” is included in the documents of document IDs 133, 144 and 170. In addition, it is shown that, in the document of document ID 133, the occurrence frequency of “Tokyo” is 2 and “Tokyo” occurs at the 5th and 22nd positions in the document, and that, in the document of document ID 144, the occurrence frequency is 3 and “Tokyo” occurs at the 1st, 11th and 18th positions in the document. Similarly, for “To”, the occurrence documents, occurrence frequencies, and word-based occurrence positions are shown.
Then, the character string information search unit 2 checks presence or absence of a document in which “To” occurs adjacent to “Tokyo”, and confirms that “To” occurs adjacent to “Tokyo” in the document of document ID 144. Accordingly, the character string information search unit 2 can output the document ID 144 as a document including the compound word of “Tokyo-To”.
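The adjacency check can be sketched as below. The positions of “Tokyo” in documents 133 and 144 follow the description of FIG. 2; the position of “Tokyo” in document 170 and all postings for “To” are not given in the text and are assumed values chosen only so that the example reproduces the stated result. The function name `adjacent_documents` is likewise hypothetical.

```python
def adjacent_documents(first_postings, second_postings):
    """Return IDs of documents in which the second word occurs
    immediately after some occurrence of the first word."""
    hits = []
    # Only documents containing both words can contain the compound word.
    for doc_id in sorted(first_postings.keys() & second_postings.keys()):
        second_positions = set(second_postings[doc_id])
        if any(p + 1 in second_positions for p in first_postings[doc_id]):
            hits.append(doc_id)
    return hits


# {document ID: [occurrence positions]}
tokyo = {133: [5, 22], 144: [1, 11, 18], 170: [7]}  # position in 170 is assumed
to = {133: [9], 144: [12], 152: [3]}                # all values are assumed

print(adjacent_documents(tokyo, to))
```

With these postings, only document 144 has “To” at the position directly following an occurrence of “Tokyo” (11 followed by 12).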
Since it is initially unknown which of the divided words form a compound word, the character string information search unit 2 performs adjacency processing for every document containing any one of the words included in the query, which incurs a high processing cost. In addition, a word index DB having word-based position information requires a large amount of resources to hold that position information.
After the lookup processing ends as mentioned above, the similarity calculation unit 4 calculates the degree of similarity between the input character string and each document in which the character string occurs using information obtained from the word index DB 3, and transfers documents as the results in descending order of the degree of similarity to the client output unit 5.
The degree of similarity between the character string and the document is calculated by using TFIDF (Term Frequency Inverse Document Frequency) (non-patent document 1). In this case, the similarity calculation unit 4 is shown in detail in FIG. 3. As shown in FIG. 3, the similarity calculation unit 4 includes a word importance similarity calculation unit 41 for calculating the degree of similarity using word importance (idf), and a word frequency similarity calculation unit 42 for calculating the degree of similarity using word frequency (tf). The word importance is multiplied by the word frequency so that the degree of similarity between the document and the word is obtained. The degree of similarity is calculated for every word forming the character string, and the sum of these per-word degrees of similarity is calculated, so that the degree of similarity between the character string that is the query and the document is obtained.
$$\mathrm{sim}(Q, d) = \sum_{w \in Q} w_{di}$$

$$w_{di} = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)$$
The above equations indicate the method for calculation. In the equations, sim(Q,d) is a function representing the degree of similarity between the query Q and the document d, and w_di indicates the score of a word w constituting the query Q, wherein the score is calculated from tf of the word w (the number of times the word w occurs in the document d) and idf of the word w (the inverse document frequency, which is derived from the ratio of the total number of documents to the number of documents in which w occurs). In this case, the degree of similarity is calculated without using position information of the words.
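The calculation can be illustrated with a minimal sketch. It assumes one common logarithmic form of idf, log(N / df), where N is the total number of documents and df is the number of documents in which the word occurs; the text does not fix a particular form. The term counts and the function names `idf` and `similarity` are hypothetical.

```python
import math


def idf(word, doc_term_counts):
    """Inverse document frequency, assumed here to be log(N / df)."""
    n_docs = len(doc_term_counts)
    df = sum(1 for counts in doc_term_counts.values() if counts.get(word, 0) > 0)
    return math.log(n_docs / df) if df else 0.0


def similarity(query_words, doc_id, doc_term_counts):
    """sim(Q, d): sum of tf(w, d) * idf(w) over the words w of the query Q."""
    counts = doc_term_counts[doc_id]
    return sum(counts.get(w, 0) * idf(w, doc_term_counts) for w in query_words)


# Hypothetical term counts: {document ID: {word: occurrence frequency}}.
doc_term_counts = {
    133: {"Tokyo": 2, "To": 1},
    144: {"Tokyo": 3, "To": 1},
    170: {"Tokyo": 1},
    200: {"Osaka": 2},
}
query = ["Tokyo", "To"]

# Ranking: document IDs in descending order of the degree of similarity.
ranked = sorted(doc_term_counts,
                key=lambda d: similarity(query, d, doc_term_counts),
                reverse=True)
print(ranked)
```

Only occurrence frequencies enter the score, so two documents with the same counts receive the same score regardless of where the words occur.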
When the degree of similarity is calculated with adjacency checking, there are two similarity calculation methods for the case where, for example, a compound word “Tokyo-To” consisting of the two words q1 “Tokyo” and q2 “To” is input. One is a method for calculating TFIDF by regarding q1 and q2 as w1 and w2 respectively, and the other is a method for calculating the degree of similarity by regarding the compound word, in which q1 and q2 are adjacent to each other, as a single word w. In the former method, position information is disregarded, and in the latter method, position information is used only to set the score to 1 when the words are adjacent and to 0 when they are not. In addition, in the conventional technique, even when a plurality of words are input, position information of the words is not considered.
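The difference between the two methods can be sketched as follows. The function names, the idf values, and the word positions are hypothetical and serve only to contrast the two behaviors.

```python
def separate_word_score(tfs, idfs):
    """Former method: sum of tf * idf over the divided words.
    Position information is disregarded."""
    return sum(tfs[w] * idfs[w] for w in tfs)


def compound_word_score(first_positions, second_positions):
    """Latter method: 1 when the second word directly follows the first
    somewhere in the document, 0 otherwise."""
    second = set(second_positions)
    return 1 if any(p + 1 in second for p in first_positions) else 0


idfs = {"Tokyo": 0.5, "To": 0.7}  # assumed idf values

# Document A: "Tokyo" at position 11 and "To" at position 12 (adjacent).
# Document B: "Tokyo" at position 11 and "To" at position 40 (far apart).
score_a = separate_word_score({"Tokyo": 1, "To": 1}, idfs)
score_b = separate_word_score({"Tokyo": 1, "To": 1}, idfs)

print(score_a == score_b)                # former method cannot tell A from B
print(compound_word_score([11], [12]))   # latter method, document A
print(compound_word_score([11], [40]))   # latter method, document B
```

The former method gives both documents the same score, while the latter distinguishes them only in a binary way and does not reflect how far apart the words are.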
As a method for calculating degree of similarity between character string and documents, there is also a method called BM25 (non-patent document 2). However, in this method, similarly to TFIDF, the degree of similarity is calculated without considering occurrence position information of each word in the document when the character string includes a plurality of words.
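For reference, a commonly used form of the BM25 score can be sketched as below. The text does not give the formula, so the formula itself, the parameter defaults k1 = 1.2 and b = 0.75, and the non-negative idf variant log(1 + (N - df + 0.5) / (df + 0.5)) are assumptions taken from common practice rather than from non-patent document 2. The function name and arguments are hypothetical.

```python
import math


def bm25_score(query_words, doc_counts, doc_length, avg_doc_length,
               doc_freqs, n_docs, k1=1.2, b=0.75):
    """Sum a per-word BM25 contribution over the query words.

    doc_counts: {word: occurrence frequency in this document}
    doc_freqs:  {word: number of documents containing the word}
    """
    score = 0.0
    for w in query_words:
        tf = doc_counts.get(w, 0)
        df = doc_freqs.get(w, 0)
        if tf == 0 or df == 0:
            continue  # a word absent from the document contributes nothing
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        # Document-length normalization of the term frequency.
        norm = tf + k1 * (1 - b + b * doc_length / avg_doc_length)
        score += idf * tf * (k1 + 1) / norm
    return score


# Hypothetical statistics for a single document of average length.
print(bm25_score(["Tokyo", "To"], {"Tokyo": 3, "To": 1}, 10, 10,
                 {"Tokyo": 3, "To": 2}, 4))
```

As with TFIDF, the inputs are occurrence frequencies, document length, and document frequencies only; no occurrence position appears anywhere in the calculation.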
In addition, there is also a method, for English, for calculating the degree of similarity by considering that, when a plurality of words are input, the plurality of words appear in proximity to each other (refer to non-patent document 3, for example).
[Non-patent document 1] Gerard Salton and Chris Buckley, Term Weighting Approaches in Automatic Text Retrieval, Information Processing and Management: an International Journal, Pages: 513-523 Vol. 24, Issue 5, 1988.
[Non-patent document 2] Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Aarron Gull, and Marianna Lau. Okapi at TREC3. In Text Retrieval Conference, pages 21-30, 1992.
[Non-patent document 3] Tao Tao and ChengXiang Zhai. An exploration of proximity measures in information retrieval. In SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 295-302. New York, N.Y., USA, 2007. ACM Press.