The present invention relates to a method and apparatus for searching a database that stores electronic data such as documents and images, using relevance feedback.
In recent years, as the volume of electronic data has increased, demand has grown for retrieving that data more efficiently. To meet this demand, so-called similarity-based retrieval is used as a retrieval technique. Similarity-based retrieval techniques include relevant document retrieval, which retrieves documents similar to those specified in a query, and relevant image retrieval, which retrieves images similar to those specified in a query. The following description uses relevant document retrieval as an example to explain similarity-based retrieval.
In relevant document retrieval, a query and one or more documents to be retrieved (hereafter referred to as “retrieval-oriented documents”) are each represented as a vector whose elements are occurrence information about character strings capable of being independent words (hereafter referred to as “characteristic strings”). The document retrieval process calculates the inner product of the query's vector (hereafter referred to as a “query vector”) and the retrieval-oriented document's vector (hereafter referred to as a “registered document vector”) as the similarity of the retrieval-oriented document to the query. As a result, desired documents can be retrieved efficiently by referencing the retrieval-oriented documents in descending order of the calculated similarities.
If a user does not properly specify desired documents or input an appropriate query, the documents retrieved are not relevant to the user's needs.
As a technology to solve this problem, a relevance feedback search method has been proposed in JP-A No. 117937/2001, for example, in which the user provides a relevance evaluation of the retrieved documents. The query is modified based on the evaluation, and the modified query is used to perform another search. Relevance feedback is described below.
FIG. 2 is used to describe an outline of the relevant document retrieval method according to a conventional technique, e.g., JP-A No. 117937/2001.
The relevant document retrieval in this description expresses a query and a retrieval-oriented document as a query vector and a registered document vector, respectively, whose elements are term frequencies of characteristic strings. The retrieval then calculates the similarity of the registered document vector to the query vector. The conventional technique uses Eq. 1 to calculate the similarity.
$$S(D) = \sum_{i}^{T} \{ \mathrm{Frq}(i, D) \times w(i) \} \qquad \text{(Eq. 1)}$$
where S(D) is the similarity of registered document vector D to the query vector, T the number of characteristic string differences (total number of different characteristic strings), Frq(i,D) the term frequency of characteristic string i in document D, and w(i) the weight for characteristic string i of the query vector determined by the term frequency of characteristic string i in documents specified in the query.
A query vector 201 in FIG. 2 has weight 3 for a characteristic string A, weight 2 for a characteristic string B, weight 2 for a characteristic string C, weight 3 for a characteristic string D, and weight 1 for a characteristic string E. Here, the query vector 201 is expressed as (3,2,2,3,1). A database 202 registers registered document vector (1,1,1,0,1) for a document 1 containing one characteristic string A, one characteristic string B, one characteristic string C, and one characteristic string E; registered document vector (1,1,1,0,0) for a document 2 containing one characteristic string A, one characteristic string B, and one characteristic string C; and registered document vector (0,1,0,1,1) for a document 3 containing one characteristic string B, one characteristic string D, and one characteristic string E.
When the relevant document retrieval process is executed, a similarity calculation and sort process 203 calculates similarities of the registered document vectors in the database 202 to the query vector 201 according to Eq. 1. The documents are sorted in descending order of the similarities. Consequently, a retrieved result 204 is obtained, showing similarity 8 for the document 1, 7 for the document 2, and 6 for the document 3.
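The similarity calculation and sort process 203 can be illustrated with a minimal Python sketch. It assumes the vectors of the FIG. 2 example; the function and variable names (`similarity`, `retrieve`, `database`) are hypothetical and do not come from the conventional technique itself.

```python
# Vectors from the FIG. 2 example; element order is characteristic strings A, B, C, D, E.
# All names below are illustrative assumptions, not part of the cited technique.
query_vector = [3, 2, 2, 3, 1]
database = {
    "document 1": [1, 1, 1, 0, 1],
    "document 2": [1, 1, 1, 0, 0],
    "document 3": [0, 1, 0, 1, 1],
}


def similarity(doc_vector, weights):
    """Eq. 1: sum over characteristic strings i of Frq(i, D) * w(i)."""
    return sum(frq * w for frq, w in zip(doc_vector, weights))


def retrieve(weights, db):
    """Score every registered document vector, then sort by descending similarity."""
    scored = [(name, similarity(vec, weights)) for name, vec in db.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


retrieved_result = retrieve(query_vector, database)
for name, score in retrieved_result:
    print(name, score)
```

Under these assumptions the sketch reproduces the similarities 8, 7, and 6 of the retrieved result 204.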
FIG. 2 is also used to explain an outline of relevance feedback processing according to the conventional technique in addition to the relevant document retrieval process described above. The example in FIG. 2 shows processes when a user evaluates the document 3 in the retrieved result 204 to be “relevant,” i.e., the document is a target document sought by the user or significantly relates to such a document. The conventional technique modifies characteristic string weights in the query vector according to Equation 2.
$$w'(i) = w(i) + \alpha \sum_{j}^{P} \mathrm{FP}(j) - \beta \sum_{k}^{N} \mathrm{FN}(k) \qquad \text{(Eq. 2)}$$
where w′(i) is a new weight for characteristic string i, w(i) the original weight, FP(j) the term frequency of characteristic string i included in the jth document evaluated to be “relevant”, and FN(k) the term frequency of characteristic string i included in the kth document evaluated to be “not relevant,” i.e., the document is neither a target document sought by the user nor significantly related to such a document. In Eq. 2, P is the number of documents evaluated to be “relevant” and N is the number of documents evaluated to be “not relevant”. The process example sets the parameters α and β each to 1.
When a user evaluates the document 3 to be “relevant” at a user's evaluation 205, an evaluation result read process 206 reads the evaluation result.
According to the evaluation result, a registered document vector acquisition process 207 obtains a registered document vector 208 for the document 3 from the database 202.
Using Eq. 2, a query vector modification process 209 adds the weight of each characteristic string in the registered document vector 208 of the document 3 to the corresponding element of the query vector 201. The query vector 201 is thereby modified into a query vector 201a having weights of (3,3,2,4,2).
Then, a similarity or relevance calculation and sort process 210 calculates similarities for the registered document vectors in the database 202 using the query vector 201a, resulting in similarity 10 for the document 1, similarity 8 for the document 2, and similarity 9 for the document 3. Consequently, the retrieval-oriented documents are sorted in descending order of the similarities to obtain a retrieved result 211 after the relevance feedback (hereafter referred to as a second retrieved result), in which the rank of the document 3 evaluated to be “relevant” is advanced.
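The query modification of Eq. 2 and the subsequent re-ranking can be sketched as follows. This is an illustrative sketch only: the names `modify_query`, `relevant_docs`, and `non_relevant_docs` are assumptions, and α and β are both set to 1 as in the process example.

```python
def similarity(doc_vector, weights):
    """Eq. 1: sum over characteristic strings i of Frq(i, D) * w(i)."""
    return sum(frq * w for frq, w in zip(doc_vector, weights))


def modify_query(weights, relevant_docs, non_relevant_docs, alpha=1, beta=1):
    """Eq. 2, applied to each characteristic string i of the query vector."""
    new_weights = []
    for i, w in enumerate(weights):
        positive = sum(doc[i] for doc in relevant_docs)      # sum of FP(j), j = 1..P
        negative = sum(doc[i] for doc in non_relevant_docs)  # sum of FN(k), k = 1..N
        new_weights.append(w + alpha * positive - beta * negative)
    return new_weights


# Vectors from the FIG. 2 example (characteristic strings A, B, C, D, E).
query_vector = [3, 2, 2, 3, 1]
database = {
    "document 1": [1, 1, 1, 0, 1],
    "document 2": [1, 1, 1, 0, 0],
    "document 3": [0, 1, 0, 1, 1],
}

# The user evaluates document 3 as "relevant"; no document is marked "not relevant".
modified_query = modify_query(query_vector, [database["document 3"]], [])

second_result = sorted(
    ((name, similarity(vec, modified_query)) for name, vec in database.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(modified_query)
print(second_result)
```

With these inputs the modified query vector is (3,3,2,4,2), and the document 3 moves from third to second place, ahead of the document 2.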
In this manner, the conventional technique can improve the retrieval accuracy by using the relevance feedback. However, the relevance feedback makes it difficult for the user to determine when to terminate the retrieval.
FIG. 3 illustrates the above-mentioned problem specifically. The relevance feedback is performed by evaluating the document 5 in a first retrieved result 301 (also referred to as a “reference search result”) to be “relevant”. In one case, a large rank change is found in the transition from the first retrieved result 301 to a second retrieved result 302 (also referred to as a “subsequent search result”). In the other case, a small rank change is found in the transition from the first retrieved result 301 to the second retrieved result 302.