1. Field of the Invention
The present invention relates to a scheme for filtering documents which extracts and outputs documents that match a relevant profile indicating a user's request from search target documents on a network.
2. Description of the Background Art
The document filtering technique of this kind is effective in extracting only information that matches the preferences of a user from a stream of a large amount of text information, such as that of a news information delivery service using e-mails, and providing it to the user. Namely, the document filtering is a task for acquiring only documents that satisfy the user's request from sequentially arriving search target documents and providing them to the user.
In such a document filtering, the user's request is expressed inside a filtering system as a profile. Then, the filtering system judges whether this profile is satisfied or not with respect to each one of the sequentially arriving search target documents, and presents only those documents satisfying the user's request to the user. The user judges whether each presented document actually satisfies the request or not, and provides a feedback on that judgement to the filtering system. In many cases, the filtering system makes an improvement on the filtering accuracy by updating the profile according to the feedback from the user.
Most of the filtering systems employ techniques used in the information retrieval. In many cases, documents and profiles to be entered into the system are expressed inside the system according to the vector space model or the like, and a similarity between the profile and the document is used as a criterion for judging whether each document satisfies the profile or not. Also, the retrieval formula expansion method of the information retrieval is often applied to the profile updating of the document filtering. Namely, the profile is updated by adding information extracted from the selected documents to the profile according to the relevance feedback information from the user, so as to refine the profile.
Now, the processing procedure of the conventional document filtering method utilizing such a profile updating will be described with reference to FIG. 1.
In this document filtering method, the similarity between a profile q and each search target document d is calculated in order to search out documents similar to the profile q representing the user's request from the search target documents (step S71). Then, whether the calculated similarity between the profile q and the search target document d is higher than a prescribed threshold or not is judged (step S73). When the similarity is not higher than the prescribed threshold, the processing returns to the step S71 and the same processing is repeated for the next search target document, whereas when the similarity is higher than the prescribed threshold, a relevance feedback is obtained (step S75) and the profile q is updated (step S77). This processing is carried out for all the search target documents, and then the processing is terminated (step S79).
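The loop of FIG. 1 can be sketched as follows. This is only an illustrative sketch: the function name filter_documents and the callables similarity, get_feedback and update_profile are hypothetical placeholders supplied by the caller, not part of the described method.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Vector = Dict[str, float]  # assumed representation: word -> weight


def filter_documents(
    profile: Vector,
    documents: Iterable[Vector],
    similarity: Callable[[Vector, Vector], float],
    get_feedback: Callable[[Vector], bool],
    update_profile: Callable[[Vector, Vector, bool], Vector],
    threshold: float,
) -> Tuple[Vector, List[Vector]]:
    """Select documents whose similarity to the profile exceeds the threshold."""
    selected: List[Vector] = []
    for doc in documents:
        sim = similarity(profile, doc)       # similarity calculation (step S71)
        if sim <= threshold:                 # threshold judgement (step S73)
            continue                         # not selected: go to the next document
        relevant = get_feedback(doc)         # relevance feedback from the user
        profile = update_profile(profile, doc, relevant)  # profile updating (step S77)
        selected.append(doc)
    return profile, selected                 # all documents processed (step S79)
```

Note that the profile used for each document is the one updated by the feedback on earlier selected documents, which is what makes the updating sequential.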
In such a document filtering method, the document filtering is carried out by a general procedure of selecting those documents for which the similarity with respect to the profile exceeds the threshold and then presenting them to the user. However, in this filtering method, setting of an appropriate threshold is difficult, as will be described below. When the threshold is set low in order to select many relevant documents, the number of erroneously selected non-relevant documents will increase considerably, whereas when the threshold is set high in order to reduce the erroneously selected non-relevant documents, many relevant documents will be overlooked.
As already mentioned above, the retrieval formula expansion technique used in the information retrieval is often applied to the profile updating in such a document filtering method. Next, the profile updating method utilizing the retrieval formula expansion method based on word contributions in the information retrieval which can obtain a high accuracy will be described.
First, the retrieval formula expansion method based on word contributions will be described. The word contribution is a measure in which the influence of each word on the similarity between documents is expressed numerically. The word contribution of a word wi in the similarity between an input sentence q and a retrieval target document d is defined by the following equation (1).
$$\mathrm{Cont}(w_i, q, d) = \mathrm{Sim}(q, d) - \mathrm{Sim}(q'(w_i), d'(w_i)) \tag{1}$$
where Sim(q, d) is a similarity between q and d, q′(wi) is a sentence in which the word wi is excluded from the input sentence q, and d′(wi) is a document in which the word wi is excluded from d.
Namely, the word contribution Cont (wi, q, d) is a difference between the similarity of q and d and the similarity of q′ and d′ in which the word wi is absent. Consequently, among all the words that appear in q and d, the contribution of a word which raises the similarity is positive, and the contribution of a word which lowers the similarity is negative.
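The definition of the word contribution can be illustrated by the following minimal sketch. It assumes, purely for illustration, that Sim is the cosine of raw term-frequency vectors; the names cosine_tf and word_contribution are hypothetical and any other similarity function could be passed in.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def cosine_tf(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-frequency vectors (assumed choice of Sim)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_contribution(
    word: str,
    q_tokens: Sequence[str],
    d_tokens: Sequence[str],
    sim: Callable[[Counter, Counter], float] = cosine_tf,
) -> float:
    """Sim(q, d) minus the similarity after removing `word` from both q and d."""
    q, d = Counter(q_tokens), Counter(d_tokens)
    q_without = Counter({w: c for w, c in q.items() if w != word})
    d_without = Counter({w: c for w, c in d.items() if w != word})
    return sim(q, d) - sim(q_without, d_without)


# A word shared by q and d raises the similarity; a word found in only one lowers it.
shared = word_contribution("a", ["a", "b"], ["a", "c"])    # positive
only_in_d = word_contribution("c", ["a", "b"], ["a", "c"])  # negative
```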
It is also known that many words appearing in documents have nearly zero contributions, and a rather small number of words have significant influences on the similarity. Among them, words that have large positive contributions are words which exist in both the input sentence and the retrieval target document. On the other hand, it is considered that words that have large negative contributions are words which exist in only one of the two and which express characteristics of that document prominently. For this reason, in the retrieval formula expansion method based on word contributions, the expansion of the retrieval formula is carried out as follows.
First, when a group of documents that match the input sentence q:
$$D_{rel}(q) = \{d_1, \ldots, d_{Num}\} \tag{2}$$
is given, the contributions of all words appearing in each document belonging to Drel(q) are obtained, and N words with low word contributions are extracted from each similar document. Next, a total sum of the contributions by each extracted word w is multiplied by a weight “wgt”, and this is taken as a score with respect to the word w. When the contribution of the word w with respect to the input sentence q and the document d is denoted as Cont(w, q, d), the score Score(w) of the word w can be expressed by the following equation (3).
$$\mathrm{Score}(w) = \mathrm{wgt} \times \sum_{d \in D_{rel}(q)} \mathrm{Cont}(w, q, d) \tag{3}$$
Then, the retrieval formula expansion is realized by adding those words that are not contained in the original retrieval formula among the extracted words, to the retrieval formula.
At a time of adding some word w into a vector of the input sentence, the score Score(w) calculated by the equation (3) is regarded as a frequency for which the word w appears in the input sentence (word appearance frequency tf), and the value of an element expressing the word w in the vector of the input sentence is calculated. When each element of the vector is calculated by TF*IDF, the retrieval formula expansion is realized by calculating TF by setting Score(w) as tf and multiplying IDF of the word w, and entering the resulting TF*IDF value into an element for the word w in the vector of the input sentence.
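The extraction and scoring steps of the equation (3) can be sketched as follows. The function names and the input layout (one dictionary of precomputed contributions per relevant document) are assumptions made for illustration. The sketch also assumes that a word absent from a document contributes zero for that document, and the usage below assumes a negative wgt so that negative contributions yield positive scores; neither assumption is stated in the text.

```python
from typing import Dict, List, Set


def expansion_scores(
    contributions_per_doc: List[Dict[str, float]],
    n_words: int,
    wgt: float,
) -> Dict[str, float]:
    """Score(w) = wgt * sum over relevant documents of Cont(w, q, d)."""
    extracted: Set[str] = set()
    for conts in contributions_per_doc:
        # N words with the lowest contributions in this document
        lowest = sorted(conts, key=conts.get)[:n_words]
        extracted.update(lowest)
    return {
        w: wgt * sum(conts.get(w, 0.0) for conts in contributions_per_doc)
        for w in extracted
    }


def new_query_words(scores: Dict[str, float], query_words: Set[str]) -> Dict[str, float]:
    """Keep only extracted words not already contained in the retrieval formula."""
    return {w: s for w, s in scores.items() if w not in query_words}


# Hypothetical contributions for two relevant documents
conts = [{"x": -0.3, "y": 0.2, "z": -0.1}, {"x": -0.2, "z": 0.05, "k": -0.4}]
scores = expansion_scores(conts, n_words=1, wgt=-10.0)
added = new_query_words(scores, {"x"})
```

Each resulting score would then be treated as the appearance frequency tf of the word when its TF*IDF element is computed.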
Next, the profile updating method based on word contributions will be described.
In the retrieval formula expansion based on word contributions, the score of each word is obtained by multiplying a total sum of contributions of words, which are extracted according to their contributions from each document in the relevant document set obtained according to the feedback with respect to the initial retrieval result, by a weight. Here, in contrast, words are extracted according to their contributions from each document selected during the filtering, and a profile is updated sequentially by adding information on the extracted words to an immediately previous profile.
First, when the selected document is a relevant document, the score Scorerel(wi) of the extracted word wi is calculated by the following equation (4), or when the selected document is a non-relevant document, the score Scorenrel(wi) of the extracted word wi is calculated by the following equation (5).
$$\mathrm{Score}_{rel}(w_i) = \mathrm{wgt}_{relR} \times \mathrm{Cont}(w_i, q, d) \tag{4}$$
$$\mathrm{Score}_{nrel}(w_i) = \mathrm{wgt}_{nrelR} \times \mathrm{Cont}(w_i, q, d) \tag{5}$$
Then, the weight for each word is calculated by the TF*IDF method by treating the score of each word as obtained by the above equation as a word appearance frequency tf. Then, a word and its weight are added to the original profile when the extracted word is a word in the relevant document, or a word and its weight are subtracted from the original profile when the extracted word is a word in the non-relevant document. Namely, an element for each word selected from the relevant document is added to the original profile and an element for each word selected from the non-relevant document is subtracted from the original profile. Note that the words with negative weights will not be used in the similarity calculation as a result of this processing.
By the above processing, positive values are given to originally valueless dimensions of the vector representing the profile, so that the profile information is expanded. Also, the weights for words originating from both the relevant documents and the non-relevant documents are suppressed, while the weights of words that appear only in the relevant documents are emphasized.
Here, both the search target document and the profile are expressed by using the vector space model, and the filtering with respect to each document is realized by calculating the similarity between them.
In expressing each document and profile by using the vector space model, the weight for each element of the vector representing each document or profile is calculated by the TF*IDF method. Here, the calculation formulas for TF and IDF to be used are those based on an algorithm used in the SMART which is one of the most effective information retrieval systems, which are given by the following expressions (6) and (7).
$$\text{TF factor: } \log(1 + tf_{ij}) \tag{6}$$
$$\text{IDF factor: } \log\left(\frac{M}{df_j}\right) \tag{7}$$
where tfij is an appearance frequency of a word wj in a document di, dfj is the number of documents in which a word wj appears, and M is the number of documents contained in the document set used at a time of vocabulary compilation.
Also, the similarity is obtained in terms of normalized values by taking a cosine of the vectors for the profile and the search target document as defined by the following equation (8).
$$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|} \tag{8}$$
where $\vec{q}$ and $\vec{d}$ are the vectors representing the profile and the search target document respectively, and $|\vec{d}|$ is the Euclidean length of $\vec{d}$.
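The weighting of the expressions (6) and (7) and the similarity of the equation (8) can be sketched as follows. The function names are hypothetical, the natural logarithm is an assumption (the base of the logarithm is not specified), and words with no document frequency entry are simply skipped in this sketch.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def tfidf_vector(tokens: Sequence[str], df: Dict[str, int], num_docs: int) -> Dict[str, float]:
    """Weight each word by log(1 + tf) * log(M / df)."""
    tf = Counter(tokens)
    return {
        w: math.log(1 + count) * math.log(num_docs / df[w])
        for w, count in tf.items()
        if df.get(w, 0) > 0  # assumption: ignore words outside the vocabulary
    }


def cosine_similarity(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Cosine of the angle between the profile vector and the document vector."""
    dot = sum(weight * d[w] for w, weight in q.items() if w in d)
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


# Hypothetical document frequencies over M = 10 documents
df = {"filter": 1, "the": 10}
doc_vec = tfidf_vector(["filter", "filter", "the"], df, num_docs=10)
sim = cosine_similarity(doc_vec, doc_vec)
```

A word that appears in every document, such as "the" in this toy vocabulary, receives an IDF factor of log(1) = 0 and so does not affect the similarity.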
In this profile updating method, TF and IDF are calculated by using the expressions (6) and (7) by treating the score of each word as obtained by the equation (4) or (5) as a word appearance frequency tf. Consequently, the TF*IDF value of each word is calculated by the following equations (9) and (10).
$$\mathrm{Value}_{rel}(w_i) = \log(1 + \mathrm{Score}_{rel}(w_i)) \times \log\left[\frac{M}{df_i}\right] \tag{9}$$
$$\mathrm{Value}_{nrel}(w_i) = \log(1 + \mathrm{Score}_{nrel}(w_i)) \times \log\left[\frac{M}{df_i}\right] \tag{10}$$
where dfi is the number of documents in which a word wi appears, and M is the number of documents used in producing a list of df.
Also, the profile q and the document d are expressed according to the vector space model, by the following equations (11) and (12) respectively.
$$\vec{q} = (q_1, \ldots, q_n) \tag{11}$$
$$\vec{d} = (d_1, \ldots, d_n) \tag{12}$$
where q1, . . . , qn are weights for words in the profile, d1, . . . , dn are weights for words in each document, and n is the number of dimensions of the vector.
The profile after updating can be expressed by the following equation (13), where each element for each extracted word wi is given by the following equation (14) in the case of a word in the relevant document, or by the following equation (15) in the case of a word in the non-relevant document.
$$\vec{q}_{new} = (q_1', \ldots, q_n') \tag{13}$$
$$q_i' = q_i + \mathrm{Value}_{rel}(w_i) \tag{14}$$
$$q_i' = q_i - \mathrm{Value}_{nrel}(w_i) \tag{15}$$
In other words, an element of each word selected from the relevant document is added to the elements of the original profile and an element of each word selected from the non-relevant document is subtracted from the elements of the original profile. Note that the words with negative weights will not be used in the similarity calculation as a result of this processing.
FIG. 2 shows the procedure of this profile updating method which proceeds as follows. As shown in FIG. 2, with respect to the profile q, the updated profile qnew, and the selected document d which are expressed according to the vector space model, a word set W is extracted from the selected document d first (step S83), and whether the selected document d is a relevant document or not is judged (step S85). When the selected document d is a relevant document, the score of each word is calculated by the equation (4) (step S87), and the score of each word wi is added to the profile q as in the equation (14) (step S89). When the selected document d is not a relevant document, the score of each word is calculated by the equation (5) (step S88), and the score of each word wi is subtracted from the profile q as in the equation (15) (step S91). Then, the profile after such addition or subtraction is set as the updated profile qnew (step S93).
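The updating procedure of FIG. 2 can be sketched as follows, combining the equations (4), (5), (9), (10), (14) and (15). The function names, the parameter names wgt_rel and wgt_nrel, and the input layout (precomputed contributions of the extracted words) are assumptions for illustration. The sketch also assumes the natural logarithm and skips words whose score is not positive, which the text does not specify.

```python
import math
from typing import Dict


def update_profile(
    profile: Dict[str, float],
    contributions: Dict[str, float],  # Cont(w, q, d) for each extracted word
    relevant: bool,
    df: Dict[str, int],
    num_docs: int,
    wgt_rel: float,
    wgt_nrel: float,
) -> Dict[str, float]:
    """Return the updated profile; the input profile is left unmodified."""
    wgt = wgt_rel if relevant else wgt_nrel
    sign = 1.0 if relevant else -1.0
    new_profile = dict(profile)
    for word, cont in contributions.items():
        score = wgt * cont                                   # equation (4) or (5)
        if score <= 0.0 or df.get(word, 0) <= 0:
            continue                                         # assumption: skip such words
        value = math.log(1 + score) * math.log(num_docs / df[word])  # equation (9) or (10)
        new_profile[word] = new_profile.get(word, 0.0) + sign * value  # equation (14) or (15)
    return new_profile


def active_terms(profile: Dict[str, float]) -> Dict[str, float]:
    """Words with non-positive weights are not used in the similarity calculation."""
    return {w: v for w, v in profile.items() if v > 0.0}


# Hypothetical values: one relevant document, then one non-relevant document
df = {"a": 10, "b": 10, "c": 10}
p1 = update_profile({"a": 1.0}, {"b": -0.5}, True, df, 100, wgt_rel=-2.0, wgt_nrel=-2.0)
p2 = update_profile(p1, {"a": -0.5, "c": -0.5}, False, df, 100, wgt_rel=-2.0, wgt_nrel=-2.0)
usable = active_terms(p2)
```

In this toy run the word "b" enters the profile from the relevant document, while "a" and "c" are pushed below zero by the non-relevant document and drop out of the similarity calculation.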
The evaluation test using data prepared by the Filtering Track of TREC-8 was conducted according to the above described profile updating method. FIG. 3 shows the similarities with respect to the profile of the relevant documents and the non-relevant documents that were selected when the threshold of the similarity was set equal to 0.1.
As can be seen from FIG. 3, only a few non-relevant documents have similarities considerably higher than the threshold, but relevant documents and non-relevant documents coexist at similarities in the vicinity of the threshold.
Thus, in the conventional document filtering method, only a few non-relevant documents have similarities considerably higher than the threshold, but relevant documents and non-relevant documents coexist at similarities in the vicinity of the threshold, so that it is impossible to select only the relevant documents among the documents with these similarities.
For example, when the threshold is set low in order to select many relevant documents, the number of erroneously selected non-relevant documents will increase, whereas when the threshold is set high, the number of erroneously selected non-relevant documents can be reduced but the number of correctly selected relevant documents will also be reduced.
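The trade-off can be made concrete with the following small sketch. The similarity values and relevance labels are invented solely for illustration and are not taken from the TREC-8 evaluation; the function name selection_counts is hypothetical.

```python
from typing import List, Tuple


def selection_counts(
    scored_docs: List[Tuple[float, bool]], threshold: float
) -> Tuple[int, int]:
    """Count (relevant, non-relevant) documents whose similarity exceeds the threshold."""
    relevant = sum(1 for sim, is_rel in scored_docs if sim > threshold and is_rel)
    non_relevant = sum(1 for sim, is_rel in scored_docs if sim > threshold and not is_rel)
    return relevant, non_relevant


# Invented (similarity, is_relevant) pairs: mixed near 0.1, mostly relevant above it
toy = [
    (0.30, True), (0.22, True), (0.12, True), (0.11, False),
    (0.105, True), (0.10, False), (0.09, False),
]

low = selection_counts(toy, threshold=0.08)   # (4, 3): all relevant, but 3 non-relevant too
high = selection_counts(toy, threshold=0.20)  # (2, 0): no non-relevant, but 2 relevant lost
```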
In other words, when the retrieval formula expansion method of the information retrieval is applied to the profile updating and an attempt is made to obtain many relevant documents simply by setting the threshold of the similarity, there has been a problem that many non-relevant documents will also be selected.