1. Field of the Invention
The present invention relates generally to a document retrieval system in which a desired document is retrieved from a data base in which pieces of electronic document information are stored, and more particularly to a document retrieval system in which a desired document is efficiently retrieved with high accuracy from a storing unit such as a document information data base, a general word processor or an office computer.
2. Description of the Related Art
A large number of pieces of electronic document information such as electronic mails, electronic catalogs, electronic publications and the like have been recently circulated, so that a document retrieval system in which a desired document is retrieved from a document information data base has been required.
In a conventional document retrieval system, either a keyword search method using a keyword manually attached to each of the documents or a full-text search method for collating, at high speed, a plurality of character strings in a whole document with a reference character string is generally used. In the full-text search method, no manual keyword attaching work is required.
However, each of the above conventional methods is performed according to an exact match retrieval model in which perfect agreement of query (or reference) information with superficial information such as a word, a character string or the like is required. That is, a user is required to prepare a retrieval formula (normally a logical formula including a logical symbol "AND", "OR", "NOT" or the like) composed of query character strings as the query information. Therefore, unless the user is an expert in document retrieval, it is difficult for the user to prepare an appropriate retrieval formula matching the user's retrieval intention. As a result, there are cases in which not all desired documents are retrieved or a large number of unnecessary documents are retrieved, so that there is a drawback that both the retrieval accuracy and the retrieval efficiency deteriorate.
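The exact match retrieval model described above can be illustrated with a minimal sketch. The tuple encoding of the retrieval formula, the function name matches and the three-document collection are assumptions made only for this illustration and are not taken from any particular conventional system.

```python
def matches(formula, words):
    """Evaluate a retrieval formula against the set of words in one document.

    A formula is either a plain string (a query character string) or a tuple:
    ("AND", f1, f2, ...), ("OR", f1, f2, ...) or ("NOT", f).
    This encoding is an assumption chosen for the sketch.
    """
    if isinstance(formula, str):
        # Exact match: the query string must agree perfectly with a document word.
        return formula in words
    op, *args = formula
    if op == "AND":
        return all(matches(a, words) for a in args)
    if op == "OR":
        return any(matches(a, words) for a in args)
    if op == "NOT":
        return not matches(args[0], words)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical collection: each document reduced to a set of lower-case words.
documents = {
    "doc_a": {"athletic", "meeting", "tokyo"},
    "doc_b": {"athletic", "association", "tokyo"},
    "doc_c": {"research", "meeting", "school"},
}

formula = ("AND", "athletic", ("NOT", "association"))
hits = sorted(name for name, words in documents.items() if matches(formula, words))
print(hits)
```

A document is either selected or rejected; no ranking is produced, and a slightly different formula can select a very different set of documents, which is the difficulty noted above for a non-expert user.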
To avoid the above drawback, a large number of document retrieval methods according to an inexact match retrieval model have been proposed. In the inexact match retrieval model, a user specifies a natural language sentence or a typical sentence expressing the user's retrieval intention, so that the user is not required to prepare a logical formula. As a representative example of the inexact match retrieval model, a vector space model is well known. In the vector space model, each of a plurality of documents and a query sentence is expressed by a multi-dimensional characteristic vector, the similarity between the query sentence and each of the documents is measured by calculating a distance between the vector of the query sentence and the vector of the corresponding document, and the documents are ranked in order of similarity. A technique represented by the vector space model is disclosed in the literature "Salton, G. and McGill, M. J.: Introduction to Modern Information Retrieval, McGraw-Hill Publishing Company, 1983".
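As a rough illustration of the vector space model, the following sketch represents the query sentence and each document as a sparse vector of word weights and ranks the documents by the cosine of the angle between the vectors. The cosine measure is one common choice of distance and is assumed here; the vectors and the names doc_a and doc_b are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical characteristic vectors (word -> weight).
query = {"athletic": 1.0, "meeting": 1.0, "hold": 1.0}
docs = {
    "doc_a": {"athletic": 2.0, "meeting": 3.0, "hold": 1.0, "tokyo": 1.0},
    "doc_b": {"meeting": 1.0, "hotel": 1.0, "tokyo": 1.0},
}

# Rank the documents in order of decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranking)
```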
2.1. Previously Proposed Art
Document retrieval processing according to the conventional vector space model is described with reference to the drawings.
FIG. 1 is a block diagram of a conventional document retrieval system operated according to the conventional vector space model. As shown in FIG. 1, a conventional document retrieval system 100 is composed of
a document storing unit 101 for storing a plurality of documents planned to be retrieved,
a word occurrence frequency calculating unit 102 for reading out the documents from the document storing unit 101 and calculating an occurrence frequency of each of a plurality of words used in each of the documents,
a word occurrence frequency information storing unit 103 for storing the words and the occurrence frequencies calculated in the word occurrence frequency calculating unit 102 for each of the documents as word occurrence frequency information,
an input and output unit 104 for receiving an input sentence from a user and outputting ranks of the documents to the user,
an input sentence analyzing unit 105 for analyzing the input sentence received by the input and output unit 104 to extract a plurality of words from the input sentence as a plurality of query words used for retrieval and calculate an occurrence frequency of each of the query words and a weighting factor for each of the query words,
an input word information storing unit 106 for temporarily storing the query words, the occurrence frequencies and the weighting factors obtained in the input sentence analyzing unit 105 as input word information, and
a document rank calculating unit 107 for comparing and collating the word occurrence frequency information stored in the word occurrence frequency information storing unit 103 with the input word information stored in the input word information storing unit 106 to rank the documents in order of similarity to the input sentence and outputting a retrieval result indicating the ranks of the documents to the input and output unit 104.
S1 = Σ (weighting factor for one query word * occurrence frequency of one retrieval word agreeing with the query word) = 1*2 (athletic) + 1*3 (meeting) + 1*1 (hold) = 6
S2 = 1*1 (athletic) + 1*1 (meeting) + 1*1 (hold) = 3
S3 = Σ (weighting factor for one query word * occurrence frequency of a corresponding retrieval word) = 1*3 (athletic) + 1*2 (meeting) + 1*2 (hold) = 7
word occurrence frequency calculating means for calculating an occurrence frequency of each of a plurality of retrieval words existing in each of a plurality of documents;
functional unit extracting means for extracting one or more document sentence structural functional units from each of the documents;
input sentence analyzing means for calculating an occurrence frequency of each of a plurality of query words existing in an input sentence and extracting one or more input sentence structural functional units from the input sentence;
functional unit collating means for collating the document sentence structural functional units of each document extracted by the functional unit extracting means with the input sentence structural functional units extracted by the input sentence analyzing means to calculate a matching degree of each document with the input sentence and classifying the documents into a plurality of similarity levels according to the matching degrees of the documents; and
document rank determining means for calculating an estimation value of each of the documents from the occurrence frequencies of the retrieval words of the corresponding document calculated by the word occurrence frequency calculating means and the occurrence frequencies of the query words of the input sentence calculated by the input sentence analyzing means according to an estimation function and ranking each of the documents on condition that one or more documents classified into a high similarity level by the functional unit collating means are ranked to a higher level than one or more documents classified into a low similarity level by the functional unit collating means and a plurality of documents classified into the same similarity level are ranked in the order of decreasing their estimation values; and
outputting means for outputting a document retrieval result indicating the ranks of the documents.
important portion extracting means for extracting an important portion expressing a theme of one document from each of the documents; and
important functional unit extracting means for extracting one or more important sentence structural functional units from each of the important portions of the documents extracted by the important portion extracting means as the document sentence structural functional units.
retrieval formula generating means for generating a retrieval formula expressed by a logical combination of the query words of the input sentence; and
retrieval document selecting means for examining whether or not each of the documents satisfies the retrieval formula generated by the retrieval formula generating means by searching the whole documents and selecting one or more particular documents satisfying the retrieval formula from the documents, particular document sentence structural functional units of each particular document selected from the document sentence structural functional units of each document being collated with the input sentence structural functional units by the functional unit collating means.
In the above configuration, before a document retrieval is performed by a user, the documents stored in the document storing unit 101 are read out to the word occurrence frequency calculating unit 102, a morphological analysis is performed for each of the documents if necessary, each of the documents is divided into a plurality of retrieval words, an occurrence frequency of each of the retrieval words is measured for each of the documents, and the retrieval words and the occurrence frequencies are stored in the word occurrence frequency information storing unit 103 as word occurrence frequency information.
An example of a document planned to be retrieved is shown in FIG. 2.
As shown in FIG. 2, a first document "The first National Athletic Meeting was held at the Tokyo sport center on May 15. In this year, three thousands athletes took part in the Meeting from the whole nation, and the athletic meeting was the greatest scale in history." is stored in the unit 101. Also, a second document "An annual general meeting of the Nation Athletic Association was held at the Oriental hotel in the Shinagawa ward of Tokyo on December 10. The number of participators in this year is five hundreds, so that 30% is increased as compared with the number in an ordinary year." is stored in the unit 101.
An example of the word occurrence frequency information which is obtained from the first and second documents shown in FIG. 2 and is stored in the word occurrence frequency information storing unit 103 is shown in FIG. 3. In this example, occurrence frequencies for the retrieval words "hold", "national", "athletic", "sport", "center", "meeting", "Tokyo" and the like are measured, and, for convenience, articles, prepositions, auxiliary verbs and the like, as well as "be", "is", "are", "was" and "were", are not measured. The word "held" of each document is measured as the retrieval word "hold".
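The measurement of occurrence frequencies described above can be sketched as follows. The stop list, the base-form table standing in for a morphological analysis, and the function name count_retrieval_words are assumptions for this sketch; the actual contents of FIG. 3 are not reproduced here.

```python
import re
from collections import Counter

# Assumed stop list: articles, prepositions, forms of "be" and similar words
# that are not measured. The list is illustrative, not exhaustive.
STOP_WORDS = {
    "a", "an", "the", "at", "on", "in", "of", "from", "and",
    "be", "is", "are", "was", "were",
}

# Assumed base-form table standing in for a morphological analysis.
BASE_FORMS = {"held": "hold"}


def count_retrieval_words(text):
    """Divide a document into retrieval words and count each occurrence."""
    tokens = re.findall(r"[a-z]+", text.lower())
    words = [BASE_FORMS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(words)


first_document = (
    "The first National Athletic Meeting was held at the Tokyo sport center "
    "on May 15. In this year, three thousands athletes took part in the "
    "Meeting from the whole nation, and the athletic meeting was the "
    "greatest scale in history."
)

frequencies = count_retrieval_words(first_document)
print(frequencies["athletic"], frequencies["meeting"], frequencies["hold"])
```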
Thereafter, when the user intends to perform a document retrieval, the user inputs a natural language sentence (or a query document) to the input and output unit 104 as an input sentence, and the input sentence is divided into a plurality of words in the input sentence analyzing unit 105. In this case, a morphological analysis is performed for the input sentence, if necessary. As a result, the words are extracted from the input sentence as a plurality of query words, and a weighting factor indicating a degree of importance of one query word is calculated for each of the query words. In general, each of the weighting factors is determined according to an occurrence frequency of a corresponding query word in the input sentence or a value input by the user. Thereafter, the query words, the occurrence frequencies of the query words and the weighting factors for the query words are stored in the input word information storing unit 106 as input word information.
An example of the input sentence "about holding the Athletic meeting" is shown in FIG. 4. An example of the input word information analyzed in the input sentence analyzing unit 105 is shown in FIG. 5. In this case, the word "holding" is analyzed as a query word "hold", and an occurrence frequency of each query word is set as a weighting factor.
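The analysis of the input sentence can be sketched in the same way. The stop list, the base-form table and the dictionary layout of the input word information are assumptions for this sketch; following the text, the occurrence frequency of each query word is used as its weighting factor.

```python
import re
from collections import Counter

# Assumed stop list and base-form table (illustrative only).
STOP_WORDS = {"about", "the", "a", "an", "of", "on", "in", "at"}
BASE_FORMS = {"holding": "hold", "held": "hold"}


def analyze_input_sentence(sentence):
    """Extract query words with their occurrence frequencies and weighting factors."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    words = [BASE_FORMS.get(t, t) for t in tokens if t not in STOP_WORDS]
    counts = Counter(words)
    # The occurrence frequency is set as the weighting factor.
    return {w: {"frequency": n, "weight": float(n)} for w, n in counts.items()}


input_word_information = analyze_input_sentence("about holding the Athletic meeting")
print(input_word_information)
```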
Thereafter, the word occurrence frequency information stored in the word occurrence frequency information storing unit 103 is compared and collated with the input word information stored in the input word information storing unit 106 in the document rank calculating unit 107, a weighting factor for each of the retrieval words is calculated, and the documents are ranked in order of similarity to the input sentence. For example, a weighting factor for each of the query words is often calculated according to a weighting calculating equation indicated by tf*idf. Each of the documents is expressed by a multi-dimensional characteristic vector calculated from the weighting factors for the retrieval words and the occurrence frequencies of the retrieval words. In the same manner, the input sentence is expressed by a multi-dimensional characteristic vector calculated from the weighting factors for the query words and the occurrence frequencies of the query words. Thereafter, a similarity between each of the documents and the input sentence is calculated from the multi-dimensional characteristic vector of the corresponding document and the multi-dimensional characteristic vector of the input sentence according to an estimation function. For example, an inner product of the two multi-dimensional characteristic vectors is calculated or an angle between the two multi-dimensional characteristic vectors is calculated as the estimation function.
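A minimal sketch of tf*idf weighting combined with an inner-product estimation function is given below. The text names tf*idf without fixing a variant, so the form tf * log(N / df) is an assumption, as are the function names and the three hypothetical documents.

```python
import math


def tf_idf_vectors(term_freqs):
    """Turn per-document occurrence frequencies into tf*idf characteristic vectors.

    Assumed variant: weight = tf * log(N / df), where N is the number of
    documents and df is the number of documents containing the word.
    """
    n_docs = len(term_freqs)
    doc_freq = {}
    for freqs in term_freqs.values():
        for term in freqs:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        name: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in freqs.items()}
        for name, freqs in term_freqs.items()
    }


def inner_product(u, v):
    """Inner product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())


# Hypothetical occurrence frequencies (word -> count) for three documents.
term_freqs = {
    "doc_a": {"athletic": 2, "meeting": 3, "tokyo": 1, "hold": 1},
    "doc_b": {"meeting": 1, "hotel": 1, "tokyo": 1, "hold": 1},
    "doc_c": {"athletic": 3, "school": 2, "hold": 2},
}
vectors = tf_idf_vectors(term_freqs)

# A query vector expressed with unit weights for two query words.
query = {"athletic": 1.0, "meeting": 1.0}
scores = {name: inner_product(query, vec) for name, vec in vectors.items()}
print(scores)
```

With this variant, a word occurring in every document (here "hold") receives a weight of zero, so it does not contribute to the similarity.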
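In cases where a simple estimation function is used, that is, the sum over the query words of the weighting factor for each query word multiplied by the occurrence frequency of the retrieval word agreeing with that query word, a similarity S1 between the first document and the input sentence is calculated to be 6.
In the same manner, a similarity S2 between the second document and the input sentence is calculated to be 3.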
Therefore, because the similarity S1 is higher than the similarity S2, a rank for the first document is higher than that for the second document.
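The arithmetic behind S1 and S2 can be checked with a short sketch. The occurrence frequencies are counted from the first and second documents quoted above, and the weighting factor of each query word is 1; the function and variable names are assumptions for this sketch.

```python
def simple_similarity(query_weights, doc_freqs):
    """Sum of (weighting factor * occurrence frequency of the agreeing retrieval word)."""
    return sum(weight * doc_freqs.get(word, 0) for word, weight in query_weights.items())


# Query words of the input sentence, each with a weighting factor of 1.
query_weights = {"athletic": 1, "meeting": 1, "hold": 1}

# Occurrence frequencies of the agreeing retrieval words in each document.
doc_freqs = {
    "first": {"athletic": 2, "meeting": 3, "hold": 1},
    "second": {"athletic": 1, "meeting": 1, "hold": 1},
}

scores = {name: simple_similarity(query_weights, freqs) for name, freqs in doc_freqs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```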
Thereafter, a retrieval result indicating the ranks of the documents is output to the input and output unit 104.
Accordingly, in the conventional document retrieval system 100 operated according to the conventional vector space model, because the user is not required to convert a sentence expressing the user's retrieval intention to a retrieval formula, the user can easily use the conventional document retrieval system 100. Also, because a plurality of documents are ranked in order of similarity to the natural language sentence (or a query document) expressing the user's retrieval intention, one or more desired documents matching the user's retrieval intention can be efficiently retrieved by referring to the ranks of the documents output from the document retrieval system 100.
2.2. Problems to be Solved by the Invention
However, in the conventional document retrieval system 100 operated according to the conventional vector space model, because the documents and the input sentence expressing the user's retrieval intention are converted to the multi-dimensional characteristic vectors by using the occurrence frequencies and the weighting factors, there is a drawback that one or more documents matching the user's retrieval intention cannot be retrieved with a high accuracy.
In particular, it is difficult to express an original meaning of a document or contents of a theme of a document by using a characteristic vector in the calculation of similarity. Therefore, even when the contents of a particular document differ greatly from those of a necessary document matching the user's retrieval intention, the particular document is ranked at a high level if an occurrence distribution of query words in the particular document is similar to that in the input sentence, so that there is a drawback that a plurality of documents cannot be correctly ranked in order of similarity to the user's retrieval intention.
For example, when a third document "a research meeting for a plurality of athletic teachers working in a plurality of senior high schools of the whole nation is held on August 3. This meeting is held to introduce various athletic coaching examples in an actual teaching load and discuss various problems in the athletic education performed in each of the schools" is stored in the storing unit 101, a similarity S3 between the third document and the input sentence is calculated in the same manner to be 7.
Therefore, because the similarity S3 is higher than the similarity S1 and the similarity S2, the third document is ranked at a higher level than the first and second documents. However, it is apparent that the contents of the first document match the user's retrieval intention more closely than those of the third document.
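The misranking can be reproduced with a short sketch that applies the same simple estimation function to all three documents. The occurrence frequencies are counted from the documents quoted in this section; the function and variable names are assumptions for this sketch.

```python
def simple_similarity(query_weights, doc_freqs):
    """Sum of (weighting factor * occurrence frequency of the agreeing retrieval word)."""
    return sum(weight * doc_freqs.get(word, 0) for word, weight in query_weights.items())


query_weights = {"athletic": 1, "meeting": 1, "hold": 1}

# Occurrence frequencies of the agreeing retrieval words in the three documents.
doc_freqs = {
    "first": {"athletic": 2, "meeting": 3, "hold": 1},
    "second": {"athletic": 1, "meeting": 1, "hold": 1},
    "third": {"athletic": 3, "meeting": 2, "hold": 2},
}

scores = {name: simple_similarity(query_weights, freqs) for name, freqs in doc_freqs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Because the estimation value depends only on how often the query words occur, the third document is placed first even though its theme is a research meeting for teachers rather than the holding of an athletic meeting.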