Locating information in large collections of natural language documents (often referred to as text data) is an important problem. Current commercial text retrieval systems generally focus on the use of keywords to search for information. These systems typically use a Boolean combination of keywords supplied by the user to retrieve documents. See, for example, column 1 of U.S. Pat. No. 4,849,898, which is incorporated by reference. In general, the retrieved documents are not ranked in any order of importance, so every retrieved document must be examined by the user. This is a serious shortcoming when large collections of documents need to be searched. For example, some data base searchers begin by reviewing fifty or more displayed documents to find those most applicable.
Statistically based text retrieval systems generally rank retrieved documents according to their statistical similarity to a user's search request (often referred to as the query). Such systems provide advantages to their users over traditional Boolean retrieval methods, mainly because they allow for natural language input.
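The ranking behavior described above can be illustrated with a minimal sketch. It assumes a simple term-frequency representation and cosine similarity as the statistical measure; neither is specified in the text, and the function names (tokenize, cosine, rank) and the sample documents are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Assumed tokenizer: split on whitespace, strip simple punctuation, lowercase.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, documents):
    # Return (score, document index) pairs, most similar document first.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), i) for i, d in enumerate(documents)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return scored

# Hypothetical sample collection and natural language query.
docs = [
    "The shuttle launch was delayed by weather.",
    "Stock prices fell sharply on Monday.",
    "Weather delayed the launch of the rocket.",
]
ranking = rank("what delayed the shuttle launch", docs)
```

Unlike a Boolean system, which would return an unordered set of matching documents, this sketch orders the documents so that the user can examine the most similar ones first.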
A secondary problem exists with Boolean systems: they require the user to artificially create semantic search terms every time a search is conducted. Creating a satisfactory query is a burdensome task, and the user will often have to redo the query more than once. The time spent on this task includes expensive on-line search time connected to the commercial data base.
Using a list of words to represent the content of documents is a technique that also has problems of its own. In this technique, the fact that words are ambiguous can cause documents to be retrieved that are not relevant to the search query. Further, relevant documents can exist that do not use the same words as those provided in the query. Using semantics addresses these concerns and can improve retrieval performance. Prior art has focused on processes for disambiguation. In these processes, the various meanings of words (also referred to as senses) are pruned (reduced) with the hope that the remaining meanings of words will be the correct ones. An example of a well known pruning process is U.S. Pat. No. 5,056,021, which is incorporated by reference.
However, the pruning processes used in disambiguation cause inherent problems of their own. For example, the correct common meaning may not be selected in these processes. Further, the problems become worse when two separate sequences of words are compared to each other to determine the similarity between the two. If each sequence is disambiguated, the correct common meaning between the two may get eliminated.
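The elimination problem described above can be shown with a small sketch. The sense inventory, the sense labels, and the pruning rule (keep only the first-listed sense of each word) are all hypothetical and chosen only for illustration; they are not taken from the cited patents.

```python
# Hypothetical sense inventory: each word lists its possible meanings,
# with the assumed most frequent sense first.
SENSES = {
    "bank": ["land", "finance"],
    "deposit": ["geology", "finance"],
}

def senses_of(words, prune):
    # Collect the meanings of a word sequence.
    # With prune=True, each word keeps only its first-listed sense
    # (an assumed, deliberately naive disambiguation rule).
    result = set()
    for w in words:
        candidates = SENSES.get(w, [])
        if prune:
            candidates = candidates[:1]
        result.update(candidates)
    return result

def common_meanings(seq_a, seq_b, prune):
    # Meanings shared by the two sequences.
    return senses_of(seq_a, prune) & senses_of(seq_b, prune)

pruned = common_meanings(["bank"], ["deposit"], prune=True)
unpruned = common_meanings(["bank"], ["deposit"], prune=False)
```

When each sequence is pruned independently, the shared "finance" meaning is discarded from both sides and nothing in common remains. When all meanings are kept, the shared meaning survives the comparison.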
The inventor of the subject invention has used semantics to avoid the disambiguation problem. See U.S. patent application Ser. No. 08/148,688 filed on Nov. 5, 1993, which issued as U.S. Pat. No. 5,576,954 on Nov. 19, 1996. In the semantic approach, the various meanings of words are not pruned but are combined with the various meanings of other words, and the statistically common meanings for small groups of words yield the correct common meaning for those words. This approach has been shown to improve the statistical ranking of retrieved information. The pruning process for common meaning is thus replaced by a statistical determination of common meaning. Crucial to this approach is the fact that retrieval documents must be small.
Relevance feedback has sometimes been used to improve statistical ranking. In relevance feedback, the judgements of the user concerning viewed information are used to automatically modify the search for more information. Conventional IR (Information Retrieval) systems, however, have limited recall. See G. Salton, Automatic Information Organization and Retrieval, McGraw-Hill, 1968. Because of this limited recall, only a few relevant documents are retrieved in response to user queries if the search process is based solely on the initial query. This indicates a need to modify (or reformulate) the initial query in order to improve performance. During this reformulation, it is customary to search for the relevant documents iteratively, as a sequence of partial search operations. The results of earlier searches can be used as feedback information to improve the results of later searches. One possible way to do this is to ask the user to make a relevance decision on a certain number of retrieved documents. Then this relevance information can be manually used to construct an improved query formulation and recalculate the similarities between documents and query in order to rank them. This process is known as relevance feedback.
A basic assumption behind relevance feedback is that, for a given query, the documents relevant to it should resemble each other in the sense that they have reasonably similar keyword content. This implies that if a retrieved document is identified as relevant, then the initial query can be modified to increase its similarity to that relevant document. As a result of this reformulation, it is expected that more of the relevant documents and fewer of the nonrelevant documents will be extracted. The automatic construction of an improved query is actually straightforward, but it does increase the complexity of the user interface and of the use of the retrieval system, and it can slow down query response time. Essentially, document information viewed as relevant to a query can be used to modify the weights of terms and semantic categories in the original query. A modification can also be made using documents viewed as not relevant to a query.
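The query modification described above can be sketched as follows. The text does not name a specific formula; this sketch substitutes the widely used Rocchio-style update as one possible formulation. The function name reformulate, the weights alpha, beta and gamma, and the sample term weights are assumptions made for illustration only.

```python
def reformulate(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Rocchio-style update (an assumed formulation, not taken from the text):
    # keep the original query weights, add the average weights of documents
    # judged relevant, and subtract the average weights of documents judged
    # nonrelevant. Queries and documents are dicts mapping term -> weight.
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms.update(doc)

    new_query = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        if w > 0:
            # Terms whose weight falls to zero or below are dropped.
            new_query[t] = w
    return new_query

# Hypothetical user judgements on two viewed documents.
initial = {"launch": 1.0}
judged_relevant = [{"launch": 1.0, "shuttle": 1.0}]
judged_nonrelevant = [{"stock": 1.0}]
updated = reformulate(initial, judged_relevant, judged_nonrelevant)
```

In this sketch the reformulated query moves toward the relevant document (the weight of "launch" increases and "shuttle" is added), while "stock", which appears only in the nonrelevant document, is excluded. It also shows how the query grows when every term of a relevant document is added.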
Relevance feedback has several significant problems. First, the original query becomes very large whenever all the words in a viewed relevant document are added to the original query. Second, it takes a long time to read large documents and decide whether they are relevant. Third, often only part of a large document is actually relevant. Other patents have tried to address this problem. See U.S. Pat. No. 5,297,027 to Morimoto et al.
The inventor is not aware of any prior art that combines statistical ranking, semantics, relevance feedback and using sentences (or clauses) as documents when queries are expressed in natural language in order to be able to search for and retrieve relevant documents.