The most prevalent prior art commercial text retrieval systems focus on the use of keywords to search for information. These systems typically use a Boolean combination of keywords supplied by the user to retrieve documents from a computer data base. See, for example, column 1 of U.S. Pat. No. 4,849,898, which is incorporated by reference. In general, the retrieved documents are not ranked in any order of importance, so every retrieved document must be examined by the user. This is a serious shortcoming when large collections of documents are searched. For example, some data base searchers begin by reviewing fifty or more displayed documents to find those most applicable. Further, Boolean search systems may require the user to view several unimportant sections within a single document before the important section is viewed.
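The Boolean keyword retrieval described above can be illustrated with a minimal sketch. The document collection, the identifiers, and the function name `boolean_and_search` are hypothetical and chosen only for illustration; the sketch does not reproduce any particular commercial system or the system of the cited patent. It shows that a Boolean AND query returns an unordered set of matching documents with no indication of relative importance.

```python
# Hypothetical toy collection; ids and text are illustrative assumptions.
DOCUMENTS = {
    1: "the bank raised interest rates",
    2: "the river bank flooded after heavy rain",
    3: "interest in river fishing is growing",
}


def boolean_and_search(documents, keywords):
    """Return the ids of documents that contain every keyword (Boolean AND)."""
    wanted = {keyword.lower() for keyword in keywords}
    hits = set()
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        # A document matches only if all query keywords appear in it.
        if wanted <= words:
            hits.add(doc_id)
    # The result is a plain set: nothing ranks one hit above another.
    return hits
```

In this sketch a query for "bank" alone matches both document 1 and document 2, and the user is left to inspect each one to discover which sense of "bank" it concerns.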
A secondary problem with Boolean systems is that they require the user to artificially create semantic search terms every time a search is conducted. Creating a satisfactory query is a burdensome task, and often the user will have to redo the query more than once. The time spent on this task includes expensive on-line search time on the commercial data base.
Using words to represent the content of documents is a technique that also has problems of its own. Because words are ambiguous, documents can be retrieved that are not relevant to the search query. Further, relevant documents can exist that do not use the same words as those provided in the query. Using semantics addresses these concerns and can improve retrieval performance. Prior art has focused on processes for disambiguation. In these processes, the various meanings of words (also referred to as senses) are pruned (reduced) with the hope that the remaining meaning of each word will be the correct one. An example of a well known pruning process is U.S. Pat. No. 5,056,021, which is incorporated by reference.
However, the pruning processes used in disambiguation cause inherent problems of their own. For example, the correct meaning of a word may not be the one selected by these processes. Further, the problems become worse when two separate sequences of words are compared to each other to determine the similarity between the two. If each sequence is disambiguated separately, the correct meaning that the two sequences have in common may be eliminated.
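The risk described above can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the sense inventory `SENSES`, the category labels, and the "keep the first listed sense" rule, which stands in for a pruning step and is not the process of the cited patent. The sketch compares two word sequences once after pruning each word to a single sense and once with all senses retained.

```python
# Hypothetical sense inventory: each word maps to its candidate meanings,
# listed with the (assumed) most frequent sense first.
SENSES = {
    "bank": ["finance", "geography"],
    "deposit": ["finance", "geology"],
    "shore": ["geography"],
    "sediment": ["geology"],
}


def pruned_senses(sequence):
    """Keep one sense per word (the first listed), as a stand-in for pruning."""
    return {SENSES[word][0] for word in sequence}


def all_senses(sequence):
    """Keep every candidate sense of every word in the sequence."""
    return {sense for word in sequence for sense in SENSES[word]}


def common_meanings(senses_a, senses_b):
    """Meanings shared by the two sequences."""
    return senses_a & senses_b


sequence_a = ["bank", "deposit"]
sequence_b = ["shore", "sediment"]

# Pruning each sequence on its own discards the meanings they share.
after_pruning = common_meanings(pruned_senses(sequence_a), pruned_senses(sequence_b))

# Retaining all senses preserves the shared meanings.
without_pruning = common_meanings(all_senses(sequence_a), all_senses(sequence_b))
```

Under these assumptions `after_pruning` is empty, because "bank" and "deposit" are both reduced to their financial sense, while `without_pruning` contains the geography and geology meanings that the two sequences actually share.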
Accordingly, an object of the invention is to provide a novel and useful procedure that uses the meanings of words to determine the similarity between separate sequences of words without the risk of eliminating common meanings between these sequences.