1. Field of the Invention
This invention relates to a document retrieval method, and more particularly to a document retrieval method capable of full text searching without the need for keyword or context-based information. This method can be used to identify, retrieve, and sort documents by topic or language. This method is also useful for identifying, retrieving, and sorting any form of communication such as acoustic signals (e.g., speech) and graphic symbols (e.g., pictures) that can be represented in machine readable format.
2. Description of Related Art
In "DARPA Eyes Smart Text Sifters", an article by R. Colin Johnson published in Electronic Engineering Times, Feb. 17, 1992, p. 35, it was indicated that extensive research efforts have been expended to find better ways of searching textual databases in order to retrieve documents of concern to the user. The article also indicated that several fundamental problems stand in the way of any meaningful breakthrough.
One technique to improve searches has been to create specialized hardware that can process information faster. The problem with this approach is that the improvements in processing speed have not kept pace with the rate at which database information has expanded. The same article stated that a fundamental theoretical breakthrough was required to improve the way information is retrieved from large databases.
Conventional information retrieval systems are still based on using keywords or phrases with operators (e.g., and, or, not) to identify documents of interest. The problem with this technique is that a document may contain a synonym of the keyword rather than the keyword itself (e.g., car vs. automobile), or an inflected form of the keyword (e.g., retrieving vs. retrieve). Such systems are typically sensitive to spelling or data-transmission errors at the input. The operators may also be difficult to use. Additional problems include identifying appropriate keywords, identifying appropriate synonyms, and retrieving too few, too many, or extraneous documents. Typically an extensive table of synonyms is used to mitigate these problems, but this method increases memory requirements and slows processing.
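The limitations described above can be illustrated with a minimal sketch of exact-match keyword retrieval. The function names (`tokenize`, `boolean_match`) and the sample sentences are hypothetical and are not drawn from any cited system; the sketch assumes a simple whole-word match with and, or, and not operators.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_match(document, all_of=(), any_of=(), none_of=()):
    """Hypothetical exact-match keyword test with and/or/not operators.

    all_of  : every keyword must be present (and)
    any_of  : at least one keyword must be present, if any are given (or)
    none_of : no keyword may be present (not)
    """
    tokens = tokenize(document)
    if not all(k.lower() in tokens for k in all_of):
        return False
    if any_of and not any(k.lower() in tokens for k in any_of):
        return False
    return not any(k.lower() in tokens for k in none_of)

# Illustrative documents (assumed, not from the source).
exact = "A car is parked outside."
synonym = "An automobile is parked outside."
inflected = "Retrieving documents is slow."
misspelled = "We retreive documents daily."

print(boolean_match(exact, all_of=["car"]))            # keyword present
print(boolean_match(synonym, all_of=["car"]))          # synonym is missed
print(boolean_match(inflected, all_of=["retrieve"]))   # inflected form is missed
print(boolean_match(misspelled, all_of=["retrieve"]))  # spelling error is missed
# A synonym list supplied by hand recovers the second document.
print(boolean_match(synonym, any_of=["car", "automobile"]))
```

Only the first and last calls report a match. The last call shows why synonym tables are commonly added, and also why they grow: each keyword needs its own hand-built list of alternatives.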
Another problem with keyword searches is that the meaning of the keyword usually depends on the context in which it is used. Therefore, without some indication of the desired context of the keyword, the chances of retrieving unwanted documents are great. Prior approaches to document retrieval have attempted to overcome this problem by adding contextual information to the search using techniques such as context vectors, conceptual graphs, semantic networks, and inference networks. These techniques also increase memory requirements and slow processing. Adding context information also requires a significant amount of time from a trained individual.
In "Global Text Matching for Information Retrieval", a published article by G. Salton and C. Buckley in Science, Vol. 253, Aug. 30, 1991, pp. 1012-1015, it has been indicated that text analysis using synonyms is cumbersome and that text analysis using a knowledge-based approach is complex. This same article indicates that text understanding must be based on context and the recognition of text portions (i.e., sections of text, paragraphs or sentences).
In "Developments in Automatic Text Retrieval", a published article by G. Salton in Science, Vol. 253, Aug. 30, 1991, pp. 974-980, the present state of document retrieval is summarized. It indicates that text analysis is a problem because there is a need to retrieve only documents of interest from large databases. The typical solution to this problem has been to generate content identifiers. This has been done because the meaning of a word cannot adequately be determined by consulting a dictionary without accounting for the context in which the word is used. It was indicated that the words in the text can also be used for context identification. Such retrieval systems are defined as full text retrieval systems.
In "N-gram Statistics for Natural Language Understanding and Text Processing", a published article by C. Suen in IEEE Transactions On Pattern Analysis and Machine Intelligence, Vol. PAMI-1, No. 2, April 1979, two methods of processing natural language were described, one using keywords and a dictionary and one using n-grams. In the keyword approach, words are compared. In the n-gram approach, strings of letters are compared. Comparing strings of letters is faster and requires less memory than a keyword and dictionary method.
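The difference between comparing words and comparing strings of letters can be sketched as follows. This is an illustration only and not the method of the cited article: the choice of three-letter n-grams, the Jaccard overlap measure, and the function names `char_ngrams` and `ngram_similarity` are all assumptions made for the example.

```python
def char_ngrams(text, n=3):
    """Return the set of n-letter substrings of text, lowercased."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of two n-gram sets, from 0.0 (disjoint) to 1.0 (same).

    The overlap measure is an assumption chosen for illustration.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0  # a string shorter than n has no n-grams to compare
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# An exact word comparison treats these as different words ...
print("retrieve" == "retrieving")
# ... while the letter strings overlap substantially.
print(ngram_similarity("retrieve", "retrieving"))
# A transposition error still leaves some n-grams in common.
print(ngram_similarity("retrieve", "retreive"))
# Synonyms with no letters in common are not related by n-grams alone.
print(ngram_similarity("car", "automobile"))
```

In this sketch "retrieve" and "retrieving" share five of their nine distinct three-letter strings, so an inflected form scores well without a dictionary or stem table, whereas "car" and "automobile" share none.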
In U.S. Pat. No. 5,020,019, entitled "Document Retrieval System", a system is described that searches documents using keywords with a learning feature that allows the user to assign weight to the different keywords in response to the result of a previous search. The present invention does not use a keyword approach.
In U.S. Pat. No. 4,985,863, entitled "Document Storage and Retrieval", a method is described where documents are stored in sections. Sections of text, rather than keywords, are then used to retrieve similar documents. The present invention does not use a keyword or sectioning approach.
In U.S. Pat. No. 4,849,898, entitled "Method and Apparatus to Identify the Relation of Meanings Between Words in Text Expressions", a method is described that uses a letter-semantic analysis of keywords and words from a document in order to determine whether these words mean the same thing. This method is used to retrieve documents or portions of documents that deal with the same topic as the keywords. The present invention does not use semantic analysis.
In U.S. Pat. No. 4,823,306, entitled "Text Search System", a method is described that generates synonyms of keywords. Different values are then assigned to each synonym in order to guide the search. The present invention does not generate synonyms.
In U.S. Pat. No. 4,775,956, entitled "Method and System for Information Storing and Retrieval Using Word Stems and Derivative Pattern Codes Representing Families of Affixes", a method is described that uses a general set of affixes that are used to modify each keyword stem. This method reduces memory requirements that would otherwise be needed to store the synonyms of each keyword. The present invention does not modify keyword stems.
In U.S. Pat. No. 4,358,824, entitled "Office Correspondence Storage and Retrieval System", a method is described that reduces documents to abstracts by recording the keywords used in each document. Keywords are then used to search for the documents of interest. The present invention does not replace the text of stored documents with keyword abstracts.