I. Field of the Invention
This invention relates to a method and apparatus for computerized associative comprehension between two natural language expressions.
II. Description of the Related Art
Computerized text retrieval systems aid a user in searching large numbers of computer-coded documents to find one or more documents containing keywords or specific combinations of keywords. Text retrieval systems, also called information data bases, contain one or more libraries of full text documents or specially coded summaries of documents. These summaries are usually limited to titles, abstracts, keywords and an index to the location of the hardcopy full text information. The text or summaries of the documents are usually stored on a magnetic disk in ASCII (American Standard Code for Information Interchange) form.
Typically, if a computer library contains full text documents, the documents have been "indexed" to provide special index word search files to speed the search. An example of an indexed full text retrieval system that runs on MS DOS based microcomputers is ZyINDEX by ZyLAB Corporation of Chicago, IL. A large commercial data base, DIALOG, run by Dialog Information Services of Palo Alto, CA, comprises data bases that provide both truncated documents consisting of title, abstract, keywords, and location information, and full text documents.
Advanced text retrieval systems use Boolean logic and literal character matching to match the characters of words in a user supplied query expression to words in a text document. A typical query expression may be "(TEXT W/3 RETRIEV!) OR (DATA W/5 BASE)." Taking this query, the computer searches each keyword index or each document in the database for a literal match to the word "TEXT" or "TEXTS" (the "S" is presumed by the search program) within three words ("W/3") of "RETRIEVE," "RETRIEVAL," "RETRIEVES," etc., where the "!" instructs the computer to accept all words starting with "RETRIEV." The computer also attempts to match "DATA" within five words of "BASE" or "BASES" because of the logical "OR" command.
The search consists of literal letter matching, i.e., if the word "TEXT" is found in the document along with "RETRIEVERS," a match is indicated. This may or may not be a relevant document; the phrase in the document could be "text on golden retrievers," whereas the user was searching for documents containing the phrase "text retrieval." Similarly, if a pertinent document discusses document retrieval, no match is found because the user specified "TEXT" in the query.
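The literal matching behavior described above can be illustrated with a minimal sketch. It is not the implementation of any particular retrieval product; the function names (tokenize, term_matches, within, query) are hypothetical, and the sketch assumes that "W/n" means the two matched words lie at most n word positions apart, that "!" denotes prefix truncation, and that a trailing "S" is accepted on untruncated terms.

```python
import re

def tokenize(text):
    # Split into upper-case alphabetic words; punctuation is discarded.
    return re.findall(r"[A-Za-z]+", text.upper())

def term_matches(pattern, word):
    # Assumption: "RETRIEV!" accepts any word starting with "RETRIEV";
    # an untruncated term accepts itself or itself plus a presumed "S".
    if pattern.endswith("!"):
        return word.startswith(pattern[:-1])
    return word == pattern or word == pattern + "S"

def within(text, left, right, distance):
    # True if some match of `left` lies within `distance` word
    # positions of some match of `right` (assumed meaning of W/n).
    words = tokenize(text)
    left_pos = [i for i, w in enumerate(words) if term_matches(left, w)]
    right_pos = [i for i, w in enumerate(words) if term_matches(right, w)]
    return any(abs(i - j) <= distance for i in left_pos for j in right_pos)

def query(text):
    # (TEXT W/3 RETRIEV!) OR (DATA W/5 BASE)
    return within(text, "TEXT", "RETRIEV!", 3) or within(text, "DATA", "BASE", 5)

print(query("text on golden retrievers"))               # irrelevant, yet matched
print(query("a survey of document retrieval methods"))  # relevant, yet missed
```

Under these assumptions the sketch reproduces both failure modes: the phrase about golden retrievers satisfies the query, while a document on document retrieval does not.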
Some document retrieval systems use weighted word searches, where the number of occurrences of a word in a document determines the order of relevance. Documents with the most occurrences of the keyword are assumed to be the most relevant and are brought to the user's attention.
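A weighted word search of this kind can be sketched as a simple occurrence count. The following is an illustrative assumption rather than the method of any specific system: the function name rank_by_keyword_count and the sample documents are hypothetical, and ties are broken alphabetically by document name purely for determinism.

```python
import re
from collections import Counter

def rank_by_keyword_count(documents, keyword):
    # documents: mapping of document name -> full text (hypothetical layout).
    # Returns (name, count) pairs, most occurrences first; documents
    # that never contain the keyword are left out.
    keyword = keyword.upper()
    scored = []
    for name, text in documents.items():
        counts = Counter(re.findall(r"[A-Za-z]+", text.upper()))
        if counts[keyword] > 0:
            scored.append((name, counts[keyword]))
    # Sort by descending count, then by name to keep the order stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

docs = {
    "a": "text retrieval of text and more text",
    "b": "golden retrievers text",
    "c": "document retrieval",
}
print(rank_by_keyword_count(docs, "text"))
```

The ranking depends only on how often the literal keyword appears, so document "c", which concerns retrieval but never uses the word "text", is not ranked at all.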
Literal text retrieval by character and word matching is essential where an exact word or known expression must be found, but it has major drawbacks. The user constructing the query must be familiar with the author's word usage and document structure; otherwise relevant documents will be omitted. The user must also be proficient in the use of the query language in order to construct a successful search, that is, one that retrieves only the most pertinent documents while ignoring extraneous documents that use similar words in different contexts.
Weighted word searches again require knowledge of the author's word usage and document structure; otherwise irrelevant documents are retrieved and important documents are missed. If a user wishes to find documents containing certain concepts or information, regardless of the specific terminology used in the document, a character or word matching search is difficult and prone to failure (i.e., missing relevant documents and/or gathering irrelevant documents) unless the user can guess the exact words used in each searched document to discuss the relevant concept. Different authors frequently use different words to convey the same meaning.
It is therefore an object of the present invention to provide a computerized text associative comprehension system that will identify the documents or portions of documents representing concepts related to a user-selected query expression, regardless of whether the searched text literally matches the query expression words.
It is also an object of the present invention to provide a text and document retrieval system that uses plain language query expressions.
It is also an object of the present invention to provide a text and document retrieval system that is relatively language independent; that is, the query expression words may be written in one language and the text to be associated may be in a number of different languages.
It is also an object of the present invention to provide a text and document retrieval system that has an unlimited vocabulary in several languages.
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities, methods and combinations particularly pointed out in the appended claims.