Although computer programs can analyze some simple, isolated sentences in English and other natural languages, they are not likely to be able to understand free-form natural language text for many years. Consequently, there are large bodies of text which are difficult to access and are, therefore, either not utilized at all or significantly under-utilized.
The standard technique for accessing such free-form text is a "key word" search, in which key words or terms are combined, generally with boolean connections between them. An "inverted" file or key word index may be provided to reduce the time required for such a search.
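By way of illustration only, a boolean key word search over an inverted file might be sketched as follows. The function names (`tokenize`, `build_inverted_index`, `search_and`, `search_or`) and the three sample documents are hypothetical and are not taken from any particular retrieval system; the sketch assumes the whole collection fits in memory.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lower-case alphanumeric key words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each key word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search_and(index, *terms):
    """Documents containing every term (boolean AND)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_or(index, *terms):
    """Documents containing at least one term (boolean OR)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

# Hypothetical sample collection.
docs = {
    1: "The patent claims a method for text retrieval.",
    2: "A legal precedent concerning patent infringement.",
    3: "Medical papers on retrieval of patient records.",
}
index = build_inverted_index(docs)

print(search_and(index, "patent", "retrieval"))
print(search_or(index, "patent", "retrieval"))
```

With this sample collection, the AND query matches only document 1, while the OR query matches all three documents. The index is consulted in place of the text itself, which is how an inverted file reduces search time.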
Such key word searching techniques have a number of drawbacks. First, such systems are not adapted to accept a search request in the form in which it would normally be posed by a user. Instead, the user must be able to determine the terms and the boolean connections which will yield the desired information. Such a procedure requires a certain level of sophistication on the part of the user. Even with a sophisticated and experienced user, because of the vagaries of the natural language utilized, and for other reasons, a search is frequently drawn too broadly, so as to yield far more "hits" than the user wishes to review, or too narrowly, so that relevant items are missed. In many instances, a search strategy is evolved by trial and error until an acceptable number of relevant hits is obtained. While some front-end programs are available for assisting a user in developing a search strategy, the use of such front-end programs is also time-consuming and, depending on the skill of the user, may still not result, at least initially, in a proper search strategy being evolved.
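The difficulty of drawing a search neither too broadly nor too narrowly can be seen in a small, purely hypothetical example. The four documents and the helper names (`words`, `match_all`, `match_any`) below are assumptions made for illustration; matching is on exact word forms, as in a simple key word system.

```python
import re

# Hypothetical collection; documents 1, 3 and 4 concern text retrieval.
docs = {
    1: "Patents covering text retrieval methods.",
    2: "A patent on automobile brakes.",
    3: "Retrieving documents from a legal database.",
    4: "Text retrieval using an inverted file.",
}

def words(text):
    """Set of lower-case words appearing in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def match_all(terms):
    """Boolean AND: ids of documents containing every term."""
    return sorted(d for d, t in docs.items() if set(terms) <= words(t))

def match_any(terms):
    """Boolean OR: ids of documents containing at least one term."""
    return sorted(d for d, t in docs.items() if set(terms) & words(t))

query = ["patent", "retrieval"]
print(match_all(query))  # too narrow
print(match_any(query))  # too broad
```

Here the AND query returns no documents at all, because document 1 uses the plural "patents" rather than "patent". The OR query returns documents 1, 2 and 4, which includes the irrelevant document 2 and still misses document 3, which uses "retrieving". The user would have to refine the terms by trial and error to obtain an acceptable result.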
A second problem with key word searches is that they in fact require that every word of the text be looked at during a search. Since textual databases such as those containing legal cases, scientific or medical papers, patents and the like may contain millions of words of text, a full key word search can often be a time-consuming procedure. This problem is generally dealt with through use of inverted file indexes which lead a user to an area containing a particular key word so that the entire body of text need not be searched. However, inverted file arrangements are not appropriate in all situations. The basic problem is the size of the inverted index, which can equal, or even exceed, the size of the main document file. Further, in order to maintain the usefulness of the inverted file, it is necessary that the large inverted file of key words be updated each time the main database is updated. For these and other reasons, inverted files are generally appropriate only if:
(1) the vocabulary of the text or database is homogeneous and standardized, as in medicine and law, and relatively free of spelling errors;
(2) contiguous word and word proximity processing (where the location of the search words in the text is important) is not necessary and, hence, extensive location information is not required (i.e. extensive boolean connections between words are not critical in the key word search process);
(3) a limited class of text words (rather than the full text) is used for search purposes so that only these terms need appear in the inverted index; and
(4) the database is not too large.
When the criteria indicated above are not met, an inverted file may be of only limited use in reducing the time required for a given search.
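The relationship between location information and index size noted above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names (`build_positional_index`, `within`) and the two sample documents are hypothetical, and word positions are simple token offsets within each document.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lower-case alphanumeric key words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]} so proximity can be tested."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def within(index, first, second, max_gap):
    """Ids of documents where the two words occur at most max_gap positions apart."""
    a = index.get(first, {})
    b = index.get(second, {})
    hits = set()
    for doc_id in a.keys() & b.keys():
        if any(abs(p - q) <= max_gap for p in a[doc_id] for q in b[doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample collection.
docs = {
    1: "key word search of a large text database",
    2: "the search took a long time because every word was read",
}
index = build_positional_index(docs)

# One position entry is stored for every word occurrence in the text.
total_tokens = sum(len(tokenize(t)) for t in docs.values())
total_postings = sum(len(p) for by_doc in index.values() for p in by_doc.values())
print(total_tokens, total_postings)

print(within(index, "word", "search", 1))
print(within(index, "word", "search", 10))
```

Because a position is recorded for every occurrence of every word, the number of stored entries equals the number of words in the text, before counting the key words and document identifiers themselves. This is one way to see why an inverted file that supports proximity processing can approach or exceed the size of the main document file, and why it must be rebuilt or extended whenever the database is updated.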
Thus, since the quantity of text is ever increasing, it becomes progressively more difficult to rapidly locate text relevant to a particular question, assertion or the like, and it becomes virtually impossible to locate all text relevant to such an item. Yet, particularly in such areas as law and patents, failure to locate a relevant precedent or patent can have catastrophic results.
Therefore, a need exists for an improved method and apparatus for retrieving relevant text from large databases, and in particular for permitting such retrieval to be accomplished by a relatively unsophisticated user. Preferably, the user should merely be required to present a query in English (or other natural language) in a way which is normal for the user, with few or no preconditions attached.
It should also be possible to perform searches on all types of text, and to complete each search expeditiously. To improve the response time on database searches, it is preferable that it not be necessary to refer to each word of the text during a retrieval operation. It is also desirable that, once relevant text is located, it be possible to locate other relevant text from the initially located relevant text without necessarily requiring a further search of the textual material.