Huge quantities of material are now stored in various computer databases, and this quantity of information is growing each year at an astonishing rate. Further, not only is the quantity of material increasing, but the types of material which can be digitized for storage in a computer database are also increasing each year. For example, technology has made it possible to store not only material represented in alphanumeric form, such as text, computer programs, object classes, action specifications and related data, but also all types of graphic materials, including pictures, photographs, drawings, charts, video images and the like, as well as audio recordings, including voice images and prints and all types of music.
However, having such large quantities of material stored is of little value unless it is possible to retrieve a desired item of the material or to determine whether such an item is available in the database. Unfortunately, while current techniques are moderately adequate for accessing certain types of textual databases, such techniques are virtually non-existent for accessing other types of textual databases and for accessing graphic, audio or other specialized databases. Consequently, there are large bodies of computer-stored material which are either difficult or virtually impossible to access and are, therefore, either not utilized at all or significantly underutilized.
In particular, the standard technique for accessing free-form texts is by means of a "key word" combination, generally with boolean connections between the words or terms. An "inverted" file or key word index may be provided to reduce the time required for such a search.
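The key word technique described above can be illustrated with a minimal sketch. It assumes a tiny in-memory collection; the names `tokenize`, `build_inverted_index`, `search_and` and `search_or`, and the sample documents, are hypothetical and are not taken from any particular retrieval system.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search_and(index, *terms):
    """Documents containing every term (boolean AND)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, *terms):
    """Documents containing at least one term (boolean OR)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

# Hypothetical sample documents, for illustration only.
docs = {
    1: "Patent search requires locating prior patents.",
    2: "Legal cases cite prior precedent.",
    3: "A patent may cite legal precedent.",
}
index = build_inverted_index(docs)

narrow = search_and(index, "patent", "precedent")  # only document 3
broad = search_or(index, "patent", "precedent")    # documents 1, 2 and 3
```

In this sketch the index is consulted instead of the document text, which is the time saving an inverted file is meant to provide. Note that the exact-match lookup treats "patent" and "patents" as unrelated terms, so the user must still choose the terms and connectives with care.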
Such key word searching techniques have a number of drawbacks. First, such systems can only be used for text and are of no use for other types of material. Second, such systems are not adapted to accept a search request in the form in which it would normally be posed by a user. Instead, the user must be able to determine the terms and the boolean connections which will yield the desired information. Such a procedure requires a certain level of sophistication on the part of the user. Even with a sophisticated and experienced user, because of the vagaries of English or whatever other natural language is utilized, and for other reasons, a search is frequently drawn too broadly, so as to yield far more "hits" than the user wishes to review, or too narrowly, so that relevant items are missed. In many instances, a search strategy is evolved by trial and error until an acceptable number of relevant hits is obtained. While some front-end programs are available for assisting a user in developing a search strategy, the use of such front-end programs is also time-consuming and, depending on the skill of the user, may still not result, at least initially, in a proper search strategy being evolved.
A third problem with key word searches is that they in fact require that every word of the text be looked at during a search. Since textual databases such as those containing legal cases, scientific or medical papers, patents and the like may contain millions of words of text, a full key word search can often be a time-consuming and therefore an expensive procedure. This problem is generally dealt with through use of inverted file indexes which lead a user to an area containing a particular key word so that the entire body of text need not be searched. However, inverted file arrangements are not appropriate in all situations. The basic problem is the size of the inverted index, which can equal, or even exceed, the size of the main document file. Further, in order to maintain the usefulness of the inverted file, it is necessary that the large inverted file of key words be updated each time the main database is updated. For these and other reasons, inverted files are generally appropriate only if:
(1) the vocabulary of the text or database is homogeneous and standardized, as in medicine and law, and relatively free of spelling errors;
(2) contiguous word and word proximity processing (where the location of the search words in the text is important) is not necessary and, hence, extensive location information is not required (i.e. extensive boolean connections between words are not critical in the key word search process);
(3) a limited class of text words (rather than the full text) are used for search purposes so that only these terms need appear in the inverted index; and
(4) the database is not too large.
When the criteria indicated above are not met, an inverted file may be of only limited use in reducing the time required for a given search.
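The cost of the location information mentioned in criterion (2) can be seen in a short sketch. It assumes a positional index that records every occurrence of every word; the names `build_positional_index`, `count_entries` and `within`, and the sample documents, are hypothetical.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(docs):
    """Map word -> document id -> list of word positions in that document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def count_entries(index):
    """Total number of stored positions across the whole index."""
    return sum(len(positions)
               for by_doc in index.values()
               for positions in by_doc.values())

def within(index, a, b, max_gap):
    """Documents in which words a and b occur at most max_gap words apart."""
    docs_a = index.get(a, {})
    docs_b = index.get(b, {})
    hits = set()
    for doc_id in docs_a.keys() & docs_b.keys():
        if any(abs(pa - pb) <= max_gap
               for pa in docs_a[doc_id] for pb in docs_b[doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents, for illustration only.
docs = {
    1: "the inverted file must be updated when the database is updated",
    2: "the database is large and the file is larger",
}
index = build_positional_index(docs)
total_words = sum(len(tokenize(text)) for text in docs.values())
```

Because one position is stored per word occurrence, `count_entries(index)` equals `total_words` here: the index holds as many entries as the text has words, before counting the word keys and document ids. This is one way an inverted file that supports proximity searching can approach or exceed the size of the main document file.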
Thus, since the quantity of text is ever-increasing, it becomes progressively more difficult to rapidly locate text relevant to a particular question, assertion or the like, and it becomes virtually impossible to locate all text relevant to such an item. However, particularly in such areas as law and patents, failure to locate a relevant precedent or patent can have catastrophic results. Complete searching for graphics, such as a trademark logo for clearance purposes or a particular picture, for audio material or for other non-textual material, is virtually impossible. The result is that many functions which should or could be handled automatically by a computer are now being performed manually. For example, the "help-desk" function, wherein users of, for example, a particular product or service may contact a particular E-mail address or telephone number to obtain information or assistance in dealing with various problems, could be automated to provide textual, audio or graphic information in response to large classes of queries if stored material relevant to the queries could be easily located. However, the inability of computers to permit information to be retrieved in response to queries from large numbers of potential users, most of whom are unsophisticated in computer techniques, has resulted in the help-desk function normally being performed manually. Since queries frequently arrive in bunches, this can result in disgruntled users having to wait unacceptably long times to receive responses, while some of the people performing the function may be underutilized during other periods. It also means that a user has to rely on the expertise of the person providing the response where, particularly for sophisticated and fast-changing technologies, the person manning the help-desk may not always be fully knowledgeable.
Therefore, a need exists for an improved method and apparatus for retrieving relevant material from large databases, and in particular for permitting such retrieval to be accomplished by a relatively unsophisticated user. Preferably, the user should merely be required to present a query in English (or other natural language) in a way which is normal for the user, with few or no preconditions attached.
It should also be possible to complete searches on all types of text, graphics, audio and other stored material and to complete the search expeditiously. To improve the response time on database searches, it is preferable that it not be necessary to refer to each word or other unit of the material during a retrieval operation, and it is also desirable that, once relevant material is located, it be possible to locate other relevant material from the initially-located relevant material without necessarily requiring further searching. In particular, a match for a section of text should permit location not only of other relevant text, but also of relevant graphics, audio, etc. Such a system should also facilitate the performance of an automated help-desk function wherein queries can be received in a variety of forms and received queries can be responded to automatically where data is available in the system for such a response, or manually (i.e. by a person) where relevant response data is not available from the system. Ideally, the system should be adaptive so that when a response is manually provided, such response is automatically available in the future in response to the same or a similar query.
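The adaptive help-desk behavior described above can be sketched as follows. This is an illustration of the stated requirement only, not of any particular retrieval method: the class `HelpDesk`, the callback `ask_human`, the word-overlap (Jaccard) similarity measure and the 0.6 threshold are all assumptions chosen for the example.

```python
import re

def tokens(text):
    """Set of lowercase alphanumeric words in the text."""
    return frozenset(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two word sets, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

class HelpDesk:
    """Answers automatically when a similar query is stored, else asks a person."""

    def __init__(self, ask_human, threshold=0.6):
        self.ask_human = ask_human  # hypothetical callback to a human operator
        self.threshold = threshold  # assumed similarity cut-off
        self.known = []             # list of (query word set, response) pairs

    def answer(self, query):
        q = tokens(query)
        best_score, best_response = 0.0, None
        for stored, response in self.known:
            score = jaccard(q, stored)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "automatic"
        # No sufficiently similar stored query: fall back to a person and
        # remember the reply so that it is available for later queries.
        response = self.ask_human(query)
        self.known.append((q, response))
        return response, "manual"

# Hypothetical operator who always gives the same reply.
desk = HelpDesk(lambda query: "Hold the reset button for five seconds.")
first = desk.answer("How do I reset the router?")  # answered manually
second = desk.answer("How do I reset my router?")  # answered from the store
```

The first query is passed to the person because nothing is stored yet; the second differs by one word, exceeds the assumed threshold, and is answered from the stored reply. A word-overlap measure is used here only because it is simple; it does not address the natural-language or non-textual queries discussed above.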