The present invention relates to an information retrieval system and method which analyzes and summarizes the information contained in a group of texts and identifies similar words and word collections.
"Information retrieval" is the process of selecting and presenting specific items from within a large and heterogeneous collection of texts, according to users' descriptions of the subjects in which they are interested.
Some information retrieval systems index all the words appearing in all the texts; others index "keywords", which are descriptors assigned to each text by the text's author or by someone else. In both cases, the user who wants to find a text does so by asking for a search on a particular word, on logical (Boolean) combinations of words, on words with some maximum distance (or similar relationship) between them in the texts, and so on. In addition to requesting a specific word or words, most systems (e.g., LEXIS™ and DIALOG™) allow the user to search for a character string.
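By way of illustration only, the full-text indexing and the Boolean and distance-limited searches described above might be sketched as follows. This is a minimal sketch and not the implementation of any system named herein; the function names `build_index`, `search_and` and `search_near`, and the sample texts, are assumptions introduced for the example.

```python
import re
from collections import defaultdict

def build_index(texts):
    """Map each word to {text_id: [word positions]} for every word in every text."""
    index = defaultdict(lambda: defaultdict(list))
    for text_id, text in texts.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][text_id].append(pos)
    return index

def search_and(index, *words):
    """Boolean AND: ids of the texts containing every requested word."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def search_near(index, a, b, max_distance):
    """Ids of the texts in which words a and b occur within max_distance words."""
    hits = set()
    for text_id in search_and(index, a, b):
        positions_a = index[a][text_id]
        positions_b = index[b][text_id]
        if any(abs(pa - pb) <= max_distance
               for pa in positions_a for pb in positions_b):
            hits.add(text_id)
    return hits

# Hypothetical sample collection.
texts = {
    1: "The lawyer filed a patent claim",
    2: "A patent was granted; years later the lawyer retired",
    3: "Trademark law overview",
}
index = build_index(texts)
print(search_and(index, "patent", "lawyer"))      # texts 1 and 2
print(search_near(index, "lawyer", "patent", 3))  # text 1 only
```

Both searches return every text that logically satisfies the request, whether or not the text is actually of use to the user.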
A typical search request, on traditional systems, generates a long list or a large collection of texts all of which logically satisfy the search criterion, but only a small percentage of which will actually be of use. The user is forced to expend much time and much energy winnowing (searching) through the texts found by the system, to pick out those truly relevant to his needs.
This problem originates from the fact that the user typically does not have EXACT knowledge in advance of how the subjects of interest to him will have been described.
If his description is very specific, he will lose information: anything relevant to his needs but described in a slightly different manner will not be found by the system.
If his description is very general, many irrelevant texts will be found also, and the winnowing process will be costly, time consuming, and tiring.
For this reason:
On the level of office systems and personal computers, despite the proliferation of computers and the wide use of STRUCTURED data-bases, the use of personal and interpersonal catch-all text-based information retrieval systems is almost unknown--the bother and the overhead involved in using traditional systems are too great to make the effort worth the trouble.
On the level of massive public data-bases, most bibliographic information systems have attempted to solve the problem by limiting users to a predetermined vocabulary of acceptable keywords. (Users have a reasonable chance of guessing what their subject will have been called, since both users and authors are confined to that published list of keyword possibilities.) This solution has been workable but at a price:
(1) To be an effective user of such databases one must study and develop expertise in the use of the system. They are, thus, inaccessible to untrained users, and inappropriate for casual use.
(2) Because of their rigid structure, such systems are of limited use (and indeed are little used) in dynamic environments such as would be found, for example, in the case of an unstructured corporation-wide catch-all collection of information.
Two general types of information retrieval systems and methods currently in use are as follows:
(1) In a first method, once the information retrieval systems (whatever their selection methodology) have isolated or identified the group of texts which satisfy the user's search criterion, the systems present the user with a count of the number of texts within that group, and the opportunity of sequentially reviewing the texts which are members of that group.
The user either looks through the texts themselves, one by one, or looks through sequential listings of some part of the information available about each text: that is, the user may choose to review a sequential list of the titles of the texts, or abstracts of the texts, or lists of keywords of the texts, or the initial paragraphs of the texts, or the dates and origins of the texts, or some combinations of the above. The user is then given some method of specifying (usually by number) those texts for which he wishes fuller information, printouts, etc.
An example of this type of information retrieval system is the DIALOG™ information retrieval system. DIALOG™ provides a user with the number of records (texts) satisfying the search request. The user can then request that any or all of the records be displayed and/or printed in any one of a number of formats containing varying amounts of information.
(2) A second method is generally used when the number of texts presented by an initial search is too large, or the original search criterion was too general, to make it practical for the user to look through sequential listings to pick out the texts he wants. This method is essentially an extension of the original Boolean search facility: the user can ask for additional searches to be made, and can then manipulate the additional lists of texts thus generated by requesting that further lists be created based on Boolean combinations of the preceding lists (e.g., the new list to include all the texts on list "A" and also on list "B" but to exclude any which appear on list "C", etc.).
An example of this type of information retrieval system is DIALOG™, where the user can make additional search requests, and create new lists of texts based on Boolean combinations of preceding lists.
Another example of this type of information retrieval system is the LEXIS™ system, wherein a user can modify his/her search request in an effort to narrow down the number of cases (texts) developed from the initial search request.
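The Boolean combination of previously generated lists described under the second method can be illustrated with ordinary set operations. The following is a sketch only; the function name `combine_lists` and the sample record numbers are assumptions, and no particular system's command syntax is implied.

```python
def combine_lists(list_a, list_b, list_c):
    """Return the texts on list A and also on list B, excluding any on list C."""
    return sorted((set(list_a) & set(list_b)) - set(list_c))

# Hypothetical result lists from three earlier searches (record numbers).
list_a = [1, 2, 3, 4, 5]
list_b = [2, 3, 5, 8]
list_c = [3]

print(combine_lists(list_a, list_b, list_c))  # [2, 5]
```

The combined list is narrower than its inputs, but the user must still review its members one by one.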
These methods, Boolean combinations of lists and sequential screening or printouts of the texts themselves or of some subset of the information available about each text, generally constitute the state of the art in information retrieval at this time, for the phase of the retrieval process extending between the point at which the retrieval system has identified a group of texts as being responsive to the search criterion, and the point at which the user chooses and is presented with the individual texts which he judges to be actually germane to his needs.
In addition to other advantages, the present invention solves the problems described above by making it possible for the user to see at a glance a break-down of the types of information contained in the texts selected by his initial request. From the generated display, the user can easily and quickly choose the texts which are relevant to his true interest.
The present invention also relates to a system and method for identifying words in a target word list which are similar to a source word, and/or for identifying phrases or sentences in a target population which are similar to a source phrase or sentence.
Computer programs are used in a number of contexts to obtain words which are "similar" to some given source word, most notably in indexing and information retrieval programs and in spelling checkers. In indexing and retrieval programs, the purpose of such a search for "similar" words is to provide a more exhaustive list of terms related to the input word, such as plurals or forms modified by prefixes or suffixes. In the case of spelling checkers, the purpose is to be able to make a suggestion as to the most likely word the user had intended, once a word is encountered which does not appear in the program's dictionary.
In the traditional and simplest solution to this problem, most often used in indexing and retrieval programs, the user specifies the exact nature of the relationship between the source word and the words being sought by means of "wild card" symbols, most typically the `?` and `*` characters. In this protocol, the user instructs the program exactly which parts of a word he is interested in matching, and in which parts other characters may appear, the question mark `?` being used to signify any individual character and the asterisk `*` any sequence of characters. Thus, by way of example, the user would ask for "law*" if he intended to find words like "laws", "lawyer" or "lawless". Or a search for "analy?e" could be used to locate both the American (with a `z`) and the British (with an `s`) spellings of the word.
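The wild card protocol described above can be sketched with the Python standard library, whose `fnmatch` module happens to use the same meanings for `?` and `*`. The function name `wildcard_search` and the sample word list are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, words):
    """Return the words matching a pattern in which '?' stands for any
    single character and '*' for any sequence of characters."""
    return [w for w in words if fnmatchcase(w, pattern)]

# Hypothetical index vocabulary.
words = ["law", "laws", "lawyer", "lawless", "lawn",
         "analyze", "analyse", "analysis", "flaw"]

print(wildcard_search("law*", words))
# ['law', 'laws', 'lawyer', 'lawless', 'lawn']
print(wildcard_search("analy?e", words))
# ['analyze', 'analyse']
```

Note that "law*" also returns "lawn": the user obtains exactly what the pattern specifies, whether or not the words found are related in meaning to the source word.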
In the case of spelling checkers, a more flexible approach is needed, since the user does not usually know that he has made a spelling mistake, nor does he know in advance the relationship between the way he thinks a word is spelled and the way it is in fact spelled. Most typically, spelling checkers locate "similar" words by first restricting the search to words beginning with the same letter as the misspelled word and then using a list of common spelling and typographical errors to find words which differ from the source word only by such errors.
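A minimal sketch of this typical spelling-checker approach follows. The table `COMMON_ERRORS`, the function names `variants` and `suggest`, and the sample dictionary are hypothetical and chosen only for illustration; an actual spelling checker would use a far larger error list.

```python
# Hypothetical table of common confusions: (as typed, as intended).
COMMON_ERRORS = [
    ("ie", "ei"), ("ei", "ie"),
    ("l", "ll"), ("ll", "l"),
    ("e", "a"), ("a", "e"),
    ("c", "s"), ("s", "c"),
]

def variants(word):
    """Spellings reachable from word by undoing one listed error
    or one transposition of adjacent letters."""
    out = set()
    for wrong, right in COMMON_ERRORS:
        start = word.find(wrong)
        while start != -1:
            out.add(word[:start] + right + word[start + len(wrong):])
            start = word.find(wrong, start + 1)
    for i in range(len(word) - 1):
        out.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])
    out.discard(word)  # swapping identical letters reproduces the word itself
    return out

def suggest(misspelled, dictionary):
    """Restrict the search to words sharing the first letter, then keep
    those differing from the source word only by a listed error."""
    same_initial = {w for w in dictionary if w[:1] == misspelled[:1]}
    return sorted(variants(misspelled) & same_initial)

dictionary = ["receive", "recipe", "relieve", "separate", "deceive", "the"]
print(suggest("recieve", dictionary))   # ['receive']
print(suggest("seperate", dictionary))  # ['separate']
print(suggest("teh", dictionary))       # ['the']
```

An error in the first letter, or any error absent from the table, defeats this sketch, which reflects the limited flexibility of the approach.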
An alternate approach used by spelling checkers is to convert the word to an approximate phonetic form, and then search a dictionary of such phonetic words, on the assumption that the user typically has a clearer idea of how the word sounds than of how it is spelled. This last approach is usually quite effective at finding spelling errors, though it suffers from the drawback of being unable to deal with typographical mistakes. This technique is therefore quite commonly combined with elements of the previously mentioned approach, in order to obtain a more comprehensive list of possible words.
Some information retrieval programs use the phonetic approach also: along with a regular index of words (or of keywords) in their textbase, they create a parallel index in which those same words are represented phonetically. Search requests are then converted to phonetic format and the attempt is made to locate the search words' phonetic translation in the phonetic index. An example of this is the COMPUMARK™ system which is used in searching for trademarks.
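A parallel phonetic index of this kind can be sketched as follows. The sketch uses a Soundex-style code as the phonetic form, which is an assumption made for illustration; the phonetic representation actually used by the systems mentioned above is not stated here. The names `phonetic_key`, `build_phonetic_index` and `phonetic_search` are likewise hypothetical.

```python
from collections import defaultdict

# Soundex-style consonant classes; vowels, h, w and y carry no code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def phonetic_key(word):
    """Approximate phonetic form: first letter plus three digits."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (key + "000")[:4]

def build_phonetic_index(words):
    """Parallel index mapping each phonetic form to the words having it."""
    index = defaultdict(list)
    for w in words:
        index[phonetic_key(w)].append(w)
    return index

def phonetic_search(index, query):
    """Convert the request to phonetic form and look it up in the index."""
    return index.get(phonetic_key(query), [])

index = build_phonetic_index(["Smith", "Smyth", "Jones", "Robert", "Rupert"])
print(phonetic_search(index, "Smithe"))  # ['Smith', 'Smyth']
print(phonetic_search(index, "Dmith"))   # []
```

The second request shows the drawback noted above: a typographical slip in the first letter changes the phonetic form, so the intended word is not found.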
Regarding "similar sentences", the state-of-the-art is more simply described. There are complex systems which actually parse sentences into their component parts of speech and analyze the semantic relationships among those parts; however, the applicant is not aware of any retrieval systems in which the sequencing of the words in a search request (as distinguished from the identity of the search-request words and the specified logical relationships among them) is used to influence the choice of the texts to be retrieved, or the ordering or ranking of the texts once they are found.