The present invention relates to an information retrieval system and method which analyzes and summarizes the information contained in a group of texts and identifies similar words and word collections.
"Information retrieval" is the process of selecting and presenting specific items from within a large and heterogeneous collection of texts, according to users' descriptions of the subjects in which they are interested.
Some information retrieval systems index all the words appearing in all the texts; others index "keywords", which are descriptors assigned to each text by the text's author or by someone else. In both cases the user who wants to find a text does so by asking for a search on a particular word, or on logical (Boolean) combinations of words, or on words with some maximum distance (or similar relationship) between them in the texts, etc. In addition to requesting a specific word or words, most systems (e.g., LEXIS™ and DIALOG™) allow the user to search for a character string.
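The word, Boolean, and maximum-distance searches described above can be illustrated with a minimal sketch. It assumes a simple in-memory inverted index; the names `build_index`, `search_and`, `search_near`, and the sample `TEXTS` are hypothetical and are not taken from any of the systems mentioned.

```python
from collections import defaultdict

# Hypothetical sample textbase: text id -> body.
TEXTS = {
    1: "patent law for software",
    2: "software patent filing guide",
    3: "law of the sea",
    4: "patent reform and the changing face of law",
}

def build_index(texts):
    """Map each lowercased word to {text_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for text_id, body in texts.items():
        for pos, word in enumerate(body.lower().split()):
            index[word][text_id].append(pos)
    return index

def search_and(index, *words):
    """Return ids of texts containing every requested word (Boolean AND)."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()

def search_near(index, a, b, max_distance):
    """Return ids of texts where a and b occur within max_distance words."""
    hits = set()
    for text_id in search_and(index, a, b):
        if any(abs(pa - pb) <= max_distance
               for pa in index[a][text_id]
               for pb in index[b][text_id]):
            hits.add(text_id)
    return hits
```

With the sample data, `search_and(build_index(TEXTS), "patent", "law")` selects texts 1 and 4, while `search_near(build_index(TEXTS), "patent", "law", 1)` keeps only text 1, where the two words are adjacent.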
A typical search request, on traditional systems, generates a long list or a large collection of texts all of which logically satisfy the search criterion, but only a small percentage of which will actually be of use. The user is forced to expend much time and much energy winnowing (searching) through the texts found by the system, to pick out those truly relevant to his needs.
This problem originates from the fact that the user typically does not have EXACT knowledge in advance of how the subjects of interest to him will have been described.
*If his description is very specific, he will lose information: anything relevant to his needs but described in a slightly different manner will not be found by the system.
*If his description is very general, many irrelevant texts will be found also, and the winnowing process will be costly, time consuming, and tiring.
For this reason:
*On the level of office systems and personal computers, despite the proliferation of computers and the wide use of STRUCTURED databases, the use of personal and interpersonal catch-all text-based information retrieval systems is almost unknown--the bother and the overhead involved in using traditional systems are too great to make the effort worth the trouble.
*On the level of massive public databases, most bibliographic information systems have attempted to solve the problem by limiting users to a predetermined vocabulary of acceptable keywords. (Users have a reasonable chance of guessing what their subject will have been called, since both users and authors are confined to that published list of keyword possibilities.) This solution has been workable but at a price:
(1) To be an effective user of such databases one must study and develop expertise in the use of the system. They are, thus, inaccessible to untrained users, and inappropriate for casual use.
(2) Because of their rigid structure, such systems are of limited use (and indeed are little used) in dynamic environments such as would be found, for example, in the case of an unstructured corporation-wide catch-all collection of information.
Two general types of information retrieval systems and methods currently in use are as follows:
(1) In a first method, once the information retrieval systems (whatever their selection methodology) have isolated or identified the group of texts which satisfy the user's search criterion, the systems present the user with a count of the number of texts within that group, and the opportunity of sequentially reviewing the texts which are members of that group.
The user either looks through the texts themselves, one by one, or looks through sequential listings of some part of the information available about each text: that is, the user may choose to review a sequential list of the titles of the texts, or abstracts of the texts, or lists of keywords of the texts, or the initial paragraphs of the texts, or the dates and origins of the texts, or some combinations of the above. The user is then given some method of specifying (usually by number) those texts for which he wishes fuller information, printouts, etc.
An example of this type of information retrieval system is the DIALOG™ information retrieval system. DIALOG™ provides a user with the number of records (texts) satisfying the search request. The user can then request that any or all of the records be displayed and/or printed in any one of a number of formats containing differing amounts of information.
(2) A second method is generally used when the number of texts presented by an initial search is too large, or the original search criterion was too general, to make it practical for the user to look through sequential listings to pick out the texts he wants. This method is essentially an extension of the original Boolean search facility: the user can ask for additional searches to be made, and then can manipulate the additional lists of texts thus generated by requesting further lists to be created based on Boolean combinations of the preceding lists (e.g., the new list to include all the texts on list "A" and also on list "B" but to exclude any which appear on list "C", etc.).
An example of this type of information retrieval system is DIALOG™, where the user can make additional search requests, and create new lists of texts based on Boolean combinations of preceding lists.
Another example of this type of information retrieval system is the LEXIS™ system wherein a user can modify his/her search request in an effort to narrow down the number of cases (texts) developed from the initial search request.
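The Boolean combination of result lists described for the second method (on list "A" and on list "B", but not on list "C") amounts to set arithmetic on text identifiers. The following is a minimal sketch; the function name `combine` and the sample identifiers are hypothetical.

```python
def combine(list_a, list_b, list_c):
    """Return texts that are on list A and on list B, excluding any on list C."""
    return sorted((set(list_a) & set(list_b)) - set(list_c))

# Hypothetical result lists from three earlier searches.
list_a = [101, 102, 103, 104]
list_b = [102, 103, 104, 105]
list_c = [104]

new_list = combine(list_a, list_b, list_c)  # [102, 103]
```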
These methods, Boolean combinations of lists and sequential screening or printouts of the texts themselves or of some subset of the information available about each text, generally constitute the state of the art in information retrieval at this time, for the phase of the retrieval process extending between the point at which the retrieval system has identified a group of texts as being responsive to the search criterion, and the point at which the user chooses and is presented with the individual texts which he judges to be actually germane to his needs.
In addition to other advantages, the present invention solves the problems described above, by making it possible for the user to see at a glance a break-down of the types of information contained in the texts selected by his initial request. From the generated display, the user can choose the texts which are relevant to his true interest both easily and quickly.
The present invention also relates to a system and method for identifying words in a target word list which are similar to a source word, and/or for identifying phrases or sentences in a target population which are similar to a source phrase or sentence.
Computer programs are used in a number of contexts to obtain words which are "similar" to some given source word, most notably in indexing and information retrieval programs and in spelling checkers. In indexing and retrieval programs, the purpose of such a search for "similar" words is to provide a more exhaustive list of terms related to the input word, such as plurals or forms modified by prefixes or suffixes. In the case of spelling checkers, the purpose is to be able to make a suggestion as to the most likely word the user had intended, once a word is encountered which does not appear in the program's dictionary.
In the traditional and simplest solution to this problem, most often used in indexing and retrieval programs, the user specifies the exact nature of the relationship between the source word and the words being sought by means of "wild card" symbols, most typically the `?` and `*` characters. In this protocol, the user instructs the program exactly which parts of a word he is interested in matching, and in which parts other characters may appear, the question mark `?` being used to signify any individual character and the asterisk `*` any sequence of characters. Thus, by way of example, the user would ask for "law*" if he intended to find words like "laws", "lawyer" or "lawless". Or a search for "analy?e" could be used to locate both the American (with a `z`) and the British (with an `s`) spellings of the word.
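The wild card protocol above can be tried out with Python's standard `fnmatch` module, whose `?` and `*` have the same meaning as described here. This is only an illustration of the protocol; the helper name `match_wildcard` and the word list are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical index vocabulary.
WORDS = ["law", "laws", "lawyer", "lawless", "flaw",
         "analyze", "analyse", "analysis"]

def match_wildcard(pattern, words):
    """Return the words matching a pattern in which ? is any single
    character and * is any sequence of characters (possibly empty)."""
    return [w for w in words if fnmatchcase(w, pattern)]

law_words = match_wildcard("law*", WORDS)
# ['law', 'laws', 'lawyer', 'lawless']
spellings = match_wildcard("analy?e", WORDS)
# ['analyze', 'analyse']
```

Note that "flaw" is not found by "law*": the user must anticipate exactly where the extra characters may appear, which is the limitation the passage goes on to discuss.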
In the case of spelling checkers, a more flexible approach is needed, since the user does not usually know that he has made a spelling mistake, nor does he know in advance the relationship between the way he thinks a word is spelled and the way it is spelled in fact. Most typically, spelling checkers locate "similar" words by first restricting the search to words beginning with the same letter as the misspelled word and then using a list of common spelling and typographical errors to find words which differ from the source word only by these errors.
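A minimal sketch of this typical spelling-checker approach follows. The error list `COMMON_ERRORS`, the function name `suggest`, and the sample dictionary are assumptions made for illustration; real spelling checkers use much larger error tables.

```python
# Hypothetical table of common confusions: (what was typed, what was meant).
COMMON_ERRORS = [("ie", "ei"), ("ei", "ie"), ("a", "e"),
                 ("e", "a"), ("s", "c"), ("c", "s")]

def suggest(misspelled, dictionary):
    """Suggest dictionary words that share the first letter of the
    misspelled word and differ from it by one listed error."""
    candidates = {w for w in dictionary if w[:1] == misspelled[:1]}
    found = set()
    for wrong, right in COMMON_ERRORS:
        start = misspelled.find(wrong)
        while start != -1:
            # Undo the suspected error at this position only.
            variant = misspelled[:start] + right + misspelled[start + len(wrong):]
            if variant in candidates:
                found.add(variant)
            start = misspelled.find(wrong, start + 1)
    return sorted(found)

DICTIONARY = {"receive", "relieve", "deceive", "separate", "desperate"}
```

Here `suggest("recieve", DICTIONARY)` returns `['receive']` and `suggest("seperate", DICTIONARY)` returns `['separate']`. A mistake in the first letter, or an error missing from the table, defeats the search.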
An alternate approach used by spelling checkers is to convert the word to an approximate phonetic form, and then search a dictionary of such phonetic words, on the assumption that the user typically has a clearer idea of how the word sounds than of how it is spelled. This last approach is usually quite effective at finding spelling errors, though it suffers from the drawback of being unable to deal with typographical mistakes. This technique is therefore quite commonly combined with elements of the previously mentioned approach, in order to obtain a more comprehensive list of possible words.
Some information retrieval programs use the phonetic approach also: along with a regular index of words (or of keywords) in their textbase, they create a parallel index in which those same words are represented phonetically. Search requests are then converted to phonetic format and the attempt is made to locate the search words' phonetic translation in the phonetic index. An example of this is the COMPUMARK™ system which is used in searching for trademarks.
Regarding "similar sentences", the state-of-the-art is more simply described. There are complex systems which actually parse sentences into their component parts of speech and analyze the semantic relationships among those parts; however, the applicant is not aware of any retrieval systems in which the sequencing of the words in a search request (as distinguished from the identity of the search-request words and the specified logical relationships among them) is used to influence the choice of the texts to be retrieved, or the ordering or ranking of the texts once they are found.
SUMMARY OF THE INVENTION
The analyzing and summarizing aspect of the invention makes explicit the inherent relationships among a group of texts with associated keyword descriptions, by analyzing the keywords held in common by subgroups of texts within the overall group. The invention comes into play once a group of texts has been selected using standard search methodology -- at the point at which the user would either have to make further guesses as to how to narrow down his search criterion, or would be presented with a sequence of texts that would then have to be "winnowed through."
The invention is a system and method of analyzing and of presenting the informational content of this group of texts, as a group. The user sees presented on a display medium (screen) the equivalent of an annotated "TABLE OF CONTENTS," organized as a standard outline or in some similarly graphic format, analyzing that group of texts into major subject areas, subcategories, sub-sub-categories, etc. Each "TABLE OF CONTENTS" outline is dynamically generated in response to specific search requests, and constitutes a kind of "bird's-eye view" of the contents of the textbase in that subject area at that time. For the user looking for a specific kind of information of which he has only a general description (the typical case), a glance at the table of contents, a matter of seconds, usually suffices to eliminate from consideration most of the irrelevant material. Relevant sub-categories usually are immediately evident. If necessary, the user can pick out for further analysis (in one implementation just by moving the cursor on the screen) a much-reduced group of texts (one of the categories presented to him in the table of contents) and repeat the analysis process, creating another table of contents, this time of the sub-category. One or two iterations will usually suffice, even when starting with a group of hundreds of texts, to get to a table of contents in which most descriptions will be of individual specific texts rather than of groups of texts. With an appropriate command, e.g., by moving a cursor and pressing a key, the user chooses the specific text or texts he wants to see according to the descriptions he sees on the table of contents, and with a further command, e.g., a keystroke, brings those texts to the screen or sends them to be printed.
It is anticipated that this technique will lead to an extension of the use of information retrieval into areas where it had not been convenient or practical to use it before, some examples being as follows:
(a) Specific information can be located much more rapidly than had been possible using prior technology. Experience so far has shown that the time needed for finding specific information in a large textbase is on the order of 10% of what it normally would take, and the process is far more agreeable.
(b) It is practical to search for information whenever the user knows some general characteristics of what he's looking for, even though he has no idea how that information may have been specifically described in the textbase.
(c) It is practical to maintain and use large heterogeneous collections of textual information, and to do searches to find specific elements of that information, without limiting authors to a predetermined lexicon of "keywords." This means that users (e.g., in a corporate environment) can choose keywords spontaneously and still succeed in finding relevant information among each others' entries.
(d) "Browsing" the textbase becomes a pleasant and meaningful operation, quite different from paging through texts or reading the thesaurus, which are the only "browsing" techniques available in traditional technology.
In one implementation of the invention, the information retrieval system is coupled with a word-processor, for convenience in entering texts into the textbase, and with an output screen presenting the results of the above analysis in traditional outline format. The output screen shows the categories and subcategories of subjects found to be included in the texts selected as a result of the original search request, to any desired level of detail. The user moves the screen cursor to point to a category on the outline for which he wants a more detailed break-down, and the process continues until individual texts are being referenced on the outline. Then by pressing a key the user can direct the system to send the chosen text to the printer or bring it to the screen.
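To make the outline idea concrete, the following is a minimal sketch of one way a "table of contents" could be derived from keywords held in common by subgroups of texts. The grouping rule used here (repeatedly pick the most frequent remaining keyword, alphabetical on ties) is an assumption made for illustration only; this section does not specify the actual analysis. The function name `outline` and the sample texts are hypothetical.

```python
from collections import Counter

def outline(texts, used=frozenset(), depth=0):
    """Return outline lines for texts, a dict of title -> set of keywords.

    Assumed rule: the most frequent keyword not yet used becomes a
    heading, the texts carrying it are analyzed recursively beneath it,
    and texts sharing no further keyword are listed individually.
    """
    lines = []
    indent = "  " * depth
    remaining = dict(texts)
    while remaining:
        counts = Counter(k for kws in remaining.values() for k in kws - used)
        keyword, n = max(sorted(counts.items()), key=lambda kv: kv[1],
                         default=(None, 0))
        if n < 2:  # no keyword is shared any more: list individual texts
            for title in sorted(remaining):
                rest = ", ".join(sorted(remaining[title] - used))
                lines.append(f"{indent}- {title}: {rest}")
            break
        group = {t: kws for t, kws in remaining.items() if keyword in kws}
        lines.append(f"{indent}{keyword} ({n} texts)")
        lines.extend(outline(group, used | {keyword}, depth + 1))
        for title in group:
            del remaining[title]
    return lines

# Hypothetical texts selected by an initial search, with their keywords.
SELECTED = {
    "Memo 1": {"contracts", "leasing", "vehicles"},
    "Memo 2": {"contracts", "leasing", "offices"},
    "Memo 3": {"contracts", "employment"},
    "Memo 4": {"patents", "software"},
}
```

Calling `print("\n".join(outline(SELECTED)))` shows a heading "contracts (3 texts)" with a sub-heading "leasing (2 texts)" beneath it, and the individual memos listed under the category they belong to. Passing one category's texts back into `outline` corresponds to the user pointing at that category to request a more detailed break-down.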
The significance of the invention is that the amount of time needed for the user to isolate texts of interest to him, from among groups of texts which satisfy his initial search request but are in fact irrelevant to him, is reduced by a large factor. The information retrieval process is made more convenient thereby; it is practical using this system to find specific texts with only the most minimal initial information as to how they may have been keyworded; and various practical constraints which have restricted the ways in which textbases needed to be organized in order to guarantee that stored information could be found again, can be relaxed.
The invention also relates to a process which enables the computer to locate "similar" words in a manner more flexible and more exhaustive than any currently used technique, so far as we know. In particular, the invention does not require any specification by the user as to the relationship between the input word and the target words, nor does it rely on phonetic translation or any restrictive list of typical mistakes. The invention rather makes use of the actual structure of the word itself, and searches for words which have a similar structure or which include a similar structure as part of a larger structure. The invention is therefore able to locate a far more comprehensive list of "similar" words than is the case with other techniques.
The process analyzes the structure of the input word in terms of groups of letters, starting with letter pairs and working up to larger groups, and accords to any word in the target dictionary which contains these letter groupings a number of points determined by the size of the group and/or its location in the word. Words which are given a large number of points by the process are then presented to the user, in descending order of the number of points allocated, for his selection.
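The letter-group scoring described above can be sketched as follows. The point values are assumptions chosen for illustration (one point per letter in a shared group, plus one bonus point when the group begins both words); the summary does not fix the actual weights. The names `letter_groups`, `score`, and `similar_words` are hypothetical.

```python
def letter_groups(word, min_size=2):
    """Return all contiguous letter groups of the word, from pairs upward."""
    return {word[i:i + n]
            for n in range(min_size, len(word) + 1)
            for i in range(len(word) - n + 1)}

def score(source, target):
    """Award points to target for each letter group of source it contains.

    Assumed weights: group size in points, plus 1 if the group starts
    both words (a simple location bonus).
    """
    points = 0
    for group in letter_groups(source):
        if group in target:
            points += len(group)
            if source.startswith(group) and target.startswith(group):
                points += 1
    return points

def similar_words(source, dictionary, limit=5):
    """Return (points, word) pairs in descending order of points."""
    scored = [(score(source, w), w) for w in dictionary]
    scored = [(p, w) for p, w in scored if p > 0]
    scored.sort(key=lambda pw: (-pw[0], pw[1]))
    return scored[:limit]

ranked = similar_words("law", ["flaw", "lawyer", "wall"])
# [(9, 'lawyer'), (7, 'flaw')]
```

Unlike the wild card request "law*", this ranking finds "flaw" as well as "lawyer" without the user saying where extra letters may appear, while "wall", which shares the same letters but no letter group, scores nothing.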
In the case of searching for similar sentences rather than similar words, the technique is identical, except that groups of words, rather than groups of letters, are compared. One field of application for this invention is in information retrieval systems, where the user presents his search request in the form of a phrase or sentence, and texts are selected from the database and/or prioritized according to the scores achieved when either their descriptions (keywords, title, abstract) or the texts themselves are evaluated according to this method. Since in information retrieval systems the typical search request finds many texts which are, in fact, irrelevant to the user, the invention, when employed to automatically winnow and/or prioritize texts, can save time and trouble for users of the system.
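The same idea applied to sentences can be sketched by comparing contiguous groups of words instead of groups of letters. As with the word sketch, the weighting (one point per word in a shared group) is an assumption, and the names `word_groups`, `sentence_score`, `rank_texts`, and the sample descriptions are hypothetical.

```python
def word_groups(words, min_size=2):
    """Return all contiguous word groups, from pairs upward, as tuples."""
    return {tuple(words[i:i + n])
            for n in range(min_size, len(words) + 1)
            for i in range(len(words) - n + 1)}

def sentence_score(query, candidate):
    """Score a candidate by the word groups it shares with the query.

    Assumed weight: each shared group is worth its length in words.
    """
    query_groups = word_groups(query.lower().split())
    candidate_groups = word_groups(candidate.lower().split())
    return sum(len(g) for g in query_groups if g in candidate_groups)

def rank_texts(query, descriptions):
    """Return (score, title) pairs for scoring texts, best first."""
    scored = [(sentence_score(query, d), t) for t, d in descriptions.items()]
    return sorted([(s, t) for s, t in scored if s > 0],
                  key=lambda st: (-st[0], st[1]))

# Hypothetical text descriptions (e.g., titles or abstracts).
DESCRIPTIONS = {
    "A": "an information retrieval system for legal texts",
    "B": "retrieval of system information",
    "C": "a database system for information retrieval",
}

ranking = rank_texts("information retrieval system", DESCRIPTIONS)
# [(7, 'A'), (2, 'C')]
```

Text "B" contains all three query words yet scores nothing, because none of the query's word sequences appear in it. This illustrates how the sequencing of the words in the request, and not only their identity, can influence which texts are chosen and how they are ranked.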
These and various other advantages and features of novelty which characterize the invention are pointed out with particularity in the claims annexed hereto and forming a part hereof. However, for a better understanding of the invention, its advantages, and objects obtained by its use, reference should be made to the drawings which form a further part hereof, and to the accompanying descriptive matter, in which there is illustrated and described a preferred embodiment of the invention.