This invention relates to computerized information retrieval devices or systems for text indexing and retrieval, to databases for use in such information retrieval devices, and to methods for making such databases.
All natural languages allow for common elements of meaning to be systematically represented by words that appear in different forms in free text. For example, in English, the common meaning of "arrive" is carried by the inflectional variants "arrived", "arrives", and "arriving" (as well as "arrive" itself), and by the derivational variant "arrival". The base word indicating the common element of meanings for all such variants is often called the stem, and the morphological analysis process of determining the stem from a variant form is often called stemming. The process of going the other way, from a stem to all its variant forms, is often called synthesis or generation.
Stemming can play an important role in full-text indexing and retrieval of natural language documents. Users who are primarily interested in indexing and retrieving text passages from documents according to their meanings may not want the variants of a common stem to be distinguished. Thus, if the user enters a query with the word "arriving", this can be treated as matching text passages that contain any of the words "arrive", "arrives", etc. This would have the important effect of improving recall without sacrificing precision.
However, stemming in the context of text indexing and retrieval has proven difficult to implement well, even for morphologically simple languages like English. Conventional techniques for English (e.g., "Development of a Stemming Algorithm" by J. B. Lovins, Mechanical Translation and Computational Linguistics 11, pp. 22-31, March 1968; "An Algorithm for Suffix Stripping" by M. F. Porter, Program 14, No. 3, pp. 130-137, July 1980) use "tail-cropping" algorithms to map words into canonical forms such as stems. Thus, rules or algorithms are written that strip off any of the systematically varying suffix letters to map every word to the longest prefix common to all variants. All the forms of "arrive", for example, would be stripped back to the string "arriv" (without an "e"), since this is the longest prefix that all the forms have in common (because the "e" does not appear in either "arriving" or "arrival"). Without special mechanisms to deal with exceptions, this strategy would also convert all the forms of "swim" back to "sw", since that is the longest invariant prefix.
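By way of illustration only, the "longest common prefix" target of such tail-cropping rules can be sketched in a few lines of Python. This is not the Lovins or Porter algorithm; the function name `longest_common_prefix` and the two word lists are assumptions chosen to reproduce the "arriv" and "sw" examples above.

```python
def longest_common_prefix(words):
    """Return the longest string that every word in `words` starts with."""
    if not words:
        return ""
    shortest = min(words, key=len)
    for i, ch in enumerate(shortest):
        # Stop at the first position where any variant disagrees.
        if any(w[i] != ch for w in words):
            return shortest[:i]
    return shortest


# Hypothetical variant lists for the two stems discussed in the text.
arrive_forms = ["arrive", "arrived", "arrives", "arriving", "arrival"]
swim_forms = ["swim", "swims", "swimming", "swam", "swum"]

print(longest_common_prefix(arrive_forms))  # arriv
print(longest_common_prefix(swim_forms))    # sw
```

Neither result is an English word, and the second is too short to distinguish "swim" from unrelated words such as "sweep" or "switch".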
This conventional strategy has several disadvantages. First, the resulting stem strings are frequently not words of English (e.g., "sw", "arriv"). They cannot be presented to a naive user who might want to confirm that his query is being interpreted in a sensible way.
Second, this approach requires special mechanisms to deal with irregular inflection and derivation. Typically, an exception dictionary is provided for the most obvious and common cases (e.g., hear→heard, good→better), but the number of entries in this dictionary is usually restricted so that the resulting data structures do not become too large. Accuracy suffers without a complete treatment of exceptional behavior, to the point where some researchers have concluded that stemming cannot significantly improve recall without substantially reducing precision.
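The interaction between suffix rules and a restricted exception dictionary can be sketched as follows. This is a toy illustration under stated assumptions, not any published stemmer: the suffix list `SUFFIXES`, the two-entry `EXCEPTIONS` table, and the function `crop` are all hypothetical.

```python
# Hypothetical suffix rules, tried in order; not taken from Lovins or Porter.
SUFFIXES = ("ing", "al", "ed", "es", "e", "s")

# Deliberately small exception dictionary covering only the cases named above.
EXCEPTIONS = {"heard": "hear", "better": "good"}


def crop(word):
    """Map a word to a canonical form: exceptions first, then suffix stripping."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


print(crop("arriving"))  # arriv
print(crop("heard"))     # hear
print(crop("swam"))      # swam
print(crop("swim"))      # swim
```

Because "swam" is absent from the restricted dictionary, it is never related to "swim", so a query on one form misses documents containing the other.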
Third, this is a one-way approach. It is good for mapping from words to stems, but cannot generate all the variant forms from a given stem. It provides appropriate search behavior only if the stemming algorithm can be applied not only to the query, but also to the document texts themselves, either in advance of search in order to build an index or else on the fly so that the query stem can be matched against the stemmed text words. Thus, it is limited in its applicability to situations where the document database can be preprocessed for indexing (which would have to be redone whenever improvements are made to the stemmer) or where the time penalty for stemming on the fly is not prohibitive.
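The dependence on preprocessing can be illustrated with a minimal sketch of an index keyed by stems. The names `toy_stem`, `build_index`, and `search`, the suffix list, and the three sample documents are assumptions made for illustration; `toy_stem` is only a stand-in for a tail-cropping stemmer.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a tail-cropping stemmer."""
    for suffix in ("ing", "al", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def build_index(docs, stem):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query_word, stem):
    """Stem the query with the same function used to build the index."""
    return index.get(stem(query_word.lower()), set())


docs = {
    1: "the train arrives at noon",
    2: "her arrival was delayed",
    3: "they swam home",
}
index = build_index(docs, toy_stem)

print(sorted(search(index, "arriving", toy_stem)))  # [1, 2]
print(sorted(search(index, "swim", toy_stem)))      # []
```

Since the index keys are the stemmer's output strings, any change to `toy_stem` invalidates the stored keys and the index must be rebuilt from the document texts.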
Finally, this technique is not linguistically general. Entirely different algorithms would have to be written for each natural language, and even the general strategy of such algorithms must change to handle the properties of prefixing and infixing languages.