The invention relates to identifying one of a number of groups of words.
U.S. Pat. No. 5,551,049 discloses a technique in which information about one of a number of groups of words is determined by matching an identifier of a received word. The identifier is compared with word identifiers grouped in sequence to represent synonym groups. When a match is found, the group of identifiers that includes the matching identifier is used to obtain the synonym group that includes the received word.
Frakes, W. B., "Stemming Algorithms", in Frakes, W. B., and Baeza-Yates, R., Eds., Information Retrieval, Prentice Hall, 1992, pp. 131-160, discloses a taxonomy for stemming algorithms that includes several automatic approaches. Affix removal algorithms remove suffixes and/or prefixes to obtain a stem, while table lookup methods perform lookup in a table in which terms and their corresponding stems are stored. Affix removal algorithms such as the Porter stemmer can, after removing characters according to a set of replacement rules, perform recoding, a context sensitive transformation, to change characters of the stem.
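The difference between the two approaches can be illustrated with a minimal sketch. The table entries, the rule list, and the function names below are hypothetical and chosen only for illustration; in particular, the rules are not the Porter stemmer's rule set, and no recoding step is shown.

```python
# Hypothetical stored table of terms and their stems (table lookup method).
STEM_TABLE = {"engineering": "engineer", "engineered": "engineer"}

# Hypothetical (suffix, replacement) rules, tried in order, longest first.
# These are illustrative only and are NOT the Porter stemmer's rules.
SUFFIX_RULES = [("ational", "ate"), ("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def table_lookup_stem(term):
    """Return the stored stem, or None if the term is not in the table."""
    return STEM_TABLE.get(term)


def affix_removal_stem(term):
    """Apply the first matching suffix rule; return the term unchanged if none applies."""
    for suffix, replacement in SUFFIX_RULES:
        if term.endswith(suffix) and len(term) > len(suffix):
            return term[: -len(suffix)] + replacement
    return term
```

In this sketch, table lookup returns nothing for a term that was never stored, whereas affix removal always returns some string, whether or not that string is a useful stem.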
Hull, D. A., "Stemming Algorithms: A Case Study for Detailed Evaluation", Journal of the American Society for Information Science, Vol. 47, No. 1, 1996, pp. 70-84, discloses a lexical database that can analyze and generate inflectional and derivational morphology. The inflectional database reduces each surface word to a dictionary form, while the derivational database reduces surface forms to stems that relate to the original in both form and semantics. The databases are constructed using finite state transducers (FSTs), which allows the conflation process to act in reverse, generating all conceivable surface forms from a single base form.
Salton, G., and McGill, M. J., Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983, pp. 75-84, disclose information retrieval techniques that use a thesaurus. A thesaurus can be used, for example, in an automatic indexing environment, and when a document contains a term such as "superconductivity" (or stem "superconduct"), that term may be replaced by a class identifier for a class of words with related meanings. The same operation can be used for a user query containing a word in the class. Should the document contain "superconductivity" while the query term is "cryogenic", a term match would result through thesaurus transformation. Rather than replacing an initial term with the corresponding class identifier, the thesaurus class identifier can be added to the original term.
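The thesaurus transformation described above can be sketched as follows. The class identifier "C415", the thesaurus contents, and the function name are assumptions made up for this illustration; they do not come from the cited reference.

```python
# Hypothetical thesaurus mapping terms (or stems) to a class identifier.
# The identifier "C415" is invented for illustration.
THESAURUS = {
    "superconductivity": "C415",
    "superconduct": "C415",
    "cryogenic": "C415",
}


def transform(terms, replace=True):
    """Map terms through the thesaurus.

    With replace=True a term in a class is replaced by its class identifier;
    with replace=False the class identifier is added alongside the term.
    Terms not in the thesaurus are kept unchanged.
    """
    out = set()
    for term in terms:
        class_id = THESAURUS.get(term)
        if class_id is None:
            out.add(term)
        elif replace:
            out.add(class_id)
        else:
            out.update((term, class_id))
    return out


# A document term and a different query term match through the shared class.
document_terms = transform(["superconductivity"])
query_terms = transform(["cryogenic"])
matches = document_terms & query_terms
```

Under these assumptions, `matches` contains the shared class identifier even though the document and the query use different words.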
The invention addresses basic problems that arise in using a word, referred to herein as a "query word", to identify one of a number of groups of words. With conventional table lookup techniques, FST techniques, and other techniques that rely on matching the query word, an unknown query word will result in a failure. Affix algorithms like the Porter stemmer can stem any word, including an unknown query word, but conventional information retrieval techniques based on affix algorithms do not provide a fallback strategy if the first obtained stem does not relate the unknown word to any other word.
The invention is based on the discovery of a new technique for using a query word to identify one of a number of word groups. The technique first determines whether the query word is in any of the groups. If not, the technique attempts to modify the query word in accordance with successive suffix relationships in a sequence until it obtains a modified query word that is in one of the groups. The new technique in effect provides two stages of word group lookup: the first stage can be implemented by rapidly comparing a query word with the words in the groups, while the second stage is slower, because it includes attempting to obtain modified query words in accordance with suffix relationships.
The new technique can be implemented with an ordered list defining the sequence of suffix relationships, such as pairwise relationships. An attempt can be made to modify the query word in accordance with each relationship in the list until a modified query word is obtained that is in one of the groups. The relationships in the list can be ordered in accordance with their frequency of occurrence in a natural language, and modifications can be attempted beginning with the highest frequency suffix relationship. The ordered list can be automatically obtained as part of an automatic technique for producing the word groups.
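One way such an ordered list might be derived is sketched below. This is only an assumption for illustration: it counts pairwise suffix relationships between words that share a group and uses those counts as a stand-in for frequency of occurrence in the language. The function names and the counting scheme are hypothetical and are not taken from this description.

```python
from collections import Counter
from itertools import permutations
import os


def suffix_pair(a, b):
    """Strip the longest common prefix and return the two remaining suffixes."""
    prefix_len = len(os.path.commonprefix([a, b]))
    return a[prefix_len:], b[prefix_len:]


def ordered_suffix_relationships(groups):
    """Count directed suffix pairs within each group and return them,
    highest count first (an assumed proxy for language frequency)."""
    counts = Counter()
    for group in groups:
        for a, b in permutations(group, 2):
            counts[suffix_pair(a, b)] += 1
    return [pair for pair, _ in counts.most_common()]


example_groups = [["walk", "walks", "walked"], ["talk", "talks"], ["jump", "jumps"]]
ordered = ordered_suffix_relationships(example_groups)
```

With these example groups, the relationships between the empty suffix and "s" occur most often and therefore appear at the head of the list.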
The new technique can be implemented to attempt modifications of the query word iteratively, with each iteration attempting to modify the query word in accordance with a respective suffix relationship. If an iteration obtains a modified query word, it also determines whether the modified query word is in any of the word groups.
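The two-stage lookup with iterative modification can be sketched as follows. The word groups, the suffix relationship list and its ordering, and the function name are hypothetical. The sketch also assumes that each relationship is applied in one direction only and that a relationship applies only when the query word ends with its first suffix.

```python
# Hypothetical word groups: each known word maps to a group identifier.
WORD_GROUPS = {
    "compute": 0, "computes": 0, "computed": 0, "computation": 0,
    "walk": 1, "walks": 1, "walked": 1,
}

# Hypothetical ordered sequence of suffix relationships (old suffix, new suffix),
# assumed to be sorted from most to least frequent.
SUFFIX_RELATIONSHIPS = [("s", ""), ("ing", ""), ("ing", "e"), ("ed", ""), ("ation", "e")]


def identify_group(query_word):
    """Return the group identifier for query_word, or None if none is found."""
    # Stage 1: fast direct comparison with the words in the groups.
    if query_word in WORD_GROUPS:
        return WORD_GROUPS[query_word]
    # Stage 2: one iteration per suffix relationship, in sequence.
    for old, new in SUFFIX_RELATIONSHIPS:
        if not query_word.endswith(old) or len(query_word) <= len(old):
            continue  # this relationship yields no modified query word
        modified = query_word[: -len(old)] + new
        if modified in WORD_GROUPS:
            return WORD_GROUPS[modified]
    return None
```

For the unknown word "computing", the first applicable relationship yields "comput", which is in no group, so the loop continues to the next relationship and obtains "compute". This is the fallback behavior that a single-pass stemmer lacks.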
When a modified query word is obtained that is in one of the word groups, information identifying the word group can be provided, such as a representative of the group or a list of words in the group.
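The two forms of output mentioned above can be sketched as follows. The group contents and the rule for choosing a representative (shortest member, with ties broken alphabetically) are assumptions made for illustration only.

```python
# Hypothetical word groups keyed by group identifier.
GROUPS = {
    0: ["computes", "compute", "computation", "computed"],
    1: ["walked", "walk", "walks"],
}


def group_representative(group_id):
    """Return one word standing for the group.

    Assumption: the shortest member is used, ties broken alphabetically.
    """
    return min(GROUPS[group_id], key=lambda word: (len(word), word))


def group_words(group_id):
    """Return the list of words in the group, sorted for stable output."""
    return sorted(GROUPS[group_id])
```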
The new technique can further be implemented in a system that includes a query word and a processor that determines whether the query word is in any of the word groups. If not, the processor attempts to modify the query word in accordance with successive suffix relationships in a sequence until a modified query word is obtained that is in one of the word groups. The system can also include stored word group data indicating the word groups, and the word group data can be an FST data structure. The system can also include stored suffix relationship sequence data indicating the sequence of suffix relationships.
The new technique can also be implemented in an article of manufacture for use in a system that includes a storage medium access device. The article can include a storage medium and instruction data stored by the storage medium. The system's processor, in executing the instructions indicated by the instruction data, determines whether the query word is in any of the word groups. If not, the processor attempts to modify the query word in accordance with successive suffix relationships in a sequence until a modified query word is obtained that is in one of the word groups.
The new technique can also be implemented in a method of operating a first machine to transfer data to a second machine over a network, with the transferred data including instruction data as described above.
The new technique is advantageous because it allows more robust word group identification than conventional techniques. In comparison with conventional word matching techniques, the new technique is advantageous because it allows the use of an unknown word or a word that does not exactly match, which can arise, for example, when a user provides a query word that is not in any of the word groups. In comparison with conventional affix removal techniques, the new technique can continue to obtain modified query words until it finds one that is in one of the word groups, so that it does not fail if the first modified query word is not in any of the word groups.
The new technique is also advantageous because it allows for fully automatic implementation. One fully automatic implementation automatically obtains an ordered list of suffix relationships in automatically producing a word group data structure.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.