As the amount of information available electronically increases, there is a corresponding need to improve the way in which users are able to locate information of interest. Various information retrieval systems, including a number of search engines, enable users to search large collections of documents, pages, files, and other such groupings of content. Such systems can enable the searching of documents on a variety of sources, such as a user's own computer, a Web site, a data repository hosted across a network, or any other such source. Such systems typically create an index of the documents by determining words or phrases contained in those documents. When a search query is subsequently received, the system can compare the keywords in the query to the words in the index to find matching documents. For example, a user might submit a query to an electronic marketplace when performing a product search, and keywords in that query can be compared to an index including information for products offered through that electronic marketplace.
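As a rough illustration of the index-and-compare process described above, the following minimal sketch builds an inverted index over a small, made-up product catalog and matches query keywords against it exactly. The function names (`tokenize`, `build_index`, `search`) and the catalog contents are hypothetical and are not drawn from any particular system.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def build_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every keyword in the query."""
    keywords = tokenize(query)
    if not keywords:
        return set()
    result = None
    for keyword in keywords:
        ids = set(index.get(keyword, set()))  # copy so callers cannot mutate the index
        result = ids if result is None else result & ids
    return result

# Hypothetical catalog, for illustration only.
catalog = {
    "p1": "Laptop computer with 16 GB memory",
    "p2": "Desktop computers for the office",
    "p3": "Wireless mouse",
}
index = build_index(catalog)
```

With exact matching of this kind, `search(index, "computer")` finds only `p1` and `search(index, "computers")` finds only `p2`, because the two word forms are stored as unrelated index entries.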
In order for such a process to provide accurate results, the keywords in a query must be accurately matched with the corresponding words in the index. Such matching is not straightforward, however, as many words have different forms, and it is often the case that the user will be interested in receiving results that contain other forms of a word. For example, a user searching for the word “computers” is likely also interested in documents that contain the word “computer.” As these words are not exactly the same in form, there will not be an exact match. To address the variation in word forms, conventional search engines utilize a module referred to generally as a “stemmer,” which replaces an inflected or surface form of a word with its root form (since most languages use inflected forms of words to indicate grammatical properties such as case or tense). Stemming generally is applied both at index time, to transform the words in the documents into their respective stemmed forms, and at query time, to transform the words in a received query. Using such an approach, words in queries will generally share a canonical form with the words in the index. Stemming is especially important in highly inflected languages such as German, which has more noun cases than English and also inflects adjectives.
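The following minimal sketch illustrates why the same stemmer is applied at both index time and query time. The stemmer shown (`toy_stem`) is a hypothetical, deliberately naive one that only strips a trailing plural "s"; it stands in for whatever stemmer a real system would use, and the document contents are made up.

```python
def toy_stem(word):
    """Hypothetical, deliberately naive stemmer: strip a trailing plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def build_stemmed_index(documents):
    """Index time: store each word under its stemmed form."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index

def search_stemmed(index, query):
    """Query time: stem the keywords with the same function, then intersect."""
    stems = [toy_stem(word) for word in query.split()]
    if not stems:
        return set()
    matches = [index.get(stem, set()) for stem in stems]
    return set.intersection(*matches)

# Hypothetical documents, for illustration only.
docs = {
    "d1": "fast computer",
    "d2": "refurbished computers",
}
stemmed_index = build_stemmed_index(docs)
```

Because both "computer" and "computers" reduce to the same canonical form, a query for either word returns both `d1` and `d2`. If the stemmer were applied on only one side, the stemmed and unstemmed forms would no longer line up.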
In many instances, the dictionary of words that must be analyzed is ever increasing. For example, an electronic marketplace that is continually offering new products will have a continually increasing number of words that need to be indexed, such as may correspond to new product names and manufacturers. Conventional stemmers do not adequately handle this ever-increasing vocabulary. For example, certain stemmers (e.g., heuristic stemmers) apply a series of rules indicating how words should be transformed. Such an approach is not optimal, however, as the interactions between these rules and their various exceptions become increasingly complicated, particularly as additional exceptions are handled and new sets of rules are created. Other stemmers (e.g., table-driven stemmers) utilize a type of lookup table that maps inflected forms to root forms for all known words, but such an approach requires an exhaustive and authoritative list of words and their inflected forms as input. As the number of words increases, there is a corresponding need to update each table with the appropriate new vocabulary.
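The contrast between the two conventional approaches can be sketched as follows. The rules, exceptions, and table entries are hypothetical and intentionally tiny; they are meant only to show where each approach tends to break down as vocabulary grows, not to reproduce any particular stemmer.

```python
import re

# Hypothetical ordered suffix rules for a heuristic stemmer; the first match wins.
RULES = [
    (re.compile(r"ies$"), "y"),      # batteries -> battery
    (re.compile(r"sses$"), "ss"),    # glasses   -> glass
    (re.compile(r"(?<!s)s$"), ""),   # widgets   -> widget
]

# Hand-maintained exceptions that must grow as the rules misfire.
EXCEPTIONS = {"series": "series", "lens": "lens"}

def heuristic_stem(word):
    """Rule-based stemming: check exceptions first, then apply the first matching rule."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

# Hypothetical lookup table for a table-driven stemmer; covers only known words.
TABLE = {"computers": "computer", "mice": "mouse", "batteries": "battery"}

def table_stem(word):
    """Table-driven stemming: unknown words pass through unchanged."""
    word = word.lower()
    return TABLE.get(word, word)
```

In this sketch, the heuristic stemmer handles an unseen regular plural such as "widgets" but mangles a name such as "atlas" into "atla" until another exception is added, and it cannot relate "mice" to "mouse" at all. The table-driven stemmer maps "mice" correctly but leaves "widgets" untouched until someone adds it to the table.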