Being able to find documents by entering one or more words related to their content is a feature that is readily available with search engines and in many applications today. Search engines typically work by comparing query terms received in a search query with an index of stored terms from documents in the data store.
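As a rough illustration of the term matching described above, the following minimal sketch builds an index of stored terms and compares query terms against it. The function names `build_index` and `search`, the whitespace tokenization, and the requirement that every query term match are assumptions made for this example and are not taken from any particular search engine.

```python
from collections import defaultdict

PUNCTUATION = '.,;:!?"'

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip surrounding punctuation."""
    tokens = (raw.strip(PUNCTUATION) for raw in text.lower().split())
    return [token for token in tokens if token]

def build_index(documents):
    """Map each stored term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents containing every query term (exact match only)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

documents = {
    1: "The quickest foxes could easily jump over lazy dogs.",
    2: "Lazy dogs sleep all day.",
}
index = build_index(documents)
```

In this sketch, a query for "dogs" matches both documents, while a query for "dog" matches neither, because only the exact surface forms are stored.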
More often than not, these engines or applications can determine the linguistic roots (“stems”) of searched terms and match them against the stems of indexed terms. One of the challenges for these applications is the extra storage and processing time required to manage the stems and to map them back to the original terms as they actually appear in documents.
The typical approach to handling stems is to inject them into the text being indexed at the same logical position as the associated term. For example, if the text being indexed is “The quickest foxes could easily jump over lazy dogs,” stems may be injected as follows: “The quickest<quick> foxes<fox> could<can> easily<easy> jump over lazy dogs<dog>.” (The terms within the angle brackets represent the stems of the terms immediately preceding them.) It is easy to see that the same stem could be injected many times when indexing a single document, even within a single paragraph or sentence. Each of these instances of the stem has to be indexed and stored, adding to the storage space that the index occupies and to the processing time that the indexing itself requires.
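The injection approach described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `STEMS` lookup table is a hypothetical stand-in for a real stemmer, the function name `index_with_injected_stems` is invented for this example, and stems are stored under bracketed keys such as `<fox>` only to keep them distinct from surface terms.

```python
from collections import defaultdict

# Hypothetical stem table covering only the example sentence;
# a real system would derive stems with a stemmer or lemmatizer.
STEMS = {
    "quickest": "quick",
    "foxes": "fox",
    "could": "can",
    "easily": "easy",
    "dogs": "dog",
}

def index_with_injected_stems(text):
    """Build a positional index, injecting each stem at its term's position."""
    postings = defaultdict(list)
    for position, raw in enumerate(text.split()):
        term = raw.strip('.,;:!?"').lower()
        if not term:
            continue
        postings[term].append(position)
        stem = STEMS.get(term)
        if stem is not None:
            # The stem shares the logical position of the term it came from.
            postings["<" + stem + ">"].append(position)
    return postings

text = "The quickest foxes could easily jump over lazy dogs."
postings = index_with_injected_stems(text)
total_postings = sum(len(positions) for positions in postings.values())
```

For the nine-word example sentence, this sketch stores fourteen postings: nine for the terms and five for the injected stems. A text in which a stemmed term recurs adds one further stem posting for every occurrence, which is the storage and processing overhead described above.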