A significant problem encountered in text searching and retrieval systems is how to maximize the relevance and accuracy of the hits returned in response to a query. This problem is made difficult by the fact that in essentially all languages, words are formed by modifying root words to indicate such things as plurals, tense, and part of speech (for example, verb or noun forms). In English, and in many other languages, this is done by attaching suffixes to the word so that, for example, the words teach, teachings, teachers, etc. are all variations of the root word teach. When a user issues a query such as “find me all documents about teaching,” the query system cannot simply look for the word teaching; otherwise, most of the relevant documents may not be found. It must instead try to find all word variants derived from the root “teach” and return those documents in which any such variant occurs.
There are two basic strategies available to do this. The first strategy is to maintain a dictionary of every variant of every word in the language and, when the “teaching” query is issued, translate it into a more complex query looking for every variant. The second strategy is to incorporate a stemming algorithm into both the querying and the indexing process such that all word variants based on the same root are stemmed (that is, they have their suffixes stripped) and indexed/queried via the root word. For example, the “teaching” query is stemmed to “teach,” and it is hoped that, due to the identical stemming in the indexing phase, this will retrieve all the relevant documents. The problem with the huge dictionary approach is that, because there is an almost infinite variety of word forms in most languages, these dictionaries can be completely unmanageable. In addition, the corresponding queries must contain large numbers of “OR” terms and therefore run slowly, resulting in unacceptable performance. For this reason, virtually all text search systems have adopted the stemming approach, since it can be made very fast and efficient.
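The two strategies can be contrasted with a minimal Python sketch. Everything in it is an illustrative assumption rather than part of any described system: the toy suffix list, the sample documents, the hand-written variant list, and the function names `toy_stem`, `build_index`, `search_expanded`, and `search_stemmed` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical toy suffix list, longest first; a real stemmer is far more careful.
SUFFIXES = ("ings", "ers", "ing", "er", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three letters of root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs, normalize=lambda w: w.lower()):
    """Map each normalized term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

# Made-up sample corpus for illustration only.
DOCS = {
    1: "teachers praise the new teaching methods",
    2: "the teachings of the school",
    3: "students teach each other",
    4: "weather report for tuesday",
}

# Strategy 1: unstemmed index; the query is expanded into an OR over
# every variant listed in a (hand-maintained) dictionary.
VARIANTS = {"teaching": ["teach", "teaching", "teachings", "teacher", "teachers"]}

def search_expanded(index, term):
    hits = set()
    for variant in VARIANTS.get(term, [term]):
        hits |= index.get(variant, set())
    return hits

# Strategy 2: stem identically at index time and query time, then do one lookup.
def search_stemmed(index, term):
    return index.get(toy_stem(term), set())

plain_index = build_index(DOCS)
stemmed_index = build_index(DOCS, normalize=toy_stem)

print(plain_index.get("teaching"))                 # literal match only
print(search_expanded(plain_index, "teaching"))    # one lookup per variant
print(search_stemmed(stemmed_index, "teaching"))   # a single lookup on the stem
```

In this sketch both strategies return the same documents, but the expanded query performs one index lookup per listed variant and depends on the variant dictionary being complete, whereas the stemmed query performs a single lookup.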
Stemming algorithms today utilize knowledge of the specific language involved in order to recognize and strip off probable suffixes from a word. Thus the simplest possible stemming algorithm in English (referred to below as the “Simple” stemmer) might be implemented as follows:
if ( word ends in “ies” and ( word does not end in “eies” or “aies” ) ) then
  replace “ies” by “y”
else if ( word ends in “es” and ( word does not end in “aes” or “oes” ) ) then
  replace “es” by “e”
else if ( word ends in “s” and ( word does not end in “us” or “ss” ) ) then
  remove the “s”
else
  no action
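The rules above can be sketched directly in Python. This is a minimal transcription, not a reference implementation: the function name `simple_stem` is hypothetical, and the sketch assumes the input is a single lowercase word.

```python
def simple_stem(word: str) -> str:
    """Apply the first matching rule of the "Simple" stemmer described above.

    Assumes `word` is a single lowercase token (an assumption of this sketch).
    """
    # Rule 1: "ies" -> "y", unless the word ends in "eies" or "aies".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Rule 2: "es" -> "e", unless the word ends in "aes" or "oes".
    if word.endswith("es") and not word.endswith(("aes", "oes")):
        return word[:-2] + "e"
    # Rule 3: drop a final "s", unless the word ends in "us" or "ss".
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    # Otherwise: no action.
    return word


for w in ["ponies", "horses", "teachers", "glass", "corpus", "teaching"]:
    print(w, "->", simple_stem(w))
```

Note that “teaching” passes through unchanged: these rules address only plural-like endings, so “teaching” and “teachers” are not reduced to a common root by this stemmer.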
Even the trivially simple algorithm above contains a great deal of English-specific knowledge, such as the fact that “ies” is usually the plural of a word ending in “y.” There are a number of well-known stemming algorithms that apply to the English language, the most prevalent and famous of which is the “Porter” stemmer (described by M. F. Porter in 1980). Other English language stemming algorithms include the “Krovetz” inflectional stemmer (1993), and the “Lovins” stemmer (1968), which has two variants (the “Iterated Lovins” stemmer is used below). There are a number of less well-known stemming algorithms, though none has measurably improved on that of Porter. All of these stemming algorithms embody very detailed knowledge of the English language in order to make appropriate substitutions, and all of them operate by suffix stripping only.
There is a need therefore in the art for a single stemming algorithm that can be applied to all known languages regardless of script system, without the need to modify the code in any way; that can be adapted for a language by any non-technical person fluent in that language; that has an accuracy near 100% in all languages; that can conflate roots between languages to form a single searchable root set over all languages; and that achieves all this without being hopelessly slow. Such a stemming algorithm could revolutionize text searching as currently practiced. The stemming framework and stemming algorithm disclosed herein together meet these and other needs in the art.