Search engines and document classification systems typically rely on the analysis of source document text in the construction of their search indices or classifiers and for subsequent query processing. Morphological analysis of languages can be used to support computer-based applications, such as natural-language text processing required by search engines or machine learning classifiers, by decomposing text input into affixes and lemmas. A common approach to morphological analysis relies on a small number of possible affixes, so that all possible forms of every word can be easily and efficiently synthesized. In English, French, Spanish, and other European languages, in most cases only one affix is attached to a word. These affixes may be maintained in a file or database together with a description of their morpho-syntactic meaning. These affix lists are used in conjunction with predefined lexicons to analyze the words.
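The affix-list approach described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the affix table, its descriptions, the lexicon entries, and the function name `analyze` are all hypothetical, and only a single suffix per word is considered.

```python
# Hypothetical affix list: suffix -> morpho-syntactic description.
AFFIXES = {
    "s": "plural noun or third-person singular verb",
    "ed": "past tense",
    "ing": "progressive or gerund",
}

# Hypothetical predefined lexicon of known lemmas.
LEXICON = {"walk", "search", "index"}


def analyze(word):
    """Return (lemma, affix, description) candidates for a word."""
    word = word.lower()
    candidates = []
    # The word may already be a lemma with no affix attached.
    if word in LEXICON:
        candidates.append((word, None, "base form"))
    # Try removing exactly one known suffix and check the lexicon.
    for affix, description in AFFIXES.items():
        if word.endswith(affix):
            lemma = word[: -len(affix)]
            if lemma in LEXICON:
                candidates.append((lemma, affix, description))
    return candidates
```

Under these assumptions, `analyze("walked")` yields a single candidate pairing the lemma `walk` with the suffix `ed`, while a word whose stem is not in the lexicon yields no candidates.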
Semitic languages such as Hebrew and Arabic are characterized, in most cases, by three-letter roots from which various parts of speech may be formed. For example, the Hebrew root k-t-v, meaning “write,” may take various forms such as:
Ka-ta-v-nu (we wrote)
Ka-ta-va (article)
Ka-ta-v (he wrote; journalist)
Ni-k-to-v (we will write)
Semitic languages are characterized by prodigious word formation, where words are typically formed by applying predefined patterns of vowels, consonants, prefixes, suffixes, and infixes to roots, making the naive approach mentioned above unsatisfactory for automated word analysis of such languages.
Usually, a root consists mostly of consonants, while a pattern is expressed as a combination of vowels and consonants, with the vowels typically represented by special diacritic signs. Thus, in the above example, the following words may be formed from the Hebrew k-t-v root by applying various patterns as follows:
Ka-ta-v-nu (we wrote) = k-t-v + Pattern CaCaCnu
Ka-ta-va (article) = k-t-v + Pattern CaCaCa
Ka-ta-v (he wrote; journalist) = k-t-v + Pattern CaCaC
Ni-k-to-v (we will write) = k-t-v + Pattern NiCCoC
where each “C” represents a different letter of the root, in the order in which the letters appear in the root.
The standard method of Natural Language Processing used for European languages is based on a relatively small set of morphological rules. However, the set of rules required for Semitic languages may be too large to manage, and it cannot exhaustively cover all possible morphological variants.
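The root-and-pattern combination shown in the k-t-v examples can be sketched in a few lines. This is an illustrative sketch only: the function name `apply_pattern` is hypothetical, the root is assumed to be written with hyphens between its letters, and an uppercase “C” in a pattern is assumed to be reserved as a root-letter slot.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot in the pattern with the next root letter, in order."""
    letters = root.split("-")
    if pattern.count("C") != len(letters):
        raise ValueError("pattern slots do not match the number of root letters")
    remaining = iter(letters)
    # Every 'C' consumes one root letter; other characters are copied as-is
    # (lowercased so the result reads as a plain transliteration).
    return "".join(next(remaining) if ch == "C" else ch.lower() for ch in pattern)


ROOT = "k-t-v"
PATTERNS = ["CaCaCnu", "CaCaCa", "CaCaC", "NiCCoC"]
forms = [apply_pattern(ROOT, p) for p in PATTERNS]
print(forms)  # ['katavnu', 'katava', 'katav', 'niktov']
```

Generating forms this way is straightforward; the difficulty discussed in this section lies in the reverse direction, where the root and pattern must be recovered from a written word.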
In one type of analysis known as “stemming”, affixes are removed from a word, typically until all affixes have been removed from the word, leaving just the word stem or root. When stemming is applied to the words in a source text document, the identified stems are often used for indexing, classification, or query processing, in place of or in addition to the words themselves.
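As a rough illustration of stemming and its use in indexing, the sketch below repeatedly strips suffixes until none applies and then indexes token positions by stem. The suffix list, the minimum stem length, and the names `stem` and `index_stems` are assumptions made for this example; practical stemmers use considerably richer rules.

```python
SUFFIXES = ("ing", "ed", "s")  # hypothetical toy suffix list
MIN_STEM = 3                   # assumed guard against stripping too much


def stem(word):
    """Strip known suffixes repeatedly until no suffix can be removed."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def index_stems(text):
    """Map each stem to the positions of the tokens that reduce to it."""
    index = {}
    for position, token in enumerate(text.lower().split()):
        index.setdefault(stem(token), []).append(position)
    return index


print(index_stems("She walks and walked"))
# {'she': [0], 'walk': [1, 3], 'and': [2]}
```

Because both “walks” and “walked” reduce to the same stem, a query for either form can match both positions in the index.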
In Semitic languages, each word may have a large, though limited, number of affixes and, most importantly, these affixes can be appended to each other in a certain order. Unfortunately, because a consonant can be either part of a root or part of an affix, a list of affixes cannot by itself determine whether a letter is an affix or part of the root. This is illustrated by the following Hebrew words (consonants in bold letters):
sha-ba-t—“Saturday”
sha-va-t—“he was striking (participated in a strike)”
she-ba-t—“that daughter”—in this case “she” is a prefix meaning “that” (a relative pronoun)
Since in Hebrew, for example, the diacritic signs representing vowels are usually not written, the same combination of letters (i.e., the same written word) may have various meanings depending on the context.
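The ambiguity described above can be made concrete with a small sketch that enumerates every reading of an unvocalized letter sequence. Everything here is hypothetical and simplified: the prefix table, the lexicon, and the function name `candidate_analyses` are invented for illustration, and the written form is transliterated as sh-b-t on the assumption that the spoken “b” and “v” in the three examples correspond to one written letter.

```python
# Hypothetical prefix list: written prefix -> meaning.
PREFIXES = {"sh": "that (relative pronoun)"}

# Hypothetical lexicon keyed by unvocalized letter sequences.
LEXICON = {
    ("sh", "b", "t"): ["Saturday", "he was striking"],
    ("b", "t"): ["daughter"],
}


def candidate_analyses(letters):
    """Return every (prefix, gloss) reading of an unvocalized word."""
    letters = tuple(letters)
    results = []
    # Reading 1: the whole sequence is a lexicon entry with no prefix.
    for gloss in LEXICON.get(letters, []):
        results.append((None, gloss))
    # Reading 2: the first letter is a prefix and the remainder is an entry.
    if len(letters) > 1 and letters[0] in PREFIXES:
        for gloss in LEXICON.get(letters[1:], []):
            results.append((letters[0], gloss))
    return results


analyses = candidate_analyses("sh-b-t".split("-"))
for prefix, gloss in analyses:
    print(prefix, gloss)
```

With these toy tables, the single written form produces three candidate readings, two treating the first letter as part of the root and one treating it as a prefix. The affix list alone cannot choose among them, so the choice has to come from context.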