With the explosive growth in the volume of available on-line information, the efficiency of information retrieval (IR) systems becomes increasingly important. IR systems generally operate on a canonical representation of documents, called a “profile,” consisting of a list of indexing units. For text searching, the indexing units are typically words. The profiles are stored in an inverted index, enabling documents to be retrieved by matching the terms in a query phrase to the words in the index. Many IR applications have been developed. One example is the GURU system described by Maarek et al., in an article entitled “An Information Retrieval Approach for Automatically Constructing Software Libraries,” in IEEE Transactions on Software Engineering 17(8), pages 800–813 (August, 1991), which is incorporated herein by reference.
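The profile and inverted-index arrangement described above can be sketched in a few lines. This is a minimal illustration only, assuming whitespace-separated lowercase words as indexing units; the function names (`build_profile`, `build_inverted_index`, `retrieve`) and the sample documents are hypothetical and are not taken from GURU or any other system.

```python
from collections import defaultdict


def build_profile(text):
    """Return a document profile: the list of its indexing units (lowercased words)."""
    return text.lower().split()


def build_inverted_index(documents):
    """Map each indexing unit to the set of ids of documents whose profile contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for unit in build_profile(text):
            index[unit].add(doc_id)
    return index


def retrieve(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = build_profile(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection, for illustration only.
documents = {
    1: "software libraries for retrieval",
    2: "retrieval of hebrew documents",
}
index = build_inverted_index(documents)
```

Because matching is done on exact indexing units, a query for "library" would not find document 1 in this sketch, which is the problem that lexical analysis is meant to address.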
For efficient and thorough searching, it is desirable that variants of a given word, such as singular and plural forms of a noun, or different tenses of a verb, be mapped to the same indexing unit. In other words, a lexical analysis of the words should be invoked so that, ideally, all of them are represented by the same base word. The simplest tool for lexical analysis is a stemmer, which derives base words using ad hoc rules for stripping suffixes and handling exceptional word forms. A more precise method is morphological analysis, using a dictionary and a set of inflection rules to find the lexical base forms of the words in the document. The base form of a given word is referred to as its “lemma.”
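To make the notion of a stemmer concrete, the following is a toy sketch for English, assuming a small hand-written exception table and an ordered list of suffix rules. The rules, the minimum stem length of three letters, and the names `EXCEPTIONS`, `SUFFIX_RULES` and `stem` are all illustrative assumptions; this is not the algorithm of any published stemmer.

```python
# Exceptional word forms that the suffix rules cannot handle (illustrative only).
EXCEPTIONS = {"children": "child", "went": "go", "mice": "mouse"}

# Ordered (suffix, replacement) pairs; the first matching rule is applied.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def stem(word):
    """Return a base word using ad hoc suffix stripping and an exception table."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three letters to remain so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word
```

The output need not be a dictionary word: in this sketch both "retrieved" and "retrieves" map to "retriev", which is sufficient for indexing because the variants share one indexing unit. A morphological analyzer would instead return the lemma "retrieve".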
English language morphology is simple enough so that even stemmers do an adequate job of analysis for most applications. Hebrew, however, like other Semitic languages, is highly synthetic and rich in variants. In standard Hebrew writing, not all of the vowels are represented, while several letters may represent either a vowel or a consonant. A given lexical root may be declined by insertion, deletion, substitution or affixation of letters. It is often difficult to determine which letters in a word belong to the lemma, and which have been added. For example, the Hebrew word mishtara can be analyzed correctly as any of:
        Mishtara (police)
        Mishtar+a (her regime)
        Mi+shtar+a (from her bill)
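The ambiguity in the mishtara example can be reproduced with a toy segmenter that tries every combination of an optional prefix and an optional suffix around a lexicon entry. This is a simplified sketch that works on Latin transliterations and handles affixation only; the three-entry lexicon, the affix tables and the function name `analyze` are assumptions made for illustration. Real Hebrew analysis operates on unvocalized Hebrew script and must also handle insertion, deletion and substitution of letters.

```python
# Tiny transliterated lexicon and affix tables (illustrative assumptions).
LEXICON = {"mishtara": "police", "mishtar": "regime", "shtar": "bill"}
PREFIXES = {"": "", "mi": "from"}
SUFFIXES = {"": "", "a": "her"}


def analyze(word):
    """Return every prefix+base+suffix segmentation whose base is in the lexicon."""
    analyses = []
    for prefix in PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in SUFFIXES:
            if not rest.endswith(suffix):
                continue
            base = rest[: len(rest) - len(suffix)]
            if base in LEXICON:
                # Join only the non-empty parts, e.g. "mi+shtar+a".
                parts = [p for p in (prefix, base, suffix) if p]
                analyses.append("+".join(parts))
    return analyses
```

Calling `analyze("mishtara")` yields three segmentations corresponding to the three readings listed above, and nothing in the word itself indicates which one is intended.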
The result of this complex morphology is a high level of ambiguity, which cannot be resolved unequivocally without contextual information. Therefore, Hebrew morphological analyzers typically return multiple possible analyses for a given word. An example of a morphological analyzer with Hebrew capabilities is the POE LanguageWare system (version 2.6), offered by the IBM Software Solutions Division, of Research Triangle Park, N.C. For each legal Hebrew input string, this analyzer returns all legal lexical candidates as possible analyses of the given string, along with the following characteristics of each candidate:
        Lemma: the base form used for indexing.
        Category: categorization of the lemma according to part of speech, gender, plural inflections, legal set of prefixes and legal set of suffixes.
        Part of speech.
        Prefix: attached particles.
        Correct form (optional): the input word with additional vowel letters added to enable the given analysis.
        Number, gender, person.
        Status (for non-verbs only): whether this lemma is in a construct (nismach) or in its absolute (nifrad) form.
        Tense (for verbs only).
        Conjugation pattern (binyan and gizra, for verbs only).
        Inf_num, inf_gen, inf_person: number, gender and person of the pronominal suffix, which is added to Hebrew words to indicate, for example, possessives or verb objects.
On average, this analyzer returns 2.15 analyses for each input string.
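The characteristics listed above can be pictured as one record per candidate analysis. The sketch below is a hypothetical data structure that mirrors that list; it is not the actual interface of the POE LanguageWare system. The field names, the placeholder values and the helper `index_units` are all assumptions, and only the lemma values follow the mishtara example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Analysis:
    """One candidate analysis of an input string (hypothetical layout)."""
    lemma: str                          # base form used for indexing
    category: str                       # lexical category of the lemma
    part_of_speech: str
    prefix: str = ""                    # attached particles, if any
    correct_form: Optional[str] = None  # optional fuller spelling
    number: Optional[str] = None
    gender: Optional[str] = None
    person: Optional[str] = None
    status: Optional[str] = None        # non-verbs only: construct or absolute
    tense: Optional[str] = None         # verbs only
    binyan: Optional[str] = None        # verbs only
    gizra: Optional[str] = None         # verbs only
    inf_num: Optional[str] = None       # number of pronominal suffix
    inf_gen: Optional[str] = None       # gender of pronominal suffix
    inf_person: Optional[str] = None    # person of pronominal suffix


def index_units(analyses):
    """Return the distinct lemmas of the candidates, in their original order."""
    return list(dict.fromkeys(a.lemma for a in analyses))


# Placeholder candidates for the transliterated word "mishtara".
candidates = [
    Analysis(lemma="mishtara", category="noun-f", part_of_speech="noun"),
    Analysis(lemma="mishtar", category="noun-m", part_of_speech="noun",
             inf_num="singular", inf_gen="feminine", inf_person="3"),
    Analysis(lemma="shtar", category="noun-m", part_of_speech="noun", prefix="mi",
             inf_num="singular", inf_gen="feminine", inf_person="3"),
]
```

With several candidates per word, an indexer must either store every lemma returned, as `index_units` does here, or choose among them.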
A number of methods have been proposed for resolving the ambiguity of Hebrew morphological analysis. Most methods use contextual information. Levinger et al. describe a context-free method in an article entitled “Learning Morpho-Lexical Probabilities from an Untagged Corpus with an Application to Hebrew,” published in Computational Linguistics 21(3), pages 383–404 (1995), which is incorporated herein by reference. In studying a large Hebrew text base, the authors found that 55% of the words had more than one morphological reading, and 33% had more than two. They describe a method of disambiguation based on gathering statistics on the text base, so as to determine, for each word, a morpho-lexical probability for each of its alternative analyses, indicating the likelihood that the analysis is correct. An analysis of a given word that has a significantly higher probability than the alternatives is taken to be the correct one, regardless of the context and the form of the word.
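The selection step of such a context-free approach can be sketched as follows, assuming the morpho-lexical probabilities have already been estimated. The text says only that the chosen analysis must be "significantly" more probable than the alternatives, so the ratio test and its default of 2.0 are assumptions made here for illustration, as are the function name `pick_analysis` and the probability values in the usage note. This is not the procedure of Levinger et al.

```python
def pick_analysis(probabilities, ratio=2.0):
    """Pick the dominant analysis of a word, or None if no analysis dominates.

    `probabilities` maps each alternative analysis to its morpho-lexical
    probability. The best analysis is accepted only if its probability is at
    least `ratio` times that of the runner-up (a hypothetical criterion).
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        # An unambiguous word needs no disambiguation.
        return ranked[0][0]
    best, p_best = ranked[0]
    _, p_next = ranked[1]
    return best if p_best >= ratio * p_next else None
```

For example, with made-up probabilities `{"mishtara": 0.9, "mishtar+a": 0.08, "mi+shtar+a": 0.02}` the function returns `"mishtara"`, whereas for `{"a": 0.5, "b": 0.4}` it returns `None`, leaving the word ambiguous.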