1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular to a computer implemented method, an apparatus and a computer program product for unsupervised stemming schema learning and lexicon acquisition from corpora.
2. Description of the Related Art
A stem is the base or root of a word. A stem may be combined with derivational affixes, and inflectional affixes may then be added to produce the final form of a word. A stem consists, at a minimum, of a root, but may also be analyzable into a root form and additional derivational morphemes. A stem may require an inflectional operation, such as adding a prefix or a suffix, to make the stem into a fully understandable word. If a stem does not occur by itself in a meaningful way in a language, it is referred to as a bound morpheme.
For example, in the English language, the simple verb “tie” is a stem and cannot be further reduced. Adding the suffix “s” to the stem forms “ties,” and adding the prefix “un” to “ties” forms “unties.” An affix is a morpheme added to a word to change its function or meaning. Two typical kinds of affix are a prefix, which is a morpheme added to the beginning of a word, and a suffix, which is a morpheme added to the end of a word, as just shown.
Stemming is a process of reducing a word to a primitive base form by peeling away prefixes and suffixes. Morphological analysis comprises stripping away or removing prefixes and suffixes to reduce a word to its simple basic form or root. Reduction of a word typically involves application of various rules implied by morphology. Morphology, a branch of linguistics, studies patterns of word-formation within and across languages. Based on analysis of the patterns, rules may be formulated to model the formation of words in a given language.
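By way of illustration only, the peeling away of prefixes and suffixes described above may be sketched as follows. The affix lists, the function name strip_affixes, and the min_stem_len parameter are assumptions introduced for this sketch and are not taken from any particular stemmer; a practical stemmer would use a far richer rule set.

```python
# Minimal affix-stripping sketch. The affix lists below are hypothetical
# and deliberately tiny; they are assumptions made for illustration.
PREFIXES = ("un", "re")
SUFFIXES = ("es", "s", "ed", "ing")


def strip_affixes(word, min_stem_len=3):
    """Remove at most one known prefix and one known suffix from a word.

    An affix is removed only if at least min_stem_len characters remain,
    which keeps very short words from being reduced to fragments.
    """
    stem = word.lower()

    # Peel one prefix from the beginning of the word, if any matches.
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_stem_len:
            stem = stem[len(prefix):]
            break

    # Peel one suffix from the end of the word, if any matches.
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_stem_len:
            stem = stem[:-len(suffix)]
            break

    return stem


if __name__ == "__main__":
    for w in ("tie", "ties", "unties"):
        print(w, "->", strip_affixes(w))
```

Under these assumed lists, “tie,” “ties,” and “unties” all reduce to the same stem. The sketch is purely string based and therefore also strips letter sequences that only look like affixes, which is one reason hand-written rule sets require careful linguistic review.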
In many languages, words are related to other words by rules. The study of the rules is called morphology. For example, the words “dog” and “dogs” are closely related, apparently having the same visible root. People recognize these relations from their knowledge of the rules of word-formation for the English language. This observed “rule” may be applied to other words to show a similar relationship, such as “dish” and “dishes.” The rules then reflect specific patterns of use based on the manner in which words are generated from pieces.
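As a hedged illustration of such a word-formation rule, the relationship between “dog” and “dogs” and between “dish” and “dishes” may be modeled as shown below. The function name pluralize and the list of endings are assumptions made for this sketch; the rule is a simplification of regular English plural formation and ignores irregular nouns and spelling changes.

```python
# Simplified sketch of one English word-formation rule (regular plurals).
# Assumption: nouns ending in a sibilant spelling take "es"; all other
# nouns take "s". Irregular plurals are deliberately not handled.
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def pluralize(noun):
    """Generate a plural form from a stem using the simplified rule."""
    if noun.endswith(SIBILANT_ENDINGS):
        return noun + "es"
    return noun + "s"


if __name__ == "__main__":
    for stem in ("dog", "dish"):
        print(stem, "->", pluralize(stem))
```

Here the same rule generates both word pairs from their stems, which is the sense in which a rule reflects how words are built from pieces.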
Familiarity with high-frequency affixes and roots, recognized as meaningful chunks, promotes comprehension of the numerous words in which they occur. Morphology is typically used to analyze a body of text, called a corpus, to form a lexicon. When there is more than one body of text to be analyzed, the collection is referred to as corpora. A lexicon is an ordered collection of words and associated definitions. A dictionary is an example of a lexicon. Typically, morphology requires careful analysis of words within a context, according to established rules, by linguists using previously learned knowledge and lexical resources. The process is time consuming, laborious, and costly.
Therefore, it would be advantageous to have a method, apparatus, and computer program product for stemming words in a manner that overcomes some or all of the problems discussed above.