1. Field of the Invention
This invention relates generally to the field of computer lexical acquisition and text analysis, and more specifically to finding affixes (prefixes, infixes, and suffixes) in words. Even more specifically, the preferred embodiment of the present invention provides a tool for automatically collecting affixes of a language from one or more documents, and for saving those affixes in a database for future use.
2. Background Art
Knowledge of affixes is important for analyzing existing words and also for producing new words. This knowledge helps to improve search results and to identify newly created words in text.
Affixes make it possible to group words into a canonical form. If we have well-compiled lists of prefixes and suffixes, and if we are given the word “play,” then we know that “played,” “playing,” “plays,” “replay,” “replayed,” and so on are variants of “play.” Affixes also add meanings to existing words. For example, “or” is a suffix, and when it is attached to a verb, it means “a person who performs the action described by the verb.” Thus, we can interpret “actor” as “act” and “or,” and understand that “actor” means a person who acts. “Antiasthmatic” is composed of “anti” and “asthmatic” and means something that works against “asthma.”
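The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of the invention: the affix lists, the one-word stem lexicon, and the function name canonical_form are hypothetical, and the sketch removes at most one prefix and one suffix.

```python
# Hypothetical, hand-picked affix lists and stem lexicon (assumptions for illustration).
PREFIXES = ("re",)
SUFFIXES = ("ed", "ing", "s")
STEMS = {"play"}

def canonical_form(word):
    """Return a known stem reachable by removing at most one listed prefix
    and at most one listed suffix; return the word unchanged otherwise."""
    for prefix in ("",) + PREFIXES:
        if not word.startswith(prefix):
            continue
        rest = word[len(prefix):]
        for suffix in ("",) + SUFFIXES:
            if suffix and not rest.endswith(suffix):
                continue
            # An empty suffix leaves `rest` untouched.
            stem = rest[:len(rest) - len(suffix)]
            if stem in STEMS:
                return stem
    return word

variants = ["play", "played", "playing", "plays", "replay", "replayed"]
groups = {w: canonical_form(w) for w in variants}
```

Under these assumptions every listed variant maps to “play,” while a word such as “actor,” whose stem is not in the tiny lexicon, is returned unchanged. The quality of the grouping therefore depends directly on how complete the affix lists are.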
Many previously unseen words are found in text due to the advances of new technologies and the creativity of human beings. A common way of generating new words is creating morphological variations of existing words, such as auto-inject-or and co-sponsor-ship. These newly created words, which are unknown to the lexicon, cause many problems for Natural Language Processing (NLP) systems.
However, if we can segment a new word into the stem and the affix(es), we can obtain linguistic information about the word such as the possible parts-of-speech and the meaning. Most present systems for morphological analysis and out-of-vocabulary handling require a precompiled list of affixes and morphological rules specifying how each affix can apply. It is very difficult and time-intensive to acquire a complete list of affixes of a language by hand.
Recently, efforts have been made to identify morphemes automatically. For instance, such efforts are disclosed in “Discovering Morphemic Suffixes: A Case Study in MDL Induction,” by Brent, et al., in Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale, Fla. (1995) (Brent, et al.); “Unsupervised Learning of Naïve Morphology with Genetic Algorithms,” by Dimitar Kazakov, Pages 105 to 112, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, Prague, Czech Republic (April 1997) (Kazakov); “Knowledge-free Induction of Inflectional Morphologies,” by Schone, et al., Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL) (2001) (Schone, et al.); “Unsupervised Learning of Morphology Using a Novel Directed Search Algorithm: Taking the First Step,” by Snover, et al., Association for Computational Linguistics (ACL-2002): Workshop on Morphological and Phonological Learning (Snover, et al.); “Unsupervised Learning of the Morphology of a Natural Language,” by John Goldsmith, Computational Linguistics, Volume 27, Number 2 (2001) (Goldsmith); and “Using Distributional Information to Discover Morphemes: An Automated Distribution-Driven Prefix Learner,” by Marco Baroni, presented at the Morphology Meeting, Vienna, Austria, February 2002 (Baroni).
More specifically, Brent, et al. describe a system that aims to find the right set of suffixes based on Minimum Description Length (MDL). This system, though, requires a part-of-speech tagged corpus. Kazakov also describes an MDL-based algorithm for finding suffixes in raw text. This algorithm uses a simple genetic algorithm with MDL as the fitness function. Goldsmith describes another MDL-based system that attempts to provide both a list of morphemes and an analysis of each word in a corpus. He considers every possible split of each word in a corpus, and uses MDL as well as triage procedures to eliminate inappropriate parses. Schone, et al. employ many sophisticated post-hoc adjustments to obtain the right conflation sets for words. Snover, et al. describe a system for the unsupervised learning of morphological suffixes and stems from a list of the most common words in a language. This system tries to detect the final stem and suffix break of each word.
The first stage in discovering prefixes or suffixes is to find a good division of a word into prefix, stem and suffix. Previous approaches parse words into two pieces: prefix and stem, or stem and suffix, depending on their goals. The most common way to parse a word in these systems is referred to as “split-all”; that is, to consider all possible splits. In this case, there are w-1 possible splits for a word of w characters.
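As a minimal sketch of the “split-all” enumeration described above (the function name split_all is hypothetical and is not taken from any of the cited systems), the following Python lists every two-part split of a word in which both parts are non-empty:

```python
def split_all(word):
    """Return every two-part split (left, right) of `word`
    in which both parts are non-empty."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]

# A word of w characters yields w - 1 candidate splits.
splits = split_all("replay")
for left, right in splits:
    print(left, "+", right)
```

For the six-letter word “replay” this yields five candidate splits, one of which is “re” + “play.” Each candidate must then be scored by the system, which is where the computational cost arises.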
This method is simple but computationally expensive. For instance, Kazakov reports that the computational complexity of the algorithm limits the size of the input lists to hundreds of words, and that the algorithm took eight and a half hours to find suffixes from 120 words on a 90 MHz Pentium platform.
Previous approaches limit the length of an affix to reduce the size of the search space. Otherwise, those systems become impractical because of the large number of possible parses of words and the complicated computation of statistical information. For instance, Goldsmith limits the length of a suffix to five letters, arguing that no grammatical morphemes require more than five letters in familiar languages. However, longer prefixes and suffixes are easily found in many documents. Some examples from biomedical documents are such prefixes as pharmaco, plethysmo and ventriculo and such suffixes as ectomy, ability and ogenesis.
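The effect of such a length limit can be illustrated with a short Python sketch. The function name candidate_suffixes and the example word “appendectomy” are assumptions chosen for illustration; the five-letter cap and the suffix “ectomy” come from the discussion above.

```python
def candidate_suffixes(word, max_len=None):
    """Collect the suffix candidates produced by a split-all pass.
    If `max_len` is given, suffixes longer than `max_len` are discarded."""
    candidates = []
    for i in range(1, len(word)):
        suffix = word[i:]
        if max_len is None or len(suffix) <= max_len:
            candidates.append(suffix)
    return candidates

word = "appendectomy"  # illustrative word, not from the cited references
capped = candidate_suffixes(word, max_len=5)
uncapped = candidate_suffixes(word)

# The six-letter suffix "ectomy" survives only when no cap is applied.
print("ectomy" in capped, "ectomy" in uncapped)
```

The cap shrinks the candidate set from eleven suffixes to five for this word, but the six-letter suffix “ectomy” is among those discarded, so no later scoring step can recover it.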
Baroni aims at finding a set of prefixes from a corpus based on distributional cues, which include distribution, frequency and length of words and their substrings in the input data.
The prior art systems have a number of disadvantages and limitations. For instance, the prior art fails to discover both prefixes and suffixes at the same time, and the prior art does not automatically analyze documents to find infixes. The prior art splits a word into only two parts: “a stem”+“a suffix,” or “a prefix”+“a stem.” Thus, a system that splits a word into a stem and a suffix cannot discover a prefix attached to the stem. Likewise, a system that splits a word into a prefix and a stem cannot find a suffix attached to the stem.
Also, the prior art cannot discover nested affixes (more than one prefix or suffix, such as the prefix “radio-immuno” and the suffix “less-ness”). The limitation on the length of an affix keeps the prior art from finding many affixes that appear in technical documents, for example documents in the fields of medicine, engineering, etc. In addition, the prior art fails to find affixes containing non-alphabetic characters such as digits and hyphens. These kinds of affixes often appear in technical documents in domains such as biomedicine and chemistry.