The present invention relates generally to a method and apparatus for learning the morphology of a natural language. More particularly, the present invention relates to unsupervised acquisition, particularly using automatic means such as a computer, of the morphology of languages, particularly European languages.
The morphology of a language describes the structure of the words which form that language. The structure consists, generally, of stems or roots along with affixes such as prefixes and suffixes which modify the stem. Morphology involves collecting and organizing stems and associated affixes.
Development of morphologies of languages is a desirable goal. Previously, a morphology for a language has been developed by a trained linguist working manually to identify the appropriate stems and affixes and other structural features of a language. Such a project requires several man-weeks or more to accomplish. A better solution would involve unsupervised learning by automatic means, such as a programmed general purpose computer operating only on an input of a large corpus of a language.
Developing an unsupervised learner using raw text as its sole input offers several attractive aspects, both theoretical and practical. At its most theoretical, unsupervised learning constitutes a (partial) linguistic theory, producing a completely explicit relationship between data and analysis of that data. A tradition of considerable age in linguistic theory sees the ultimate justification of an analysis A of any single language L as residing in our ability to show that analysis A derives from a particular linguistic theory LT, and that LT works properly across a range of languages (that is, not just for language L).
The development of a fully automated morphology-generating tool would be of considerable interest. Good morphologies of many European languages are yet to be developed. With the advent of considerable historical text available online (such as the ARTFL database of historical French), it is of great interest to develop morphologies of particular stages of a language, and the process of automatic morphology-writing can simplify this task considerably where no native speakers are available. Such a system can also be used as a stemmer in the context of an information retrieval system, identifying stems and multiplets of related stems.
A third motivation for developing such a system is that it can serve as a preparatory phase for an unsupervised grammar acquisition system. A significant proportion of the words in a large corpus can be assigned to categories, though the labels that are assigned by the morphological analysis are corpus-internal; nonetheless, the assignment of words into distinct morphologically motivated categories can be of great service to a syntax-acquisition device.
Development of language morphologies has value outside of purely linguistic endeavors. In many word processing programs, spell checking routines cannot properly recognize all forms of some words. As such routines are expanded to include new languages, the morphology of the new languages is necessary to efficiently implement the routine. The same is true of speech recognition routines.
There has been much less work done on automatic acquisition in the area of morphology than in the areas of syntactic analysis or part-of-speech tagging. Several researchers have explored the morphophonologies of natural language in the context of two-level systems in the style of the model developed by Kimmo Koskenniemi [Koskenniemi, Kimmo. Two-level Morphology: A General Computational Model for Word-form Recognition and Production, Publication no. 11, Department of General Linguistics, University of Helsinki, Helsinki (1983)], Lauri Karttunen [Karttunen, Lauri. Finite State Constraints. In John Goldsmith (ed.), The Last Phonological Rule, pp. 173-194, Chicago, University of Chicago Press (1993)], M. R. Brent [Brent, M. R., Minimal Generative Models: A Middle Ground Between Neurons and Triggers, in Proceedings of the 15th Annual Conference of the Cognitive Science Society, pp. 28-36, Hillsdale, N.J., Lawrence Erlbaum Associates] and others. Morphophonology is the study of the changes in the realization of a given morpheme that are dependent on the context in which it appears. This is for the most part the problem of the treatment of regular allomorphy (also known as automatic morphophonology), that is, the treatment of the principles that determine how two surface forms may be accounted for as the realization of a single string of characters, subject to rules modifying those characters in particular contexts.
One closely related effort is that attributed to Andreev [Andreev, N. D. (ed.), Statistiko-kombinatornoe modelirovanie iazykov, Moscow and Leningrad, Nauka (1965)] and discussed in Altmann and Lehfeldt [Altmann, Gabriel and Werner Lehfeldt, Einführung in die Quantitative Phonologie, Quantitative Linguistics vol. 7, Bochum, Studienverlag Dr. N. Brockmeyer (1980), esp. pp. 195ff], though their description is limited and does not facilitate comparison with the present approach. Dzeroski and Erjavec [Dzeroski, S. and T. Erjavec, Induction of Slovene nominal paradigms, in Nada Lavrac, Saso Dzeroski (eds.), Inductive Logic Programming, 7th International Workshop, ILP-97, Prague, Czech Republic, Sep. 17-20, 1997, Proceedings, Lecture Notes in Computer Science, Vol. 1297, Springer, pp. 17-20 (1997)] report on work that they have done on Slovene, a South Slavic language with a complex morphology, in the context of a similar project. Their goal essentially was to see whether an inductive logic program could infer the principles of Slovene morphology to the point where it could correctly predict the nominative singular form of a word when given an oblique (non-nominative) form. Their project apparently shares with the present invention the requirement that the automatic learning algorithm be responsible for the decision as to which letters constitute the stem and which are part of the suffix(es), though the details offered by Dzeroski and Erjavec are sketchy as to how this is accomplished. The learning algorithm is presented with a labeled pair of words: a base form and an inflected form. It is not clear from their description whether the base form that they supply is a surface form from a particular point in the inflectional paradigm (the nominative singular) or a more articulated underlying representation in a generative linguistic sense; the former appears to be their policy.
Dzeroski and Erjavec's goal is the development of rules couched in traditional linguistic terms; the categories used to label the corpus and used in the analysis are decided upon ahead of time by the programmer (or, more specifically, the tagger of the corpus), and each individual word is identified with regard to what morphosyntactic features it bears. The form bolecina is marked, for example, as a feminine singular genitive noun.
In some respects, the problem in the present application is akin to the problem of word-breaking in Asian languages, a problem that has been tackled by a wide range of researchers. From the current perspective, the most interesting attempt is that provided by de Marcken [de Marcken, Carl. Unsupervised Language Acquisition, Ph.D. dissertation, MIT (1995)], which describes an unsupervised learning algorithm for the development of a lexicon that de Marcken applies to a written corpus of Chinese, as well as to written and spoken corpora of English (the English text has had the spaces between words removed). De Marcken employs a Minimal Description Length (MDL) framework [Rissanen, Jorma. Stochastic Complexity in Statistical Inquiry. World Scientific Publishing Co. (1989)] and reports excellent results. The algorithm begins by taking all individual characters to be the baseline lexicon, and it successively adds items to the lexicon if the items will be useful in creating a better compression of the corpus in question. In general, a lexical item of frequency F can be associated with a compressed length of -log2 F, and de Marcken's algorithm computes the compressed length of the Viterbi-best parse of the corpus. The compressed length of the whole is the sum of the compressed lengths of the individual words (or hypothesized chunks) plus that of the lexicon. In general, the addition of chunks to the lexicon (beginning with such high-frequency items as "th") will improve the compression of the corpus as a whole, and de Marcken shows that successive iterations add successively larger pieces to the lexicon.
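The cost computation just described can be sketched briefly in Python. The lexicon, counts, and parses below are invented for illustration; this is not de Marcken's implementation, only the core arithmetic: each occurrence of a lexical item costs -log2 of its relative frequency, and a parse costs the sum over its chunks.

```python
import math

def code_length(count, total):
    """Idealized compressed length (in bits) of one occurrence of a
    lexical item: -log2 of its relative frequency."""
    return -math.log2(count / total)

# Toy lexicon with raw counts, as if gathered from a corpus.
counts = {"th": 500, "e": 900, "the": 300, "dog": 40}
total = sum(counts.values())

def parse_cost(chunks):
    """Cost of one parse of a string: the sum of the code lengths of
    its chunks (the cost of the lexicon itself is omitted here)."""
    return sum(code_length(counts[c], total) for c in chunks)

# Once "the" is frequent enough to earn a place in the lexicon,
# parsing it as one chunk is cheaper than parsing it as "th" + "e".
print(parse_cost(["the"]) < parse_cost(["th", "e"]))  # True
```

This is why adding larger chunks to the lexicon can pay for itself: a single moderately frequent chunk is often cheaper than the sum of its more frequent parts.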
Applying de Marcken's algorithm to a corpus of a language such as English from which the white space has not been removed provides interesting results: essentially an excellent system for compressing words of the language in question. However, the results are too uncontrolled linguistically, since they do not correspond well to the linguist's morphemes, and they do not provide much knowledge about the present problem of morphology extraction. In particular, the morphology task is especially interested in the cooccurrence relationships between stem and affix, while de Marcken's MDL-based approach specifically eschews such matters. In addition, it was not designed to speak to the question of stem allomorphy.
The present application is particularly addressed to the stem-suffix structure possessed by most words in Indo-European languages, which is to say that the stems and inflectional suffixes will generally appear in words of no more than two morphemes (i.e., of the form stem+affix). Languages such as Hungarian and Finnish do not have this characteristic: in such languages the average number of morphemes per word is considerably greater than two, and identification of the morphological system cannot be bootstrapped from the occurrences of stems and suffixes found in words of two morphemes.
De Marcken's MDL-based figure of merit for the analysis of a substring of the corpus is the sum of the negative log frequencies of the components of the string in question. The best analysis is the one that minimizes that number (which is the theoretical length of the compressed form of that substring). One might try using this figure of merit on words in English, constrained to analyzing all words into exactly two pieces. The results are of no interest, however: words will always be divided into two pieces where one of the pieces is the first or the last letter of the word, since individual letters are so much more common than morphemes.
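The degenerate behavior can be demonstrated numerically. The frequencies below are invented stand-ins chosen only to reflect the one fact that matters: in a real corpus, single letters are vastly more frequent than any multi-letter morpheme, so the cheapest two-way split peels off a single letter.

```python
import math

# Illustrative counts only (not from a real corpus): single letters
# dominate multi-letter morphemes by orders of magnitude.
freq = {"j": 9000, "g": 8000, "jump": 40, "ing": 60,
        "umping": 2, "jumpin": 1}
total = 100000

def cost(piece):
    """Theoretical compressed length of one piece, in bits."""
    return -math.log2(freq[piece] / total)

# Score the linguistically correct split against the two splits
# that peel off a single letter.
splits = [("j", "umping"), ("jump", "ing"), ("jumpin", "g")]
best = min(splits, key=lambda p: cost(p[0]) + cost(p[1]))
print(best)  # ('j', 'umping'): the single-letter split wins
```

The high frequency of "j" makes its code length tiny, which more than offsets the long code for the rare residue "umping"; the split jump+ing, though linguistically correct, costs the most of the three.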
Accordingly, there is a need for a method and apparatus for automatically learning the morphology of a natural language.
An apparatus and method in accordance with the present invention provide automatic and unsupervised morphological analysis of a natural language. Prefixes, suffixes and stems of the language in question are determined, along with regular patterns of suffixation. In addition, major patterns of stem allomorphy are determined.
The method and apparatus divide naturally into three subareas. First is the determination of the correct morphological split for individual words. Second is the establishment of accurate categories of stems based on the range of suffixes that they accept. Third is the identification of allomorphs of the same stem, the so-called irregular stems. Each of these is considered in more detail.
1. Splitting words: The goal is to accurately analyze any word into successive morphemes in a fashion that corresponds to the traditional linguistic analysis. Minimally, the system identifies the stem, as opposed to any inflectional suffixes. Ideally, the system also identifies all the inflectional suffixes on a word which contains a stem that is followed by two or more inflectional suffixes, as well as identifying derivational prefixes and suffixes. The system indicates that in this corpus, the most important suffixes are -s, -ing, -ed, and so forth, while in the next corpus, the most important suffixes are -e, -en, -heit, -ig, and so on. Of course, the system is not a language identification program, so it will not name the first as "English" and the second as "German," but it will perform the task of deciding for each word what is stem and what is affix.
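One simple way to make the splitting task concrete is sketched below. It is not the claimed method, only a minimal illustrative heuristic over an invented toy vocabulary: cut each word where both the candidate stem and the candidate suffix recur most often across the other words.

```python
from collections import Counter

# Toy vocabulary; a real run would use all the words of a large corpus.
words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking"]

# Count every candidate stem (prefix) and suffix across all cut points.
stems, suffixes = Counter(), Counter()
for w in words:
    for i in range(1, len(w)):          # every non-degenerate cut
        stems[w[:i]] += 1
        suffixes[w[i:]] += 1

def split(word):
    """Cut where both halves recur most often in the vocabulary."""
    return max(((word[:i], word[i:]) for i in range(1, len(word))),
               key=lambda p: stems[p[0]] * suffixes[p[1]])

print(split("jumping"))   # ('jump', 'ing')
print(split("walked"))    # ('walk', 'ed')
```

Because "jump" recurs as a prefix of several words and "-ing" recurs as an ending of several words, their product dominates single-letter cuts; recurrence across the vocabulary is what rescues the split from the degenerate behavior noted above.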
2. Range of suffixes: The most salient characteristic of a stem in the languages considered here is the range of suffixes with which it can appear. Adjectives in English, for example, will appear with some subset of the suffixes -er/-est/-ity/-ness, etc. The system determines automatically what the range of the most regular suffix groups is for the language in question, and ranks suffix groupings by order of frequency in the corpus.
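The grouping of stems by the range of suffixes they accept can be sketched as follows. The (stem, suffix) splits below are assumed inputs invented for illustration, as if produced by an earlier splitting step, with "NULL" standing for the bare stem:

```python
from collections import defaultdict

# Assumed (stem, suffix) pairs from a prior splitting step (toy data).
splits = [("jump", "NULL"), ("jump", "s"), ("jump", "ed"), ("jump", "ing"),
          ("walk", "NULL"), ("walk", "s"), ("walk", "ed"), ("walk", "ing"),
          ("quick", "er"), ("quick", "est"), ("quick", "ly"),
          ("slow", "er"), ("slow", "est"), ("slow", "ly")]

# Collect the set of suffixes seen with each stem.
suffixes_of = defaultdict(set)
for stem, suffix in splits:
    suffixes_of[stem].add(suffix)

# A suffix group is a set of suffixes; rank groups by the number of
# stems that share exactly that set.
signatures = defaultdict(list)
for stem, sufs in suffixes_of.items():
    signatures[frozenset(sufs)].append(stem)

for sig, group in sorted(signatures.items(), key=lambda kv: -len(kv[1])):
    print(sorted(sig), "->", sorted(group))
```

In this toy run, jump and walk share one suffix group (NULL/-s/-ed/-ing, a verb-like pattern) and quick and slow share another (-er/-est/-ly, an adjective-like pattern); ranking such groups by the number of stems they cover yields the most regular suffix groups of the corpus.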
3. Stem allomorphy: All languages exhibit allomorphy, that is, cases of distinct stems that should be identified as corresponding to a single lemma. In a few cases, the stem allomorphy is radically replacive, as in the past stem went for the English verb go. Such extreme cases typically arise from the collapsing of two distinct words in the history of a language, as was the case in this example, for English has indeed taken the past form of the verb wend and assigned it to the verb go. Much more common is stem allomorphy based on regular or semi-regular patterns. In English, many cases of this exist in the verbal system, going back thousands of years to Indo-European alternations, as in sing/sang or catch/caught. Since the present system is concerned exclusively with written text, we can include in this category the stem allomorphy found in stems with short vowels: hem/hemm, say, as in hem/hem-s/hemm-ed/hemm-ing. In Spanish, many verbs have two stems, one with a simple vowel when the stress falls on the suffix, and one with a diphthong when stress falls on the stem (duerm-/dorm- 'sleep'); in German, many lemmas have an umlauted and a non-umlauted form (Haus-/Häus- 'house', used in the singular and the plural forms, respectively). An automatic morphological analyzer should identify such regular and semiregular patterns as much as possible, though identification of completely suppletive forms is impossible if one does not take higher-order syntactic information into account.
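One of the written-English patterns mentioned above, final-consonant doubling (hem/hemm), lends itself to a very small detection sketch. The function below is a hedged illustration over an invented stem list, not the claimed allomorphy component, and covers only this one pattern:

```python
def doubling_allomorphs(stems):
    """Pair stems like 'hem' and 'hemm' that differ only by a doubled
    final consonant, a common pattern in written English where the
    doubled form appears before vowel-initial suffixes (-ed, -ing)."""
    stem_set = set(stems)
    return {(s, s + s[-1]) for s in stem_set if s + s[-1] in stem_set}

# Stems as they might emerge from a prior splitting step (toy data).
print(doubling_allomorphs(["hem", "hemm", "run", "runn", "walk"]))
# {('hem', 'hemm'), ('run', 'runn')}
```

Each detected pair can then be collapsed to a single lemma; analogous pattern detectors (vowel alternation, umlaut) would follow the same shape, while fully suppletive pairs such as go/went remain out of reach by design.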