There is great interest in applying the principles of morphology to problems in natural language processing in the fields of information management and data processing. Morphology is the part of linguistic study concerned with the formation of words in a natural language (such as English, French, German or Russian). It is the study of what people do every day to form words in speech and writing. Of particular interest is how words are formed to convey meaning in different grammatical contexts.
People communicate by organizing words into larger structures like phrases, clauses, sentences and paragraphs, according to the accepted grammatical, syntactic and spelling conventions of a given natural language. The grammar of a language groups words into categories that denote the principal parts of the language (or parts of speech), such as nouns, verbs, adjectives and adverbs. Within each part of speech category, a word can have one of several possible inflections, each marking distinctions in use such as gender, number, tense, person, mood or voice. A word, of course, can belong to more than one part of speech. When a word is used in a sentence, it appears in an inflected form that evokes the general meaning of the word and also marks some of the grammatical distinctions in use. For example, the idea of existence or "being" is conveyed in the different forms of the verb "to be". In these sentences:
    I am tall.
    She is taller.
    We are the tallest people in the room.
the different inflected forms of the verb "to be" convey the idea of "being", tailored to the different subjects (e.g., first-person singular, third-person singular, first-person plural) and a tense (e.g., present tense) of the verb. Thus, a word is represented in a language through a collection of inflected forms, each used to denote meaning in a particular grammatical context. Some languages are more highly inflected than others. In English, for example, word order (i.e., syntax) is a dominant factor in determining meaning and, correspondingly, English words appear in relatively few inflected forms. In comparison, languages such as German place less importance on word order and, correspondingly, rely more on word form. Words in those languages have many more inflections. Inflectional morphology is the study of the different ways in which word forms are constructed in each language.
Morphology employs terms to formally describe the systems of construction that people use to form words in a natural language. Each construction of a given word is called a form, and a form encountered in a passage of text is called a surface form. The entire set or family of forms for a given word is called a lemma. Each lemma is identified by one form from its family (such as the form commonly found in a dictionary) called a citation form (and sometimes called a canonical form).
The words in any language are formed according to an inflectional morphology, which is defined to be the alteration of a basic word form to express the grammatical distinctions of a word within its part of speech category. Alteration is generally defined to mean the concatenation of one or more affixes to a primary or base word form called a stem. A stem is a group of characters that roughly corresponds to the citation form of a lemma and evokes the primary meaning of the word. Affixes are a small, closed class of character strings that are added to a stem either as a prefix or as a suffix. When combined with a stem, affixes have a systematic grammatical effect that indicates the word form's part of speech category and inflectional characteristics. The indications that denote part of speech and specific inflectional characteristics are called the morpho-syntactic properties of a word.
When people speak or write in a natural language, they instantly form words for a given grammatical context, because they have learned (usually as children in grammar school) that most words inflect according to a number of grammatical construction and spelling rules. The rules for constructing word inflections are systematic and generally apply to a large number of words within a grammatical category. Commonly available grammar books tend to group the words that inflect by the same rules, such as noun declension and verb conjugation groups, often illustrating the rule with examples of a typical word, such as this exemplary description of the rule for French second conjugation verbs:
______________________________________
Rule: To write the present tense of French verbs whose infinitive form ends in "ir" (such as the verb reussir), take the stem (reuss) and add the following endings:
______________________________________
1st person singular    je reuss|is
2nd person singular    tu reuss|is
3rd person singular    il/elle reuss|it
1st person plural      nous reuss|issons
2nd person plural      vous reuss|issez
3rd person plural      ils/elles reuss|issent
______________________________________
See, e.g., Nebelad and Frederich, French Grammar, Monarch Press, (New York, 1971). A grammar book table, such as the one above, includes the construction of an intermediate stem form (reuss) from the infinitive form of the verb, a set of suffixes to add to this intermediate stem, and a set of grammatical features (morpho-syntactic properties) that correspond to the resulting word forms. A system of grammatical construction rules that applies to a number of words, such as the group of French second conjugation verbs, is called an inflectional paradigm. By associating a word with an inflectional paradigm, people can create inflected forms for that word using the paradigm's construction rules.
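The grammar book table above can be pictured as a small data structure: a rule for obtaining the stem, plus a mapping from morpho-syntactic properties to suffixes. The following Python sketch is illustrative only; the names SECOND_CONJUGATION_PRESENT and inflect, and the dictionary layout, are assumptions made for this example and do not describe any particular system.

```python
# Hypothetical representation of one inflectional paradigm (illustrative names).
SECOND_CONJUGATION_PRESENT = {
    "strip": "ir",  # infinitive ending removed to obtain the intermediate stem
    "endings": {
        ("1st", "singular"): "is",
        ("2nd", "singular"): "is",
        ("3rd", "singular"): "it",
        ("1st", "plural"): "issons",
        ("2nd", "plural"): "issez",
        ("3rd", "plural"): "issent",
    },
}


def inflect(citation_form, paradigm):
    """Map each set of morpho-syntactic properties to its surface form."""
    strip = paradigm["strip"]
    if not citation_form.endswith(strip):
        raise ValueError(f"{citation_form!r} does not fit this paradigm")
    stem = citation_form[: -len(strip)]  # e.g. "reussir" -> "reuss"
    return {props: stem + ending for props, ending in paradigm["endings"].items()}


forms = inflect("reussir", SECOND_CONJUGATION_PRESENT)
print(forms[("1st", "plural")])  # reussissons
```

Associating another verb of the same group (for example, finir) with this paradigm yields its six present tense forms without any additional rules.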
In addition to inflectional paradigms, word formation in any language is also governed by a number of orthographic rules. These spelling rules differ from the inflectional paradigms for grammatical construction, because their application is not based on any grammatical purpose. Grammatical construction rules, such as the paradigm to conjugate the present tense forms of the French verb reussir shown above, aid in forming inflections with distinct morpho-syntactic properties. Orthographic rules, on the other hand, are used to make a word conform to certain phonetic or other conventions, apart from grammatical considerations. Moreover, orthographic rules may apply across different grammatical paradigms.
For example, when adding an "s" to a word ending in "y" in the English language (e.g., "fly" or "army"), there is a spelling rule to follow: change the "y" to "i" and add "es" (e.g., "flies" or "armies"). That rule generally applies in both pluralizing nouns and conjugating verbs. In the French language, an orthographic rule governs the phonetic softening of the letter "c" as it appears in words, at times replacing the "c" with a c-cedilla "ç". If the "c" appears in a French word form and is followed by the letter "a", "o" or "u", a cedilla must be used, as in the noun form le garçon and the present tense first person plural verb form nous plaçons. In such cases, both the form construction rule of the inflectional paradigm and the orthographic rule govern the formation of the inflected form.
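The two spelling rules just described can be sketched as functions that are independent of any one paradigm. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, the "y" rule is restricted to words ending in a consonant followed by "y" (a refinement not spelled out above), and the cedilla rule assumes that a stem-final "c" is meant to stay soft, as in the verb form nous plaçons.

```python
VOWELS = "aeiou"


def add_s(word):
    """Attach "s", applying the y -> ies spelling rule where it fits.

    Assumption: the rule fires only after a consonant ("fly" -> "flies"),
    not after a vowel ("day" -> "days").
    """
    if len(word) > 1 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "ies"
    return word + "s"


def join_with_cedilla(stem, ending):
    """Join stem and ending, writing "ç" when a stem-final "c" meets a, o or u.

    Assumption: the stem-final "c" is soft and must remain soft.
    """
    if stem.endswith("c") and ending[:1] in ("a", "o", "u"):
        stem = stem[:-1] + "ç"
    return stem + ending


print(add_s("fly"), add_s("army"))        # flies armies
print(join_with_cedilla("plac", "ons"))   # plaçons
print(join_with_cedilla("plac", "ez"))    # placez
```

Because neither function refers to a part of speech, the same rule can serve noun pluralization and verb conjugation alike, which is the sense in which orthographic rules apply across paradigms.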
In any language, some word forms are exceptional and do not inflect according to any paradigm. People simply learn these irregular inflections, such as the present tense conjugation of the French verb etre:
______________________________________
1st person singular    je suis
2nd person singular    tu es
3rd person singular    il/elle est
1st person plural      nous sommes
2nd person plural      vous etes
3rd person plural      ils/elles sont
______________________________________
In English, everyone knows that nouns are pluralized by adding an "s", but there are many fish, many sheep and many other groups of words to which neither a grammatical paradigm nor an orthographic rule applies.
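One common way to accommodate such exceptions, shown here only as a hedged sketch, is to consult a table of memorized irregular forms before falling back on the regular rule. The names EXCEPTIONS, regular_plural and plural are hypothetical, and the regular rule is deliberately reduced to adding "s".

```python
# Hypothetical exception table: (citation form, property) -> irregular surface form.
EXCEPTIONS = {
    ("fish", "plural"): "fish",
    ("sheep", "plural"): "sheep",
}


def regular_plural(noun):
    """Simplified regular rule: add "s" (spelling rules are omitted here)."""
    return noun + "s"


def plural(noun):
    """Prefer a memorized exception; otherwise apply the regular rule."""
    return EXCEPTIONS.get((noun, "plural"), regular_plural(noun))


print(plural("sheep"))  # sheep
print(plural("room"))   # rooms
```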
People who are familiar with a natural language are generally skilled in using the inflectional paradigms and spelling rules (usually learned in grammar school). They are quick to recognize and form words within a grammatical context by manipulating inflectional paradigms and spelling rules and, when necessary, remembering the exception forms. For example, when a person comes upon a word in a book that she does not know, she must manipulate the word form found in the text, and determine a corresponding citation form, in order to look up the word in a dictionary. Even an unabridged dictionary for a given language does not list and define every inflection of every word. Rather, dictionaries generally provide a single entry for each word, identifying each entry with only the citation form. For example, the citation form for English verbs is the infinitive form; for English nouns it is the singular form. With each citation, the dictionary provides, in addition to the citation form, information on a word, such as its part of speech characteristics, a definition and some inflectional variants of the word. Thus, a dictionary is a source for all basic forms of a word, but only the basic forms. Morphologists define a lexicon to be a source of basic word forms, such as an unabridged dictionary. With a lexicon and a set of paradigms a person can inflect all of the possible word forms. People can inflect forms for even unfamiliar words, if they know that the unfamiliar word inflects according to a known paradigm.
Moreover, people easily remember and use the grammatical construction and orthographic rules, because the rules they have learned (usually in grammar school) capture the linguistic generalities of a natural language. Orthographic rules are not specific to inflectional paradigms, but rather apply across paradigm boundaries. In some cases, a given orthographic rule applies only to a subset of all the words that otherwise fit the context specified by the rule. In such cases, information must be remembered (or looked up in a dictionary) to indicate whether particular inflections adhere to the orthographic rule. Additionally, the concept of word formation using inflectional paradigms, based on taking a base form of a word (such as a citation form) and creating an intermediate stem form to which affixes are added, applies across all paradigms. Those rules for constructing forms can be reused, as a general matter, to generate forms within the same paradigm, and sometimes in different paradigms. Thus, by reducing the myriad inflected forms to a relatively small number of form construction techniques, applicable across different paradigms and orthographic rules, the linguistic generalities inherent in any natural language may be captured in a way that helps people master the language.
In the fields of information management and data processing, applications have been developed to perform morphological analysis on natural language text, attempting to replicate the functions that people perform every day. In many applications it is important and useful to manipulate word forms while maintaining data on their morpho-syntactic properties, as people do. For example, morphological analysis has been used in information retrieval systems, where the words of a text file are indexed by the citation form of a word or by some partial stem. In those systems, a text file search is performed by deriving a citation form or stem form of the search word and looking for entries under that citation form in the index. Manipulation of word forms is also important in on-line dictionary and thesaurus systems, where a word form is provided and a dictionary entry is retrieved by generating an associated citation form or other lemma indicator. In other systems, tracking the morpho-syntactic properties of a word is also important. For example, in database systems that permit queries in a natural language (where the meaning of the words in a query is important), morphological text analysis has been used to process the queries by taking each surface form found and generating a citation form or lemma identification for the word, together with the corresponding morpho-syntactic properties. Maintenance of morpho-syntactic information is also important in machine translation systems (in which text passages are translated from one natural language to another), where a surface form found in a passage in one language is translated into a form in another language having the same morpho-syntactic properties.
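The information retrieval use described above can be illustrated with a toy index keyed by citation form. This Python sketch makes several assumptions: the lookup table SURFACE_TO_CITATION stands in for a real morphological analyzer, tokenization is a plain whitespace split, and all names are hypothetical.

```python
from collections import defaultdict

# Stand-in for a morphological analyzer: surface form -> citation form.
SURFACE_TO_CITATION = {"am": "be", "is": "be", "are": "be", "flies": "fly"}


def citation_form(surface_form):
    """Return the citation form, or the lowercased word itself if unknown."""
    word = surface_form.lower()
    return SURFACE_TO_CITATION.get(word, word)


def build_index(documents):
    """Index each document id under the citation form of every word in it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            word = token.strip(".,;:!?")
            if word:
                index[citation_form(word)].add(doc_id)
    return index


def search(index, query_word):
    """Find documents containing any form that shares the query's citation form."""
    return sorted(index.get(citation_form(query_word), set()))


documents = {
    1: "I am tall.",
    2: "She is taller.",
    3: "We are the tallest people in the room.",
}
index = build_index(documents)
print(search(index, "is"))    # [1, 2, 3]
print(search(index, "room"))  # [3]
```

A query for "is" retrieves all three documents because "am", "is" and "are" are indexed under the same citation form.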
Generally, a computer-based system for morphological analysis of text begins with a lexicon (i.e., a set of base word forms) associated with a set of form construction rules. The hope is to build a system that does some of the things people do almost automatically: inflectional recognition and inflectional generation. Inflectional recognition is the process of taking a word form as it appears in text (i.e., a surface form), recognizing or deriving the morpho-syntactic properties of that form, and locating its family of word forms (i.e., its lemma). Inflectional generation is the process of generating a particular surface form with specific morpho-syntactic properties, given a citation form in a lexicon and a set of construction rules. Additionally, there has been an effort to design systems for morphological analysis in an elegant way. If possible, the design of a morphological system should support multiple natural languages.
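The two operations can be sketched over a single shared lexicon and rule set. The Python below is a deliberately naive illustration, not a description of any existing system: the names LEXICON, PARADIGMS, generate and recognize are hypothetical, and recognition is done by brute force, generating every form of every lexicon entry and comparing it with the surface form.

```python
# Hypothetical paradigm table and lexicon (citation form -> paradigm name).
PARADIGMS = {
    "fr-verb-ir": {
        "strip": "ir",
        "endings": {
            ("1st", "singular"): "is",
            ("2nd", "singular"): "is",
            ("3rd", "singular"): "it",
            ("1st", "plural"): "issons",
            ("2nd", "plural"): "issez",
            ("3rd", "plural"): "issent",
        },
    },
}
LEXICON = {"reussir": "fr-verb-ir", "finir": "fr-verb-ir"}


def generate(citation_form, props):
    """Inflectional generation: citation form + properties -> surface form."""
    paradigm = PARADIGMS[LEXICON[citation_form]]
    stem = citation_form[: -len(paradigm["strip"])]
    return stem + paradigm["endings"][props]


def recognize(surface_form):
    """Inflectional recognition: surface form -> list of (lemma, properties)."""
    analyses = []
    for citation_form, name in LEXICON.items():
        for props in PARADIGMS[name]["endings"]:
            if generate(citation_form, props) == surface_form:
                analyses.append((citation_form, props))
    return analyses


print(generate("reussir", ("2nd", "plural")))  # reussissez
print(recognize("finissent"))  # [('finir', ('3rd', 'plural'))]
print(recognize("reussis"))
# [('reussir', ('1st', 'singular')), ('reussir', ('2nd', 'singular'))]
```

The last call shows that recognition may return more than one analysis when different morpho-syntactic properties share a surface form.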
In addition, the programming or specification of a language's inflectional morphology should be easily accomplished. If possible, the specification should follow as closely as possible the easily understood system of taking grammatical rules and spelling rules (found commonly in grammar books) and applying them to basic forms found in a lexicon source (usually the dictionary). That way anyone familiar with a natural language, and not exclusively those versed in computer programming techniques, could specify a description of a language's inflectional morphology and build a system. Additionally, if the method for specifying the inflectional morphology and the lexicon of a natural language permitted a compact description, there would be no maintenance problem in adding new words to the lexicon and associating them with the appropriate grammatical and spelling rules. However, the currently available systems for morphological analysis present incomplete and inelegant solutions to the problem of providing morphological text analysis in information management and data processing systems.
In terms of functionality, some of the currently available systems cannot be used for both inflectional recognition and inflectional generation. In terms of system design, the currently available systems generally use inelegant, highly contrived linguistic formalisms to describe the inflectional morphology of a given natural language. These formalisms neither take advantage of the linguistic generalities that are inherent in the language, nor map well to the morphological specifications found in grammar books. Such systems can have serious practical drawbacks. For example, some of the currently available systems for morphological analysis incorporate a design that does not use paradigms. Where no paradigms are used, all linguistic generalizations are lost and updating the lexicon with additional words becomes a burdensome process. Other systems employ paradigms that consist of highly artificial construction rules. These systems also have serious practical drawbacks, because they do not incorporate features that take advantage of the linguistic generalizations inherent in the language.
As discussed above, people remember paradigms for grammatical construction separately from orthographic rules, which sometimes apply across paradigms. Moreover, because many paradigms differ in construction rules for only a few forms, paradigms can be arranged in hierarchies where the form construction rules of a "general" paradigm can be reapplied in other paradigms with certain exceptions. Thus, paradigms could be specified to inherit the form rules of other paradigms. Features such as the specification of orthographic rules separate from the paradigm, and paradigm inheritance of form construction rules, would enable the creation of a system that maps well to grammar books and permits easy specification and updating by anyone familiar with the language. However, the presently available systems have yet to provide an elegant solution. For example, even in those currently available systems where orthographic rules are separated from form construction rules, the user must either explicitly specify links to the orthographic rules or transform each lexical entry into some equivalent but highly artificial lexical representation in order to provide a link to an orthographic or spelling rule. Both are non-trivial burdens on the users who maintain the system and populate it with new words.
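Paradigm inheritance of the kind described above can be pictured as a child paradigm that states only the forms in which it differs from its parent. The Python sketch below is an assumption-laden illustration: the paradigm names, the "parent" field, and the resolve_endings function are all hypothetical, and the example is limited to English nouns whose plural is identical to the singular.

```python
# Hypothetical paradigm hierarchy: a child lists only its exceptions.
PARADIGMS = {
    "noun": {
        "parent": None,
        "endings": {"singular": "", "plural": "s", "possessive": "'s"},
    },
    "noun-invariant-plural": {
        "parent": "noun",
        "endings": {"plural": ""},  # the only form that differs from "noun"
    },
}


def resolve_endings(name):
    """Merge a paradigm's own endings over those inherited from its ancestors."""
    paradigm = PARADIGMS[name]
    parent = paradigm["parent"]
    inherited = resolve_endings(parent) if parent else {}
    return {**inherited, **paradigm["endings"]}


endings = resolve_endings("noun-invariant-plural")
print("sheep" + endings["plural"])      # sheep
print("sheep" + endings["possessive"])  # sheep's
```

Only the plural ending is restated in the child paradigm; the singular and possessive endings come from the general noun paradigm.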
Moreover, some of the formalistic design choices used in presently available systems prevent them from being used with multiple languages. In some systems, an artificial approach is taken for a specific language like English (which is not highly inflected) that is useless in creating a system for languages like Russian or German (which are both highly inflected). In other systems, the syntactic and grammatical features are hard-coded so that new code must be generated and adjustments must be made in the system for each new language.
Thus, there is a need for a system to perform morphological text analysis that provides the functionality of both inflectional recognition and inflectional generation and also incorporates design features that take advantage of the linguistic generalities found in many languages. The creation of such a system would be a great advance in the fields of information management and data processing.