1. Technical Field
The invention disclosed broadly relates to data processing and more particularly relates to linguistic applications in data processing.
2. Related Patent Applications
The following related patents and applications are incorporated herein by reference:
U.S. patent application by B. Knystautas, et al. entitled "Linguistic Analysis Method and Apparatus," Ser. No. 853,490, filed Apr. 18, 1986, abandoned, assigned to IBM Corporation.
U.S. Pat. No. 4,328,561 by D. B. Convis, et al. entitled "Alpha Content Match Prescan Method for Automatic Spelling Error Correction," assigned to the IBM Corporation.
U.S. Pat. No. 4,355,371 by D. B. Convis, et al. entitled "Instantaneous Alpha Content Prescan Method for Automatic Spelling Error Correction," assigned to the IBM Corporation.
3. Background Art
Text processing and word processing systems have been developed for both stand-alone applications and distributed processing applications. The terms text processing and word processing will be used interchangeably herein to refer to data processing systems primarily used for the creation, editing, communication, and/or printing of alphanumeric character strings composing written text. A particular distributed processing system for word processing is disclosed in the copending U.S. patent application Ser. No. 781,862 filed Sept. 30, 1985 entitled "Multilingual Processing for Screen Image Build and Command Decode in a Word Processor, with Full Command, Message and Help Support," by K. W. Borgendale, et al., now U.S. Pat. No. 4,731,735, assigned to IBM Corporation. The figures and specification of the Borgendale, et al. patent application are incorporated herein by reference, as an example of a host system within which the subject invention herein can be applied.
The morphological text analysis invention disclosed herein finds application in the parser for natural language text which is described in the copending U.S. patent application Ser. No. 924,670 filed Oct. 29, 1986, by Antonio Zamora, et al., assigned to IBM Corporation. The figures and specification of the Zamora, et al. patent application are incorporated herein by reference, as an example of a larger-scale language processing application within which the subject invention herein can be applied.
Glossary:
This description uses specialized linguistic terminology; the most common terms are defined here:
DESINENCE--An ending or affix that is added to a word and that specifies tense, mood, number, gender, or another linguistic attribute.
LEMMA--The standard form of a word which is used in a dictionary (generally the infinitive of a verb or the singular form of a noun).
MORPHOLOGY--The study of word formation in a language including inflections, derivations, and formation of compounds.
PARADIGM--A model that provides all the inflectional forms of a word (its conjugation or declension).
STEM--A portion of a word that does not change and to which affixes are added to form words. The stem itself is not necessarily a word.
Morphology is an important aspect of the study of languages which has many practical applications in data processing. Of particular relevance to this invention is the study of the alteration of words by the use of regular changes. For many centuries the study of Romance languages (Spanish, French, Italian, etc.) and other Indo-European languages has placed emphasis on the "stem" and "desinence" of verbs. The "stem" is the portion of the verb that remains invariant during conjugation, whereas the "desinence" is the portion that changes according to some paradigm or pattern. In English, a verb has relatively few desinences ("-ed," "-s," and "-ing" for the past tense, the third person, and the present participle, respectively), but in other European languages there may be 50 or more verb forms. Although classification schemes have been devised for natural languages, most are incomplete and not systematic enough for use in computer-based systems for a variety of languages.
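The relationship among stem, desinence, lemma, and paradigm can be illustrated with a minimal sketch. The class name `Paradigm`, its fields, and the two sample paradigms are hypothetical names chosen for illustration only; they are not taken from the invention or from any referenced application. The Spanish example covers only the present indicative of a regular "-ar" verb, not the full set of verb forms.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Paradigm:
    """Hypothetical model: a named pattern of desinences for one word class."""
    name: str
    lemma_ending: str            # ending that turns the stem into the lemma
    desinences: Tuple[str, ...]  # endings that turn the stem into inflected forms

    def lemma(self, stem: str) -> str:
        # The lemma is the dictionary form, e.g. the infinitive of a verb.
        return stem + self.lemma_ending

    def forms(self, stem: str) -> List[str]:
        # The stem stays invariant; only the desinence changes.
        return [stem + d for d in self.desinences]


# English regular verb: base form plus the "-s", "-ed", "-ing" desinences.
ENGLISH_REGULAR_VERB = Paradigm("en-regular-verb", "", ("", "s", "ed", "ing"))

# Spanish regular "-ar" verb, present indicative only (assumed subset).
SPANISH_AR_PRESENT = Paradigm(
    "es-ar-present", "ar", ("o", "as", "a", "amos", "áis", "an")
)

print(ENGLISH_REGULAR_VERB.forms("walk"))
print(SPANISH_AR_PRESENT.lemma("habl"), SPANISH_AR_PRESENT.forms("habl"))
```

Note that the Spanish stem "habl" is not itself a word, consistent with the glossary definition of STEM.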
A prior art technique associates ending table references with the entries of an existing dictionary and is able to identify the grammatical form of an input text word and its invariant root form. However, the technique does not provide a mechanism for using the ending tables as part of the dictionary structure; the resulting redundancy of data prevents a compact representation. Another consequence of the organization of the data is that an additional access is required to find the ending table associated with each dictionary word.
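The organization described for this prior art technique can be pictured with the following sketch, which is an assumption-laden illustration and not a reproduction of the technique itself. The names `DICTIONARY`, `ENDING_TABLES`, `analyze`, and the table identifier "T1" are hypothetical. The point shown is that the ending tables live outside the dictionary, so each successful lookup costs one access to the dictionary and a further access to the ending table.

```python
from typing import Dict, Optional, Tuple

# Ending tables kept separately from the dictionary (hypothetical contents).
ENDING_TABLES: Dict[str, Dict[str, str]] = {
    "T1": {
        "": "base form",
        "s": "third person",
        "ed": "past tense",
        "ing": "present participle",
    },
}

# Each dictionary entry carries only a reference to its ending table.
DICTIONARY: Dict[str, str] = {"walk": "T1", "talk": "T1"}


def analyze(word: str) -> Optional[Tuple[str, str]]:
    """Return (root, grammatical form) for a word, or None if unrecognized."""
    # Try the longest candidate root first, then progressively shorter ones.
    for cut in range(len(word), 0, -1):
        root, ending = word[:cut], word[cut:]
        table_id = DICTIONARY.get(root)   # access 1: the dictionary entry
        if table_id is None:
            continue
        table = ENDING_TABLES[table_id]   # access 2: the separate ending table
        form = table.get(ending)
        if form is not None:
            return root, form
    return None


print(analyze("walked"))
print(analyze("talking"))
```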
A second prior art technique, U.S. Pat. No. 4,342,085 entitled "Stem Processing for Data Reduction in a Dictionary Storage File," assigned to IBM Corporation, provides a compact dictionary representation by encoding common prefixes and suffixes, but the encoding mechanism is only a compacting scheme and does not provide any linguistic information.
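The distinction between compaction and linguistic analysis can be made concrete with a small sketch. This is not the encoding of the cited patent, whose details are not given here; the table `SUFFIX_CODES` and the functions `compress` and `expand` are hypothetical and merely replace a common suffix with a one-character code. Because the substitution is purely textual, it shortens "king" in the same way as "walking," even though the "ing" in "king" is not a desinence.

```python
from typing import Dict

# Hypothetical one-character codes for common suffixes (illustration only).
SUFFIX_CODES: Dict[str, str] = {"ing": "\x01", "ed": "\x02", "s": "\x03"}
SUFFIX_DECODE: Dict[str, str] = {code: s for s, code in SUFFIX_CODES.items()}


def compress(word: str) -> str:
    """Replace the longest matching common suffix with its code."""
    for suffix in sorted(SUFFIX_CODES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + SUFFIX_CODES[suffix]
    return word


def expand(token: str) -> str:
    """Restore the original word from its compressed form."""
    if token and token[-1] in SUFFIX_DECODE:
        return token[:-1] + SUFFIX_DECODE[token[-1]]
    return token


for w in ("walking", "king", "walked", "red"):
    c = compress(w)
    # The code saves space but says nothing about tense, number, or lemma.
    print(w, len(w), len(c), expand(c) == w)
```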