1. Technical Field
The invention disclosed broadly relates to data processing and more particularly relates to linguistic applications in data processing.
2. Background Art
Text processing and word processing systems have been developed for both stand-alone applications and distributed processing applications. The terms text processing and word processing will be used interchangeably herein to refer to data processing systems primarily used for the creation, editing, communication, and/or printing of alphanumeric character strings composing written text. A particular distributed processing system for word processing is disclosed in the copending U.S. patent application Ser. No. 781,862 filed Sept. 30, 1985 entitled "Multilingual Processing for Screen Image Build and Command Decode in a Word Processor, with Full Command, Message and Help Support," by K. W. Borgendale, et al., now U.S. Pat. No. 4,731,735. The figures and specification of the Borgendale, et al. patent application are incorporated herein by reference, as an example of a host system within which the subject invention herein can be applied.
Three techniques for verifying the spelling of compound words have been used by word processing systems. The first prior art technique stores in a dictionary data base every compound word that the system is able to verify. Verification consists of checking the dictionary for a match. An obvious limitation of this technique is the enormous amount of storage required to obtain even passable coverage. Comprehensive coverage is impossible, particularly in the Germanic languages, because word compounding is used so extensively that a dictionary of all meaningful compounds cannot be constructed.
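By way of illustration only, the first technique can be sketched in Python as a simple set membership test. The dictionary contents and the function name verify_by_lookup are hypothetical and are not taken from any cited system; the sketch merely shows why a compound that was never stored fails to verify even when its components are valid words.

```python
# Hypothetical stored list of whole compound words (illustrative only).
COMPOUND_DICTIONARY = {"overtime", "snakeskin", "bookkeeper"}


def verify_by_lookup(word, dictionary=COMPOUND_DICTIONARY):
    """Verify a word only if the whole compound is stored in the dictionary."""
    return word.lower() in dictionary


# A stored compound verifies.
stored_ok = verify_by_lookup("overtime")
# A valid compound that was never stored is rejected, which is the
# coverage limitation of this technique.
unstored_ok = verify_by_lookup("timekeeper")
```

In this sketch, coverage grows only by adding each new compound to the stored set, which is the storage burden noted above.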
A second prior art technique, described in the copending patent application Ser. No. 664,184, filed Oct. 24, 1984, now U.S. Pat. No. 4,672,571, and assigned to the IBM Corporation, consists of parsing, or separating, the constituent words of the compound and then checking them against the words stored in the dictionary data base. Parsing is the only practical way of obtaining adequate verification of compound words, but the approach is prone to problems such as false coordination of components and imprecise determination of the "joints" between the word components. In this technique, certain letter pairs that have a high probability of being the "joints" between components are used as clues for breaking the word, and the resulting parts are then verified against the dictionary. In terms of system performance, since any unrecognized word must be parsed before it can be marked as misspelled, and since the parser must have a large number of break points in order to verify correct compounds, the identification of incorrect words is slowed and the performance of the system is degraded. Also, since languages that use compound words have longer average word lengths than non-compounding languages, the computer time wasted in trying all the combinations allowed by the list of "joint" letter pairs can be considerable. As mentioned above, this second technique also suffers from false coordination errors; that is, a misspelled word consisting of two correctly spelled components will be considered correct. For example, if the word "overtime" is misspelled as "evertime," the word would be considered correct by this technique, since "ever" and "time" are both correctly spelled components. Similarly, run-on words such as "thatis" will be verified as "correct" compounds. Finally, ambiguities in identifying the components of a compound can lead to incorrect hyphenation. For example, "snakeskin" may be interpreted as "snakes-kin."
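The behavior of the second technique can be illustrated with the following minimal Python sketch. The word list WORDS, the "joint" letter pairs in JOINT_PAIRS, and the function names are assumptions chosen for the example and are not taken from the cited patent. The sketch breaks a word only where the letter on each side of the break forms a listed pair, and accepts any break whose parts are all dictionary words.

```python
# Hypothetical component dictionary and "joint" letter pairs (illustrative only).
WORDS = {"over", "ever", "time", "that", "is", "snake", "snakes", "skin", "kin"}
JOINT_PAIRS = {"rt", "ti", "es", "sk"}


def candidate_breaks(word):
    """Positions i where the last letter of word[:i] and the first letter
    of word[i:] form a listed joint pair."""
    return [i for i in range(1, len(word)) if word[i - 1:i + 1] in JOINT_PAIRS]


def parse(word):
    """Return every decomposition of word into two or more dictionary words,
    breaking only at candidate joints."""
    results = []
    for i in candidate_breaks(word):
        head, tail = word[:i], word[i:]
        if head not in WORDS:
            continue
        if tail in WORDS:
            results.append([head, tail])
        # Try to break the remainder further for compounds of three or more parts.
        for rest in parse(tail):
            results.append([head] + rest)
    return results


def verify_by_parsing(word):
    """Accept a word if it is in the dictionary or has any valid decomposition."""
    return word in WORDS or bool(parse(word))
```

Under these assumptions, verify_by_parsing("evertime") and verify_by_parsing("thatis") both return True, which reproduces the false coordination errors described above, and parse("snakeskin") yields both "snake" + "skin" and "snakes" + "kin", which is the ambiguity that can lead to incorrect hyphenation.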
A third prior art technique, described in the copending patent application Ser. No. 664,183, filed Oct. 24, 1984, now U.S. Pat. No. 4,701,851, and assigned to the IBM Corporation, consists of parsing the components of a compound word and checking a dictionary for compound flags associated with each component to see whether the components are associated in a permissible sequence. This prevents words that can serve only as prefixes, such as "pseudo," from verifying either in isolation or in a position other than the beginning of the word. Six compound flags are used to categorize word components. The six types of flags are: (1) the word is uncompoundable; (2) the word can be used alone or at the front or middle of a compound; (3) the word can be used alone or in any position of a compound; (4) the word can be used alone or at the back of a compound; (5) the word can be used only at the front or middle of a compound; and (6) the word can be used only at the back of a compound. While this technique is better than either of the first two techniques, the approach is limited because the six flags are insufficient to describe all situations. False coordination can occur as in the second technique, and some words will fail to verify because the compound flags assigned to the component words in the dictionary inadequately describe their function when used in a compound. A further deficiency of the third technique, and one that wastes computer time, is that all the possible components and their compound flags are isolated in an initial stage, and the compound flags are examined only in a second stage. The inefficiency results from the permutations of components that have to be considered even when some of the components have compound flags that will eventually result in an invalid combination.
The mechanism used in this technique also makes it impossible to account for letter elisions used during compounding, as in the German word "schiffahrt," which has to be decomposed into "schiff" and "fahrt" in order to verify properly.
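The two-stage character of the third technique can be sketched in Python as follows. This is an illustrative reading of the description above, not the implementation of the cited patent; the flag assignments in FLAGS, the position names, and the function names are all hypothetical. Stage one isolates every decomposition into dictionary words, and only afterward does stage two examine the compound flags.

```python
# Positions permitted by each of the six compound flags described above.
ALLOWED = {
    1: {"alone"},                             # uncompoundable
    2: {"alone", "front", "middle"},          # alone, front, or middle
    3: {"alone", "front", "middle", "back"},  # alone or any position
    4: {"alone", "back"},                     # alone or back
    5: {"front", "middle"},                   # front or middle only
    6: {"back"},                              # back only
}

# Hypothetical dictionary entries with assumed flag assignments.
FLAGS = {"the": 1, "schiff": 2, "science": 3, "fahrt": 4, "pseudo": 5}


def segmentations(word):
    """Stage one: every way to split word into dictionary components."""
    if not word:
        return [[]]
    found = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in FLAGS:
            for rest in segmentations(word[i:]):
                found.append([head] + rest)
    return found


def position_of(index, count):
    """Name the position of component number index among count components."""
    if count == 1:
        return "alone"
    if index == 0:
        return "front"
    return "back" if index == count - 1 else "middle"


def flags_permit(components):
    """Stage two: check each component's flag against its position."""
    n = len(components)
    return all(position_of(i, n) in ALLOWED[FLAGS[c]]
               for i, c in enumerate(components))


def verify(word):
    return any(flags_permit(parts) for parts in segmentations(word))
```

With these assumed entries, verify("pseudoscience") returns True, while verify("pseudo") and verify("sciencepseudo") return False. The call verify("schiffahrt") also returns False, because removing "schiff" leaves "ahrt" rather than "fahrt," so no decomposition is found for the elided form. Note that segmentations enumerates all decompositions before any flag is consulted, which corresponds to the wasted work identified above.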
The shortcomings of the prior art techniques have made it necessary to develop a more efficient algorithm and a comprehensive set of compound codes to handle agglutinative languages in computerized applications adequately.