When the spelling of compound words has been automatically verified by word processing systems, one of two techniques has been utilized. With one prior art technique, all compound words that the system is capable of verifying as correctly spelled are stored in a dictionary data base. The word to be verified (sometimes referred to as the input word) is compared against all words stored in the dictionary for a possible match. Utilizing the second prior art technique, the input word is parsed, or separated, into its constituent words. The constituent words are then used as input words to be compared against the words stored in the dictionary data base.
One obvious limitation of the first prior art technique is the memory or storage space required to hold a dictionary data base large enough to include all foreseeable compounds. In many languages, particularly the Germanic languages, compounding is an active way of creating new words; therefore, storing all meaningful compounds in a dictionary data base is, quite simply, impossible.
Accordingly, the second prior art technique is the only practical way to verify the widest possible range of compound words. However, the approach of parsing compounds into their constituent parts and verifying those parts has had several significant limitations in operation.
One example of this parsing technology is found in the IBM Displaywriter TextPack 4 program, which runs on the IBM Displaywriter System. In this spelling verification system, certain letter pairs are known to occur most frequently at the "joint" between compound constituents, and these letter pairs are used as clues when scanning a word for possible break points. For example, many English words end in the letter "t" and many words begin with the letter "s". Thus, the pair "ts" is a good candidate for a break point when parsing English compound words. The word in question is scanned and broken at each possible break point found in the word. Each resultant piece of the word is then compared to the words in the dictionary data base. Therefore, in attempting to verify the word "hotspot" as a correctly spelled word, the parser would find the "ts" break point, break the word into "hot" and "spot", and would then find both of these parts in the dictionary. The word would then be judged correctly spelled, and on this basis could also be hyphenated between the constituents, e.g., "hot-spot".
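The break point approach described above can be illustrated with a minimal sketch. It is not the TextPack 4 implementation; the dictionary contents, the break pair list, and the function names are hypothetical and chosen only to reproduce the "hotspot" example.

```python
# Hypothetical toy data: a tiny dictionary and a short list of letter pairs
# assumed to be frequent at the joint between compound constituents.
DICTIONARY = {"hot", "spot"}
BREAK_PAIRS = {"ts"}


def candidate_splits(word, break_pairs=BREAK_PAIRS):
    """Yield (left, right) splits wherever the two letters that straddle
    the joint form a known break pair."""
    for i in range(1, len(word)):
        if word[i - 1:i + 1] in break_pairs:
            yield word[:i], word[i:]


def verify_compound(word, dictionary=DICTIONARY, break_pairs=BREAK_PAIRS):
    """Return the constituents if the word verifies, otherwise None."""
    if word in dictionary:
        return (word,)
    for left, right in candidate_splits(word, break_pairs):
        # Both pieces must be found in the dictionary.
        if left in dictionary and right in dictionary:
            return (left, right)
    return None


parts = verify_compound("hotspot")
if parts is not None:
    # The joint found by the parser doubles as a hyphenation point.
    print("-".join(parts))
```

In this sketch a word with no valid split, such as the misspelling "hotspat", is rejected only after every candidate break point has been tried.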
The problems of the approach described immediately above lie in the fact that likely break points are also common letter pairs at places other than the joints between compound constituents. This fact causes a number of serious flaws in the operation of such a method. In terms of system performance, the parser must have a large number of break points in order to verify correct compounds, and any unrecognized word must be parsed before it can be marked as misspelled. Thus, the identification of incorrect words is slowed correspondingly, which degrades the performance of the system. Since compounding languages have longer average word lengths than non-compounding languages, the time and effort wasted in trying all of the "possible" combinations (according to an extensive break point list) can be considerable. For example, a comparable compounding process in the English language might produce a word like "compoundwordspellingverification". The number of operations required to break a compound word of this length at all possible break points, look up the resultant constituents, and possibly apply another level of parsing to one of the pieces is clearly quite large.
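To give a rough sense of this cost, the following sketch counts the candidate joints that a break pair list produces in a word, and the resulting upper bound on the number of ways the word could be segmented if every candidate joint may independently be cut or left alone. The break pair list here is a hypothetical assumption; a real list would be far more extensive.

```python
# Hypothetical, deliberately short break pair list for illustration only.
BREAK_PAIRS = {"ts", "ds", "dw", "gv", "ng", "nd", "er", "in", "on"}

LONG_WORD = "compoundwordspellingverification"


def joint_positions(word, break_pairs):
    """Return every index i at which the letters word[i-1] and word[i]
    form a listed break pair, i.e. every candidate joint."""
    return [i for i in range(1, len(word)) if word[i - 1:i + 1] in break_pairs]


def max_segmentations(word, break_pairs):
    """Upper bound on distinct segmentations: each of the k candidate
    joints is either cut or not, giving 2**k combinations."""
    return 2 ** len(joint_positions(word, break_pairs))


joints = joint_positions(LONG_WORD, BREAK_PAIRS)
print(len(joints), "candidate joints,",
      max_segmentations(LONG_WORD, BREAK_PAIRS), "possible segmentations")
```

Each segmentation also implies one dictionary lookup per piece, so the bound grows quickly as either the word or the break pair list gets longer.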
The break point parsing technique described above sometimes becomes "confused" when the first several letters of a compound themselves form another valid word. If the remainder of the compound cannot then be found and the parsing algorithm does not successfully recover, a perfectly good word may be marked as misspelled, even though its constituents are in the dictionary.
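The failure mode described above can be sketched as follows. The word "carpetbag", the toy dictionary, and both function names are hypothetical illustrations rather than examples taken from any actual system; for brevity the sketch tries every position instead of consulting a break pair list.

```python
# Hypothetical toy dictionary: "car" is a valid word that is also the
# beginning of the true first constituent "carpet".
DICTIONARY = {"car", "carpet", "bag"}


def parse_without_recovery(word, dictionary=DICTIONARY):
    """Commit to the first dictionary prefix found and never reconsider it."""
    for i in range(1, len(word)):
        if word[:i] in dictionary:
            rest = word[i:]
            # No recovery: if the remainder is unknown, give up at once.
            return (word[:i], rest) if rest in dictionary else None
    return None


def parse_with_recovery(word, dictionary=DICTIONARY):
    """Keep trying later split points when the remainder is not found."""
    for i in range(1, len(word)):
        if word[:i] in dictionary and word[i:] in dictionary:
            return (word[:i], word[i:])
    return None


print(parse_without_recovery("carpetbag"))  # None: flagged as misspelled
print(parse_with_recovery("carpetbag"))     # ('carpet', 'bag')
```

The first parser stops at "car", fails to find "petbag", and reports a misspelling even though "carpet" and "bag" are both in the dictionary.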
A further limitation of the break point parsing technique described above involves hyphenation errors. If the parser is contributing information to an automatic hyphenation program, correctly spelled words may be mishyphenated on the basis of break points when the wrong pair of words is identified as the constituents of a compound. For example, "snakeskin" might be incorrectly parsed as "snakes-kin", or "pantscuff" might be incorrectly parsed as "pant-scuff". This turns the word into apparent nonsense for a reader who tries to reconstruct it on the basis of its supposed hyphenation at the compound joint.
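The two examples above can be reproduced with a small sketch. The dictionary, the break pair list, and the function names are hypothetical. The sketch also suggests that simply changing the scanning direction does not help, since each direction mishyphenates one of the two words.

```python
# Hypothetical toy data covering the two examples from the text.
DICTIONARY = {"snake", "snakes", "skin", "kin",
              "pant", "pants", "cuff", "scuff"}
BREAK_PAIRS = {"es", "sk", "ts", "sc"}


def valid_splits(word, dictionary=DICTIONARY, break_pairs=BREAK_PAIRS):
    """Return, in left-to-right order, every split at a listed break pair
    whose two pieces are both dictionary words."""
    splits = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if (word[i - 1:i + 1] in break_pairs
                and left in dictionary and right in dictionary):
            splits.append((left, right))
    return splits


def hyphenate(word, leftmost=True):
    """Hyphenate at the leftmost (or rightmost) valid split, if any."""
    splits = valid_splits(word)
    if not splits:
        return word
    return "-".join(splits[0] if leftmost else splits[-1])


for w in ("snakeskin", "pantscuff"):
    print(w, hyphenate(w, leftmost=True), hyphenate(w, leftmost=False))
```

Taking the leftmost split yields "snake-skin" but "pant-scuff"; taking the rightmost yields "pants-cuff" but "snakes-kin". In both words every candidate piece is a legitimate dictionary entry, so the dictionary alone cannot tell the parser which joint is the intended one.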
Accordingly, although it is preferred to use a parsing technique to verify compounds for spelling verification, rather than attempting to store all meaningful compounds in a dictionary (which is clearly impossible), it would be of great benefit to have available a high performance parsing algorithm which minimizes errors and allows verification of compound words not stored in the dictionary.