Languages by which human beings communicate employ a set of symbols which make up an alphabet. Certain of these symbols can be modified by designated signs or marks, called diacritical marks, which are positioned as required by the rules of the language. A diacritical mark may, for example, require an altered pronunciation of the symbol, or base character, with which it is associated.
The Japanese phonetic alphabet Katakana is of this type, and this alphabet will be used herein to describe the present invention, by way of example and not by way of limitation.
Prior art character recognition equipment has recognized the need to provide special means to accommodate characters which may have an associated diacritical mark, and to recognize the difference between a character with a diacritical mark and one without. U.S. Pat. No. 3,710,321 provides such an arrangement. In the device of this patent, a central, horizontal row area contains the major characters or base symbols of the alphabet. Certain of these major characters may have diacritical marks associated therewith, in areas above or below the major character. When a vertical scan of a major character, and the recognition of this character, indicates that this character is of the class that may contain a diacritical mark, then vertical scanning of the next major character is momentarily interrupted, and the scan is diverted to the diacritical area above or below (as the case may be) the just-recognized character. Thereafter, scanning of the major characters continues, and special upper or lower diacritical recognition logic is enabled as such upper or lower diacritical areas are scanned.
Of the two Japanese alphabets, Hiragana and Katakana, the latter is the accepted means of representing the Japanese language to data processing equipment by means of character recognition apparatus. Each Katakana sound is a syllable formed by combining each of the consonants K, S, T, N, H, M, Y, R and W with each of the vowels A, I, U, E and O. The combinations YI, YE, WI, WU and WE are excluded. The A, I, U, E, O, and N sounds by themselves are also included. The pronunciation of certain of these basic symbols can be modified by adding a diacritical mark, i.e., either two small lines collectively called a nigori or a small circle called a maru, immediately adjacent the upper right of the basic symbol. In addition, a long vowel symbol, written as a dash, follows certain basic symbols to alter their pronunciation. The two diacritical marks, combined with certain of the basic symbols, together with the long vowel symbol, expand the overall Katakana alphabet to 72 characters. Those Katakana characters with a diacritical mark are often called sonants, and those without a diacritical mark are called non-sonants.
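The character counts given above can be checked with a short enumeration. The following Python sketch is illustrative only: it uses romanized syllable names rather than the Katakana shapes, and the function and variable names are hypothetical. The assignment of the nigori to the K, S, T and H rows and of the maru to the H row reflects conventional Katakana usage and is an assumption here, since the text above says only that "certain" basic symbols take a diacritical mark.

```python
# Romanized model of the Katakana symbol set described in the text.
VOWELS = ["A", "I", "U", "E", "O"]
CONSONANTS = ["K", "S", "T", "N", "H", "M", "Y", "R", "W"]
EXCLUDED = {"YI", "YE", "WI", "WU", "WE"}

# Assumption (not stated in the text): which consonant rows accept each mark.
NIGORI_ROWS = ["K", "S", "T", "H"]
MARU_ROWS = ["H"]


def base_symbols():
    """Return the basic (non-sonant) symbols: vowels, syllables, and N."""
    syllables = [c + v for c in CONSONANTS for v in VOWELS
                 if c + v not in EXCLUDED]
    return VOWELS + syllables + ["N"]


def sonant_symbols():
    """Return (syllable, mark) pairs for symbols carrying a diacritical mark."""
    nigori = [(c + v, "nigori") for c in NIGORI_ROWS for v in VOWELS]
    maru = [(c + v, "maru") for c in MARU_ROWS for v in VOWELS]
    return nigori + maru


def total_count():
    """Basic symbols plus sonants plus the single long vowel symbol."""
    return len(base_symbols()) + len(sonant_symbols()) + 1


if __name__ == "__main__":
    print(len(base_symbols()), len(sonant_symbols()), total_count())
```

Under these assumptions the enumeration yields 46 basic symbols and 25 sonants, which together with the long vowel symbol give the 72 characters referred to above.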
This large symbol set makes manual keying a difficult, slow and costly means of data entry. Several prior art optical character recognition (OCR) machines have been developed that automatically read handwritten Katakana symbols. However, because of the complexity of the sonant characters, and the close location of the diacritical mark to the base symbol, these machines require that the diacritical mark be written as a separate mark, in its own character space, clearly separated from the basic character which it modifies. As a result, only the 46 basic non-sonant character shapes plus the two separated and isolated diacritical marks are machine readable, and the 25 sonant characters written in their natural form, with the mark adjacent to the upper right of the base symbol, cannot be read. Thus, these OCR machines require a special set of writing rules that differs from the usual rules of the Katakana language.
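By way of illustration only, the consequence of this writing rule can be sketched in Python. The sketch assumes a hypothetical recognizer that emits one symbol per character space, so that a diacritical mark arrives as its own symbol and must afterwards be associated with the base symbol that precedes it. Modern Unicode code points and the standard `unicodedata` module are used purely as a convenient notation; they are not part of the prior art machines described above, and the function name is hypothetical.

```python
import unicodedata

# Standalone marks as read from their own character space, mapped to the
# combining forms that Unicode composes with a preceding base symbol.
# U+309B / U+309C: spacing nigori / maru; U+3099 / U+309A: combining forms.
SPACING_TO_COMBINING = {"\u309b": "\u3099", "\u309c": "\u309a"}


def merge_separated_marks(symbols):
    """Attach each isolated diacritical mark to the preceding base symbol.

    `symbols` is a list of single-character strings, one per character
    space. A mark that cannot be composed with the symbol before it is
    left in place unchanged.
    """
    merged = []
    for sym in symbols:
        combining = SPACING_TO_COMBINING.get(sym)
        if combining and merged:
            composed = unicodedata.normalize("NFC", merged[-1] + combining)
            if len(composed) == 1:  # a precomposed sonant exists
                merged[-1] = composed
                continue
        merged.append(sym)
    return merged


if __name__ == "__main__":
    # KA, nigori, HA, maru, each in its own character space
    print(merge_separated_marks(["\u30ab", "\u309b", "\u30cf", "\u309c"]))
```

The sketch makes the cost of the prior art rule concrete: each sonant occupies two character spaces on the form, and the association between mark and base symbol has to be reconstructed after recognition.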