1. Technical Field
The present invention relates to symbol recognition and identification. More particularly, the present invention relates to identifying characters in ideographic alphabets.
2. Description of the Prior Art
Chinese, Japanese, and Korean scripts are based on ancient Chinese characters which make up an ideographic alphabet comprising more than 50,000 characters.
The characters of an ideographic alphabet are each composed of simpler, constituent parts known as radicals. Radicals are the building blocks of ideographic characters and combine in certain predetermined ways to form the characters of an ideographic alphabet. Under current practice, a set of 214 radicals is used in various combinations to produce the characters of the Chinese alphabet. Each radical, in turn, is made up of a series of specific and precisely defined strokes. There are currently about 40 individual stroke shapes in use which, because of variations in size, require the mastery of 82 strokes before practical writing skills for Chinese ideographs are obtained.
The sheer size of ideographic alphabets presents unique challenges for specifying and identifying individual characters, particularly for data entry and data processing. Various schemes have been proposed and descriptions can be found in the literature. See, for example, Y. Chu, Chinese/Kanji Text and Data Processing, IEEE Computer (January 1985); J. Becker, Typing Chinese, Japanese, and Korean, IEEE Computer (January 1985); R. Matsuda, Processing Information in Japanese, IEEE Computer (January 1985); R. Walters, Design of a Bitmapped Multilingual Workstation, IEEE Computer (February 1990); and J. Huang, The Input and Output of Chinese and Japanese Characters, IEEE Computer (January 1985).
These methods can be divided into five broad categories which are described below.
1. Direct keyboard input.
Direct keyboard input requires a large keyboard on which a user searches for each character or character group and presses one or more keys to generate a code corresponding to the desired character. These keyboard based systems are bulky, unwieldy, and difficult to expand. Additionally, such keyboards are not particularly intuitive. That is, one using a keyboard-based input system for an ideographic alphabet must possess a significant level of familiarity with the alphabet before being trained in the use of the keyboard. Because of the large number of keys and the minimum key spacing necessary, there is no efficient way to minimize hand and finger movement during data entry. As a result, excessive hand movement and time spent hunting for the desired character mean that data input rates achieved by even the most skilled users of such keyboards are only slightly better than those of one skilled in writing ideographic scripts.
2. Phonetic based input.
Phonetic based input may employ either a standard ASCII keyboard on which each of the keys is assigned a unique phonetic symbol value, or a phonetic code in which each phonetic value is assigned a two-character code generated by pressing two keys on a standard ASCII keyboard. There may be additional variations on this same basic concept.
There are various implementations of phonetic based input systems but the basic idea is to type a representation for the sound of a character on the keyboard rather than directly inputting it as in the Direct keyboard input method described in the previous section. For example, one such method for Chinese uses a keyboard consisting of 37 phonetic symbols, either directly mapped one symbol per key or through a two character sequence.
Another common method requires the user to specify the character's sound by typing the romanized equivalent of its pronunciation on a standard QWERTY style keyboard.
Because many ideographic characters can have the same sound, character entry in phonetic based systems requires a special module known as a front end processor. The front end processor takes as input the sound of the desired character, typed phonetically on the keyboard, and produces as output, a menu of possible characters having that sound. The user must select the desired character from the menu.
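The behavior of such a front end processor can be illustrated with a minimal sketch. It is not the implementation of any product named in this section; the table `READINGS` and the functions `candidates` and `select` are hypothetical names introduced only for illustration, and the table holds a handful of Japanese readings rather than a full dictionary.

```python
# Hypothetical reading-to-character table. A real front end processor would
# consult a full dictionary; these few entries are for illustration only.
READINGS = {
    "shi": ["四", "死", "市", "詩"],
    "ka": ["火", "花", "家"],
}


def candidates(reading):
    """Return the menu of characters that share the typed reading."""
    return list(READINGS.get(reading.lower(), []))


def select(reading, choice):
    """Return the character the user picks from the menu (1-based choice)."""
    menu = candidates(reading)
    if not 1 <= choice <= len(menu):
        raise ValueError("choice is outside the menu")
    return menu[choice - 1]


if __name__ == "__main__":
    print(candidates("shi"))  # the menu presented to the user
    print(select("ka", 2))    # the user's second menu entry
```

Even in this small sketch, every character requires two separate actions, typing the sound and then choosing from a menu, and a reading that is not in the table yields an empty menu.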
Phonetic based character entry and selection is slow and tedious. Furthermore, this method can only be used if the correct pronunciation of the character is already known.
Examples of front end processors for Japanese input include Wnn (developed by the University of Kyoto), Canna (developed by Software Research Associates), and Clare (developed by the Canon Corporation).
As mentioned, a characteristic of ideographic alphabets is that there is usually more than one character for a given pronunciation, and there may be regional variations in the pronunciation of a particular character. There are also very subtle and complex distinctions in language sounds that may not be accurately expressed in a predefined set of phonetic values. Different sets of phonetic symbols would be required to properly represent a particular dialect.
Another problem with these systems arises when the pronunciation of a character is not known but it is still necessary to input the character. For example, a translator may need to look up the meaning of an unknown character in an on-line dictionary or glossary without knowing the pronunciation. Phonetic based systems cannot be used in such cases and the translator must stop work and manually look up the character in a dictionary.
3. Attribute based input.
Attribute based input systems associate a unique attribute or set of attributes with each ideograph in the character set. There are many variations on this theme but in its simplest form a unique code is assigned to each character. To access a particular character, its unique code is typed on the keyboard and the character will appear on the display screen. Examples of standard character encoding schemes for ideographic characters include Japanese Industrial Standard ("JIS"), Shift-JIS, and EUC for Japanese, BIG5 for Chinese, and Unicode which encodes all ideographic alphabets.
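Code based entry of this kind can be sketched with the codecs in the Python standard library. The function names `char_from_code` and `code_for_char` are hypothetical, and the default of reading a typed code as a big-endian 16-bit Unicode value is an assumption made for the example, not a feature of any system described here.

```python
def char_from_code(hex_code, encoding="utf-16-be"):
    """Decode a typed hexadecimal code into the character it identifies.

    The default treats the code as a 16-bit Unicode value (an assumption
    made for this sketch); other encodings can be named explicitly.
    """
    return bytes.fromhex(hex_code).decode(encoding)


def code_for_char(char, encoding):
    """Return the hexadecimal code assigned to a character by an encoding."""
    return char.encode(encoding).hex().upper()


if __name__ == "__main__":
    print(char_from_code("65E5"))  # Unicode value 65E5 identifies 日
    # The same character carries a different code in each encoding scheme.
    for name in ("shift_jis", "euc_jp", "big5"):
        print(name, code_for_char("日", name))
```

The sketch makes the drawback concrete: the user must already know, or look up, an arbitrary code for every character, and that code differs from one encoding scheme to the next.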
A variation on this same theme is to choose attributes that are intuitively easier to remember than numeric codes. For example, the number of strokes in the character, the main radical, or the shape of the character could each be used to specify a character. Examples of products that use this look-up method are the Wizard Denshi Techou manufactured by Sharp of Japan, and the Casio Wordtank manufactured by Casio of Japan. These are both handheld Japanese character dictionaries that allow the user to specify, by menus, several kinds of attributes like those mentioned above. A further example of this approach is MacSunrise developed by Japan Media, a kanji learning tool which lets the user accomplish the same function by clicking on menus and icons with a mouse. Attribute based systems are cumbersome and difficult to use because they are not particularly intuitive. They require knowledge of the attribute itself, which could be difficult for code based systems, or they require an analysis of the character to be looked up followed by a specification of the appropriate attributes, two very different kinds of actions (right brain and left brain) that are not easily mastered.
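A look-up by stroke count and main radical can be sketched as a simple filter over a table of attributes. The `Entry` record, the four-character `TABLE`, and the `lookup` function are hypothetical and serve only as an illustration; they do not describe how any of the products named above work internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    char: str      # the ideograph itself
    strokes: int   # total number of strokes
    radical: str   # main radical under which the character is filed


# Hypothetical four-character table; a real dictionary holds thousands.
TABLE = [
    Entry("明", 8, "日"),
    Entry("時", 10, "日"),
    Entry("林", 8, "木"),
    Entry("校", 10, "木"),
]


def lookup(strokes=None, radical=None):
    """Return the characters matching every attribute that was specified."""
    return [
        e.char
        for e in TABLE
        if (strokes is None or e.strokes == strokes)
        and (radical is None or e.radical == radical)
    ]


if __name__ == "__main__":
    print(lookup(strokes=8))                 # stroke count alone is ambiguous
    print(lookup(strokes=10, radical="日"))  # two attributes narrow the menu
```

A single attribute rarely identifies a character uniquely, so the user must first analyze the character and then supply several attributes before the candidate list becomes short enough to scan.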
4. Radical or pattern based input.
One approach to ideographic character identification which divides characters into radicals or similar patterns is based on the Three-Corner Coding Method. This method sorts patterns of strokes into a logical system of 99 major and 201 minor symbols that may be represented in tabular form in a 10×10 square. Each symbol is assigned two numbers which are derived from the vertical axes in the table. The three-corner code for any symbol is determined by entering six digits, which correspond to three of the symbols appearing at three of the desired character's corners. This system has proven reliable for generating unique characters, but is slow and tedious. In operation, one must either memorize all of the six-digit codes, or one must hunt through the table and then enter the six-digit code.
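The way a six-digit code is assembled from three two-digit symbol numbers can be sketched as follows. This is only an illustration under stated assumptions: the sketch assumes that each symbol's two numbers are its row and column indices (0 through 9) in the table, and the function names `symbol_digits` and `three_corner_code` are hypothetical. The actual symbol table of the Three-Corner Coding Method is not reproduced.

```python
def symbol_digits(row, col):
    """Return the two-digit number of the symbol at a table position.

    Assumption for this sketch: the two numbers are the row and column
    indices, each in the range 0 through 9.
    """
    if not (0 <= row <= 9 and 0 <= col <= 9):
        raise ValueError("table position must lie within the 10-by-10 square")
    return f"{row}{col}"


def three_corner_code(corners):
    """Join the two-digit numbers of three corner symbols into six digits."""
    if len(corners) != 3:
        raise ValueError("exactly three corner symbols are required")
    return "".join(symbol_digits(row, col) for row, col in corners)


if __name__ == "__main__":
    # Hypothetical table positions of the symbols found at three corners.
    print(three_corner_code([(1, 2), (3, 4), (5, 6)]))
```

Composing the code is trivial once the three table positions are known; the burden on the user lies in recognizing the corner symbols and finding them in the table.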
5. Other specialized input methods.
Another method has been proposed in U.S. Pat. No. 4,829,583, Method and Apparatus for Processing Ideographic Characters, issued to Monroe et al., in which a specific sequence of strokes is entered into a 9×9 matrix, referred to as a training square. This sequence is matched to a set of possible corresponding ideographs. Because the matrix senses stroke starting point and stroke sequences based on the correct writing of the ideograph to be identified, this system cannot be used effectively until one has mastered the writing of the ideographic script.
In addition to the foregoing methods of generating and/or identifying ideographic characters, handwriting recognition systems have been proposed, but these systems require the user to be proficient in writing the ideographic characters and are sensitive to variations in individual writing styles. Optical character recognition systems have also been proposed, but the technology to accomplish optical character recognition for ideographic characters is still very primitive and prone to high error rates. Voice recognition systems have been proposed as well, but the technology is still very primitive and many years away from being practical; it also requires a user conversant in the language represented by the ideographic alphabet. See, for example, R. Matsuda, Processing Information in Japanese, IEEE Computer (January 1985).
So far, all known character encoding, identification, and recognition schemes for ideographic alphabets have all or most of the following flaws:
They are inefficient in terms of keystrokes per character;
They take considerable time and patience to learn;
They make data entry a slow, burdensome, conscious task;
They are limited to a specific alphabet and they are not easily updated nor are they readily exchanged for other alphabets;
They all require previous knowledge and competence with the language underlying the ideographic alphabet.
A simple, fast, easy to use system for generating, identifying, and recognizing characters in ideographic alphabets has been heretofore unknown. Yet such a system is needed to provide access to such alphabets for those at all skill levels with the language, written and/or spoken, underlying such ideographic alphabets.