The present invention relates in general to computer-implemented systems and techniques for inputting and encoding characters, and in particular to computer-implemented systems for inputting and encoding ideographic characters, such as those used in the Chinese, Korean, or Japanese languages.
The English language uses its 26-letter alphabet to construct several hundred thousand words in left-to-right linear groups of varying lengths and combinations of letters. Each letter is associated with a sound. Ideographic-based languages, such as Chinese, Korean, or Japanese, are constructed differently. To simplify the discussion, the Chinese language will be discussed in great detail herein, although it should be kept in mind that the discussions, techniques, and systems herein also apply to other ideographic-based languages, such as Korean or Japanese.
In Chinese, the most elementary unit in the language is said to be not an alphabet symbol but a “character,” a symbol that is usually equivalent to an English word. Chinese characters are composed of various combinations of strokes drawn from a set of more than 30 different pen strokes. These strokes are not, by themselves, associated with any sound (as are the symbols of the English alphabet), and when combined to form a character, the combination is pronounced differently according to dialect, even though it is universally read with the same meaning. While there are said to be perhaps as many as 50,000 characters, including ancient and very esoteric ones, most reasonably well-educated Chinese are familiar with no more than roughly 6,000 to 8,000 different characters. It has been estimated that 98% of written communication is done from a pool of only 3,000 different characters. This is not to say that most people use only a few thousand different “words”, however. These characters are also commonly used in combinations of two or more to create more complex words or phrases. For example, the character for ten followed by the character for moon forms a group of two characters which can mean October.
The major obstacle to creating a practical Chinese typewriter was that the strokes are not used in a linear fashion to construct characters, and their sizes, and in the case of some strokes even their proportions, vary greatly. Each character is constructed in an imaginary box of the same size as that of all other characters, regardless of how many strokes are needed. This means that the same stroke can be of various lengths or various proportions as it is squeezed or elongated to fit into the appropriate elements of various characters in the imaginary boxes. In order to construct a character, the needed strokes are all placed appropriately within the box: some must go in the middle, some left, some right, some on the bottom, some on the top, and some cut through the entire figure. In other words, there is no physical linearity in how the strokes are set down, as there is with English. However, there is a traditional time sequence and linearity to the strokes, which is taught in school and tends to be remarkably uniform from writer to writer.
With these differences, it is not surprising that a practical keyboard, which is, after all, a device built for linear, alphabetic languages, could not be successfully adapted to Chinese. With all the up-and-down and back-and-forth movements, as well as all the various sizes and forms needed for building a character from strokes or elements, either an enormous keyboard requiring thousands of keys or a very complicated keyboard would be needed to write with a one-character-per-key or even a one-element-per-key approach.
In the past, various attempts have been made to simplify the task of computer-implemented inputting and encoding of ideographic characters. One of the more common approaches today for Chinese and Japanese is the use of phonetic systems, which are time-consuming to learn and to operate and are burdened with the problems of subtle differences in pronunciation. The most prominent of these systems also require the use of an English keyboard and some familiarity with English pronunciation. There remains widespread dissatisfaction with these systems, however, and the search for a better solution continues.
Another approach employs computer-implemented handwriting recognition software. Using this approach, the entire inputted character is recognized using a statistics-based pattern recognition algorithm. This approach also has its limitations. Under this approach, the writer must write the character such that the difference between the character furnished to the recognition device and that employed as the reference in the computer memory is within a statistically acceptable range. As with most statistics-based systems, some writers will invariably write significantly differently from the majority and hence fall outside the acceptable portion of the bell curve of the statistical sample. For these writers, recognition accuracy will be poor. Even for a single writer, some written characters may be recognized well because they fall within the statistical envelope, while others will be recognized poorly because the writer writes these latter characters significantly differently from the majority of the statistical sample.
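The acceptance-range limitation described above can be illustrated with a minimal sketch. This is not any particular recognition product's algorithm; it assumes, purely for illustration, that each character is reduced to a small numeric feature vector and compared against stored references by Euclidean distance. The names `REFERENCES`, `ACCEPT_DISTANCE`, and `recognize`, and all feature values, are hypothetical.

```python
import math

# Hypothetical reference templates: each character maps to a feature vector.
# The features and their values are invented for illustration only.
REFERENCES = {
    "十": (0.5, 0.5, 2.0),
    "月": (0.4, 0.6, 4.0),
}

# Assumed acceptance range: input farther than this from every
# reference is rejected as unrecognized.
ACCEPT_DISTANCE = 0.5


def recognize(features):
    """Return the closest reference character, or None if out of range."""
    best_char, best_dist = None, math.inf
    for char, ref in REFERENCES.items():
        dist = math.dist(features, ref)  # Euclidean distance to the reference
        if dist < best_dist:
            best_char, best_dist = char, dist
    return best_char if best_dist <= ACCEPT_DISTANCE else None
```

Under these assumptions, a writer whose strokes yield features near a stored reference is recognized, while a writer whose habitual style lands far from every reference is rejected no matter how consistently that writer writes, which is the weakness noted above.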
Other exemplary approaches include those described by Robert O'Dell (U.S. Pat. No. 5,109,352, which also describes various other approaches in the background section) and Li (UK Patent GB2100899), as well as an earlier system created by Jiang Zheng (Character Indexes of Modern Chinese, by N. H. Leon, Scandinavian Institute of Asian Studies Monograph Series, No. 42, Curzon Press), all incorporated by reference herein.
The above-discussed inputting and recognition problems are exacerbated when inputting is done with a hand-held device, such as a telephone (wired or wireless) or a personal digital assistant (PDA) with an abbreviated keyboard or another form of data entry, such as a proprietary key arrangement or a touch-sensitive writing area or display. This is because these devices have severe limitations in the physical size of both the input and display subsystems, which limit the amount of space available for “writing” and for displaying the partial or full result. Irrespective of whether the computer-implemented device is desktop-based or hand-held, however, it is felt that additional improvements can be made to the speed and accuracy of ideographic character input and encoding, as well as to the ease with which a writer can rapidly achieve accurate entry of the intended character from as few written strokes or keyed entries as possible.