Some written languages, such as Chinese, Japanese and Korean, use individual symbols (that is, ideographs, or ideographic characters) to represent individual words or concepts. Such languages do not traditionally possess an alphabet from which words and phrases are constructed. Although a language may have 30,000 or more ideographs, generally fewer are required for everyday use. However, even for everyday use, the number of ideographs with which a person must be familiar is relatively high. For example, in Chinese, a written vocabulary of 10,000 ideographs would generally be sufficient for most uses. Some official vocabularies define about 7,500 words, which would provide reasonable proficiency in written communication.
Such large numbers of different ideographs result in difficulties in written communication. A writer must have a high degree of literacy. He must have most standard ideographs memorized, or be constantly checking an ideographic dictionary. A publisher must have a substantial library of different ideographic characters. Physically, it may be difficult to store a large number of ideographic printing elements. Once in storage, it may also be difficult to relocate a particular printing element for use. Typewriters may have hundreds of keys, resulting in slow, inconvenient typing.
Similarly, in telecommunications or in electronic data processing systems, conventionally each ideographic character is assigned a unique, but arbitrary, code. In order to input text into such a system, an operator must memorize thousands of codes, or be constantly checking codes from a table. The former case requires a highly trained operator. The latter case requires large expenditures of time. In addition, tables, lists or dictionaries of ideographic characters are conventionally arranged according to phonetic rules. Accordingly, even in the latter case the operator must have some phonetic training in the language in order to be able to locate the proper codes from a phonetically arranged table.
In order to improve the use of ideographic characters, particularly in electronic applications, various methods have been proposed for the systematic encoding of ideographic characters. See, for example, U.S. Pat. Nos. 3,665,450; 4,173,753; 4,327,421; 4,379,288; 4,462,703; 4,490,789; and 4,505,602. The specific application of such methods may vary depending on the specific language. However, in general, the principles for such methods are similar.
Generally, the ideographic characters of the language are analyzed into their component strokes. Such strokes are compared to a pre-determined set of idealized basic stroke elements. Each basic stroke element may correspond to one or more actual component strokes. Each stroke in the character is identified with one of the idealized basic stroke elements. For example, U.S. Pat. No. 4,173,753 teaches that most Chinese characters may be reasonably encoded by considering only six basic stroke elements--a horizontal stroke, a vertical stroke, a dot, a dash (oblique) stroke, a clockwise angular stroke and a counter-clockwise angular stroke. As shown in various examples therein, different ideographic characters may be encoded, and even approximated, by considering the same six basic stroke elements. U.S. Pat. No. 4,173,753 teaches a relatively simple scheme for encoding. Others, such as U.S. Pat. No. 3,665,450, may teach a more complex method.
In conventional coding methods (as exemplified by U.S. Pat. No. 4,173,753), the encoding of an ideographic character is accomplished by assigning a unique element code (for example, a number or letter) to each basic stroke element. The character code is a string of code components, each of which is an element code. The code components of a character code correspond to the component strokes of the ideographic character.
The code components for an ideographic character are ordered according to a pre-determined sequence or pre-determined rules to form the character code. The character code may then be used as desired, for example, as input to a telecommunications system or an electronic data processing system. Typically, the pre-determined sequence or rules correspond to those of traditional character writing. For example, in some languages, the component strokes of an ideographic character may be written from the upper left hand corner to the lower right hand corner. Accordingly, an encoding scheme for such language may typically follow the same convention. That is, the first code components in the character code would correspond to the component strokes of the upper left hand corner of the ideographic character. The last code components of the character code would correspond to the component strokes of the lower right hand corner of the ideographic character.
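The conventional scheme described above can be illustrated with a minimal sketch. It assumes the six basic stroke elements named above, and it assigns each a single-digit element code. The digit assignments, the names `StrokeElement` and `encode_character`, and the example stroke lists are hypothetical and are not taken from any of the cited patents. The sketch only shows the general idea that a character code is the string of element codes taken in writing order.

```python
from enum import Enum


class StrokeElement(Enum):
    """Six idealized basic stroke elements; the digit codes are assumed."""
    HORIZONTAL = "1"
    VERTICAL = "2"
    DOT = "3"
    DASH = "4"  # oblique stroke
    CLOCKWISE_ANGULAR = "5"
    COUNTERCLOCKWISE_ANGULAR = "6"


def encode_character(strokes):
    """Build a character code from strokes listed in writing order.

    Each stroke contributes one code component (its element code),
    so the code has one component per component stroke.
    """
    return "".join(stroke.value for stroke in strokes)


# A cross-shaped character, assuming the horizontal stroke is written first.
cross_strokes = [StrokeElement.HORIZONTAL, StrokeElement.VERTICAL]

# A character of three stacked horizontal strokes, written top to bottom.
three_bars = [StrokeElement.HORIZONTAL] * 3

cross_code = encode_character(cross_strokes)
bars_code = encode_character(three_bars)
```

Under these assumed codes, the cross-shaped character is encoded as "12" and the three-bar character as "111". The result depends entirely on the order in which the strokes are supplied, which is why the ordering rules matter.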
Conventional coding techniques pose numerous difficulties. First, a highly skilled and literate operator is required, because of the necessity of knowing traditional stroke writing order or rules. Second, different people from different places or with different educational backgrounds may be familiar with different stroke orders or rules. Accordingly, it may be difficult to define a standard stroke order or standard rules. Third, many strokes of an ideographic character are not necessarily or easily identifiable with one of the defined basic stroke elements. For example, one person may identify a stroke as a dot element, whereas a different person may identify it as a horizontal element.
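The second and third difficulties can be made concrete with a small sketch. It assumes a hypothetical table of element codes and a hypothetical code table that maps one character code to a placeholder label. None of the names or values come from the cited patents. Three operators encode the same cross-shaped character: one follows the order the table expects, one reverses the stroke order, and one classifies the short horizontal stroke as a dot.

```python
# Hypothetical element codes for the basic stroke elements.
ELEMENT_CODES = {
    "horizontal": "1",
    "vertical": "2",
    "dot": "3",
}


def encode(strokes):
    """Concatenate element codes in the order the operator supplies them."""
    return "".join(ELEMENT_CODES[name] for name in strokes)


# The system's code table, built with the stroke order it treats as standard.
CODE_TABLE = {encode(["horizontal", "vertical"]): "cross-shaped character"}


def lookup(strokes):
    """Return the character for the operator's strokes, or None if no match."""
    return CODE_TABLE.get(encode(strokes))


operator_a = ["horizontal", "vertical"]  # matches the table's convention
operator_b = ["vertical", "horizontal"]  # different stroke order
operator_c = ["dot", "vertical"]         # reads the short stroke as a dot

results = [lookup(operator_a), lookup(operator_b), lookup(operator_c)]
```

Only the first operator retrieves the intended character. The other two produce codes "21" and "32", which either match nothing or, in a fuller table, could match a different character entirely.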
Because of such difficulties, conventional encoding may result in coded characters and text which are significantly inaccurate. In the worst case, decoded output may be virtually incomprehensible. Even in the best case, there may be significant errors which may cause unreliability, inconvenience, embarrassment or poor decision making.