This application relates in general to systems for encoding languages, and in particular to a computer system for encoding a collection of characters used in languages.
The English language uses its 26-letter alphabet to construct several hundred thousand words in left-to-right linear groups of varying lengths and combinations of letters. Each letter is associated with a sound. Chinese could hardly be more different. The most elementary unit in the language is said to be not an alphabet symbol but a "character"--a symbol which is usually equivalent to an English word.
Chinese characters are composed of various combinations of over 30 different pen-strokes. These strokes are not, by themselves, associated with any sound (as are the symbols of the English alphabet), and when combined to form a character, the combination is pronounced differently according to the dialect--even though universally read with the same meaning. While there are said to be as many as perhaps 50,000 characters, including ancient and very esoteric ones, most reasonably well-educated Chinese are familiar with 6,000 to 8,000 different characters. It has been estimated that 98% of written communication is done from a pool of only 3,000 different characters. This is not to say that most people use only a few thousand different "words", however. These characters are also commonly used in combinations of two or more to create more complex words or phrases. For example, the character for ten followed by the character for moon forms a word which can mean October.
The major obstacle to creating a practical Chinese typewriter was that the strokes are not used in a linear fashion to construct characters, and their size and--in the case of some strokes--even their proportions vary greatly. Each character is constructed in an imaginary box of the same size as that of all other characters, regardless of how many strokes are needed. This means that the same stroke can be of various lengths or various proportions as it is squeezed or elongated to fit into appropriate elements in various characters in the imaginary boxes. In order to construct a character, the needed strokes are all placed appropriately within the box--some must go in the middle, some left, some right, some on the bottom, some on the top, and some cut through the entire figure. In other words, there is no physical linearity, as there is with English, in how the strokes are set down.
With these differences, it is not surprising that a practical typewriter, which is, after all, a device built for linear, alphabetic languages, could not be successfully adapted to Chinese. With all the up and down and back and forth movements, as well as all the various sizes and forms needed for building a character from strokes or elements, an enormous keyboard requiring thousands of keys would be needed to write with a one character or even a one element per key approach. To create a mechanical device with thousands of keys would be a formidable problem. Developments in computer technology have, however, greatly simplified this problem. It is no longer necessary to worry about the size, shape, or placement of individual strokes or elements or, alternatively, about creating a huge keyboard to accommodate thousands of characters. Matrix printers can print any symbol at all, including Chinese characters. Thousands of characters can be stored in computer memory. Once a character has been fetched from memory, it can be printed by a matrix printer. In other words, if a system for character retrieval is successfully designed for fetching characters from memory, it is no longer necessary to require a one-to-one correspondence between a keystroke and a mark on the printed page.
The issue remaining is how best to call out of electronic storage or memory the particular character to be printed or displayed. There have been many different approaches to solving this problem of efficient keyboard input. The most widely used input systems today for Chinese and Japanese are phonetic systems, which are time-consuming to learn and to operate and are burdened with the problems of subtle differences in pronunciation. The most prominent of these systems also require the use of an English keyboard and some familiarity with English pronunciation. Widespread dissatisfaction with these systems remains, and the search for a better solution continues.
One of the earliest successful approaches, and a non-phonetic one, was that of Wang Laboratories, which in 1979 made available an input system based on a three-corner analysis of 10,000 characters. Using the computer keyboard's number pad, the operator keyed in two digits for each of three corners of a given character; the digits varied for each corner according to its configuration. This gave each character a six-digit code, or data string, by which the proper character could be fetched from an electronic dictionary. There was some duplication in these data strings, affecting about 6.6% of the 10,000 characters; in these cases a choice then had to be made. Since the method required a great deal of memorization of corner shapes and their associated codes, the system is difficult to learn and tiring to apply.
The Wang system does illustrate an important difference between computer and typewriter: the computer does not require a one-to-one correspondence between a keystroke and a mark on the paper. Because of this, it is possible to use a string of symbols (a data string), such as a string of numbers, to represent the consecutive keystrokes of a character for fetching the character out of dictionary storage for printing or display.
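The retrieval idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary contents, the six-digit codes, and the names DICTIONARY, fetch_candidates, and needs_choice are hypothetical and are not Wang's actual corner codes.

```python
# Hypothetical electronic dictionary: data string -> candidate characters.
# The six-digit codes below are invented for illustration only.
DICTIONARY = {
    "112031": ["木"],
    "402210": ["明", "朋"],  # a duplicated data string: two candidates
}


def fetch_candidates(data_string):
    """Return every character stored under a data string (empty list if none)."""
    return DICTIONARY.get(data_string, [])


def needs_choice(data_string):
    """True when the data string is shared by more than one character."""
    return len(fetch_candidates(data_string)) > 1
```

When needs_choice returns True, the operator would have to pick among the candidates, which corresponds to the duplication cases mentioned for the Wang system.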
A different approach to the input problem, also using data strings, is that of Li, disclosed in UK Patent GB2100899, and that of an earlier system created by Jiang Zheng. In the case of Jiang, the immediate purpose was only the creation of a new method of organizing a Chinese dictionary, perhaps to be followed eventually by computer application. Jiang's system is described in pp. 379-387 of Character Indexes of Modern Chinese, by N. H. Leon, Scandinavian Institute of Asian Studies Monograph Series, No. 42, Curzon Press. Both Li and Jiang rely, at least in part, on numerically coding the strokes that make up a given character as the means for creating a data string that will uniquely identify the character. To do this, both group the 30 or so different possible strokes into a small number of stroke categories, each of which is given a single digit code. Perhaps because of the nature of the strokes, there is at least partial similarity in the categories chosen. There are differences as well. Grouping the strokes into categories will, of course, make a smaller keyboard, and one that is easier to learn. But the main goal should be categories that are clearly distinct from each other; secondarily, they should not create a great number of duplicate data strings. In general, too few categories increase duplications; too many create unnecessary and often confusing fine distinctions. Li has eight categories, represented on a keyboard by eight keys. Six of these categories are groups of single strokes (i.e., the pressing of the key for any such group will cause the numerical coding of one stroke to be entered), while two are combinations of strokes (i.e., the pressing of one of these keys causes two strokes to be entered; pressing the other produces a three stroke combination). These two combinations can also be constructed using the strokes in the six categories of single strokes.
Thus the operator must take care not to build these two combinations from the keys for single strokes, but to use the two appropriate combination keys instead. Diagonal strokes are divided into two categories: left-falling and right-falling. Dots are included in the diagonals that fall to the right. Strokes with only one corner are divided into clockwise and counterclockwise; the categorization of strokes with more than one corner is unclear.
Jiang has six categories of single strokes, and, unlike Li, has no categories for combinations of strokes. His categories differ from Li's in two major respects: he separates dots from right-falling diagonals (they are in a category of their own) and he lumps "turning strokes", or strokes with corners, into a single category.
Jiang's system, the earlier of the two, creates its data strings by applying the system's stroke categories to a character's strokes as they are laid down in the usual writing order. If the first stroke to be written is in category 2, the first number in the string is 2; if the next is in category 6, the next number in the string is 6 (the string becomes 26), and so on up to a maximum of six strokes per character (plus the first and last strokes as tiebreakers if more than one character has that data string). The data string begins with the first stroke and ends with the sixth for characters having up to ten strokes, but the string begins with the sixth and ends with the eleventh for characters having more than ten strokes.
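The selection rule just described can be illustrated with a short Python sketch. It assumes the strokes have already been assigned single-digit category codes in writing order; the function names jiang_data_string and jiang_tiebreaker and the example codes are hypothetical, and the sketch is not taken from Jiang's own description.

```python
def jiang_data_string(stroke_categories, max_len=6):
    """Build the main data string from category codes given in writing order.

    Characters with up to ten strokes use strokes 1 through 6;
    characters with more than ten strokes use strokes 6 through 11.
    """
    if len(stroke_categories) <= 10:
        selected = stroke_categories[:max_len]        # strokes 1..6
    else:
        selected = stroke_categories[5:5 + max_len]   # strokes 6..11
    return "".join(str(code) for code in selected)


def jiang_tiebreaker(stroke_categories):
    """Codes of the first and last strokes, used only when strings collide.

    Assumes the character has at least one stroke.
    """
    return f"{stroke_categories[0]}{stroke_categories[-1]}"
```

Note that the operator cannot choose the starting stroke without first knowing whether the character has more than ten strokes, which is why the stroke count matters before input begins.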
Li's input system uses a graphic or positional rule for deciding on the order of the digits representing the categories rather than a rule based on the order in which one writes the strokes. This is stated to have the advantage of permitting one who is not conversant with Chinese to use the input system, as well as the advantage of eliminating the problems that would be caused in his system by variation in stroke order among operators, and the advantage of reducing duplications. Li limits the maximum length of any data string of digits for a character to nine digits in cases of duplication of data strings; but the normal number of digits is expected to be six. Thus, for complex characters, this coding scheme requires the operator to discard certain strokes.
Both Li's and Jiang's stroke systems result in a data string that is usually no longer than Wang's (six strokes for most characters), and Li, at least, claims significantly fewer duplicate strings than Wang shows (1.2% vs. 6.6%). Like Wang, however, Li and Jiang have created short data strings at the expense of ease of operation. The central issue for the computer operator that results from any categorization of strokes is deciding the appropriate category for each stroke of a character; error, indecision, and delay easily result from any ambiguities. Li's categories create operator confusion by having not only six categories for strokes but two additional categories for combinations of some of those same strokes in the six categories. He also puts dots in the same category with right-falling diagonals, and yet some dots are left-falling or vertical. And he makes no clear provision for handling strokes with more than one corner. Jiang's separate category for dots is no better, since dots can sometimes be confused with short diagonals. And creating separate categories for right-falling and left-falling strokes has been found in my research to increase operator error. Also, Jiang's lumping of all corner strokes together needlessly increases duplications.
An even more serious problem confronting an operator of Li's or Jiang's approaches is caused by their rules for skipping strokes in certain situations. In order to limit the data strings to six digits, Jiang requires that the operator count the number of strokes in the character before input begins if the operator is unsure whether there are fewer than ten strokes in the character; this is required because the operator must begin the string at either the first stroke or the sixth, depending on the total number of strokes. The operator must also count the input carefully so as not to enter fewer or more than six digits where appropriate. This is slow, error-prone, and very trying.
Similarly, in order to limit the data string to six digits, Li's system, like Jiang's, also requires the operator to skip many strokes in any complex character. Where a complex character includes two or more roots, a maximum of three digits is allocated for each root; if a complex character has more than three roots, the fourth and higher roots are simply discarded. A person operating the input system would have to look up, or be familiar with, which strokes or roots should be discarded for complex characters. Like Jiang, Li also requires exactly six strokes for each character; Li's system is therefore also quite burdensome for the operator. Li requires the operator to use a positional rule for stroke order which is confusingly similar to traditional writing order, but which differs from such order and yields data strings different from strings obtained using the traditional writing order. Thus operators to whom the traditional writing order has become second nature must unlearn it and replace it with Li's confusingly similar positional rule. Altogether Li's system is very difficult for the operator.
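The discarding of digits and roots described above can be sketched as follows. This is a minimal illustration, assuming each root is already available as a list of category digits in Li's positional order; the function name li_truncate and the example digits are hypothetical, and the sketch does not model Li's positional rule or the six-digit and nine-digit length limits.

```python
def li_truncate(roots, digits_per_root=3, max_roots=3):
    """Keep at most three digits from each of at most three roots.

    roots: list of roots, each a list of single-digit stroke category codes.
    Digits beyond the third in a root, and roots beyond the third, are dropped.
    """
    kept_roots = roots[:max_roots]
    return "".join(
        "".join(str(digit) for digit in root[:digits_per_root])
        for root in kept_roots
    )
```

For a character with four roots, the fourth root contributes nothing to the result, so the operator must know in advance which roots and strokes are to be left out.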
The systems proposed thus far for the input of ideographic characters such as Chinese characters are not entirely satisfactory. They are hard to learn and to apply, and are needlessly slow. It is therefore desirable to provide an input system where some of the above described difficulties are alleviated.
The Japanese language has common roots with the Chinese language. Many Chinese characters are used in the Japanese language, although some of such characters, known as Kanji, may be written slightly differently from their Chinese counterparts of the same meaning. In addition, Japanese also employs the Kana, a 46-symbol syllabary with two versions, Katakana and Hiragana. An input system adapted for entering Japanese must therefore be capable of entering the Kana as well as the Chinese characters. Several centuries ago, the Japanese arranged their Kana in a "50-sounds table". In time, the 50 sounds have been reduced to 46. Thus in some existing Japanese computer input systems, the 46 sounds can be entered through 46 individual keys on the keyboard, each key for one sound. However, because of the complexity of having to use 46 keys, such systems have not gained wide acceptance. The most widely used Japanese computer input systems convert the 46 sounds phonetically to English so that an English keyboard can be used for entering the Japanese Kana. This is not only slow, but is an impediment to potential Japanese operators who are not familiar with the English language. It is therefore desirable to provide an encoding system for Japanese Kana with improved characteristics.
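The phonetic conversion used by such systems can be sketched as a table lookup from English-letter keystrokes to Kana. The table below covers only ten Hiragana for illustration, and the names ROMAJI_TO_HIRAGANA and romaji_to_kana are hypothetical; a real system would need the full syllabary and further rules.

```python
# Partial, illustrative table: English-letter spelling -> Hiragana.
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
}


def romaji_to_kana(keystrokes):
    """Convert typed English letters to Kana, trying two-letter spellings first."""
    output = []
    position = 0
    while position < len(keystrokes):
        for length in (2, 1):
            chunk = keystrokes[position:position + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                output.append(ROMAJI_TO_HIRAGANA[chunk])
                position += len(chunk)
                break
        else:
            raise ValueError(f"no Kana for input at position {position}")
    return "".join(output)
```

In this sketch, entering the two Kana of "kaki" takes four keystrokes, which illustrates why phonetic entry through an English keyboard is slower than one key per sound.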