The Chinese language is reported to comprise about 30,000 characters. Some 8,000 are listed in a commonly used Chinese-English dictionary, these being sufficient for modern Chinese prose. A vocabulary of about 3,000 characters accounts for 95% of the characters in everyday use. Telegraph code books are limited to about 9,600 characters.
The 30,000 characters that comprise the writing system of the Chinese language are a heterogeneous set, and were created at different stages of the development of the language. The pronunciation, in general, has been assigned arbitrarily to the characters, and the strokes from which the characters are composed have no syntactic meaning in themselves. The characters are not amenable to classification in a well-structured system. The traditional Chinese dictionary arrangement is in accordance with the radicals and strokes comprising the character. This system has many deficiencies. There are 214 different radicals listed in the Kang Hsi dictionary, and it is sometimes difficult to determine the radical group to which a character is related, especially when it is not a phonetic compound or a complex ideograph. Looking up a character is tedious and involves some six steps. Also, there is considerable degeneracy, with up to 30 characters having the same radical-stroke number characteristic.
A second system that is sometimes used is one wherein the four corners of the character are assigned a number in accordance with the stroke types and configuration of the strokes in that corner. The rules are relatively complicated, and mis-coding frequently occurs. Degeneracy is also a problem; for example, in the 2000-2999 section of the four corner code table in the Xinhua Zidian (New Chinese Dictionary) 1971, there are 1599 characters defined by 885 codes.
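The degree of degeneracy implied by the figures quoted above may be illustrated by a short calculation. The following is a minimal sketch only; the variable names are chosen for illustration and the only data used are the two counts given in the text.

```python
# Counts quoted for the 2000-2999 section of the four-corner code table.
characters = 1599
codes = 885

# Average number of characters sharing one code; 1.0 would mean no degeneracy.
mean_per_code = characters / codes

# Characters in excess of one per code, i.e. the characters that cannot be
# given a code of their own within this section.
surplus = characters - codes

print(f"{mean_per_code:.2f} characters per code on average")
print(f"{surplus} characters in excess of one per code")
```

On these figures, each code in the section stands on average for nearly two characters.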
The Pinyin method of classification was introduced in Beijing (Peking) in the 1950s. The method involves a standardized phonetic system of representation using the Latin letters and tone indicators. The assignment of phonetic values necessitates a knowledge of the official dialect (Mandarin), and subtle differences in sound and tone must also be discerned in many characters. Pinyin spelling of characters involves considerable degeneracy.
Other systems of classification are also known, serving different purposes. The telegraph code lists some 9,600 characters numerically, thus avoiding degeneracy entirely. However, the list is accessed by operator search, or by memory in the case of commonly used characters, hence the system is slow and requires considerable training. More recently, Caldwell, U.S. Pat. No. 2,950,800, proposed a system based upon the type of stroke from which the characters are constructed, and the sequence thereof. Some 21 "basic" strokes were identified. Some degeneracy was observed, but this was relatively small in comparison to the more traditional systems. Moreover, the method did not necessitate a fine knowledge of the Chinese written language or a particular dialectal manner of pronunciation, hence it could be open for widespread use.
Once a Chinese character is converted into a code signal, such signal may be employed in an information processing system, such as for communication, printing, translation and machine control. Thus, Caldwell described an electro-mechanical keyboard device for inputting the code elements into an accumulator. The concatenated code elements in the accumulator were then converted into X-Y coordinates so as to select and control the position of a film matrix upon which the preformed characters were stored, whereby the selected coded character could be optically printed. More recently, micro-processor developments would readily permit the construction of electronic analogues embodying Caldwell's system, such as shown by Shashoua et al., U.S. Pat. No. 3,325,785. Still more recently, in accordance with well known procedures, writing instructions converted from code signals may be for CRT, LED or "liquid crystal" display, or for printing such as impact printing, matrix wire printing, hot point printing or jet printing, for example. Also, whilst such instructions may relate to writing pre-formed characters, they may relate to instructions for synthesising such characters. A simple synthesis was proposed by Li, U.S. Pat. No. 3,950,734, wherein a "prefix" and "suffix" were combined to form a character. More complex systems of synthesis in accordance with the stroke type and spatial configuration of the strokes are also known, for example as in the electronic system designed by Wakamatsu, U.S. Pat. No. 4,144,405, or in the various mechanical systems that have heretofore been proposed.
It is important to note here that the kinds of strokes for synthesis of the character for writing purposes are not well-defined. Most strokes are not known by name to the average Chinese writer, and the classification of such strokes into types is quite arbitrary. Whilst Caldwell defined and employed 21 such basic writing stroke types for encoding purposes, it has been recognized heretofore that a smaller number of stroke types would suffice for this purpose. A summary of different stroke types for coding systems which have heretofore been proposed is given by Stallings, "Pattern Recognition", Pergamon Press, Vol. 8, pp. 87-98 (1976). Cheung and Chan, in "Computer-aided instruction in Chinese characters", Proc. 1st Int. Symposium on Computers and Chinese Input/Output Systems, 599-616 (1973), identify some 31 different stroke types. Liu, in "Real Time Chinese Hand Writing Recognition Machine", MIT Cambridge, E.E. Thesis, 1966, identifies 19 stroke types. Yoshida and Eden, in "Handwritten Chinese Character Recognition by an Analysis-by-Synthesis Method", Proc. 1st Int. Conference on Pattern Recognition, 197-204 (1973), identify 7 stroke types, and Groner et al., "On-line computer classification of handprinted Chinese characters as a translation aid", IEEE Trans. Elect. Comput. 16, pp. 856-860 (1967), propose 5 types. The 7 stroke type and the 5 stroke type coding methods are referred to in greater detail subsequently herein.
A keyboard for encoding characters in accordance with stroke type and sequence may permit touch typing of the characters. Using his definition of 21 stroke types, Caldwell designed a keyboard with 21 "stroke keys", each assigned to one stroke type. However, it was at once apparent that the speed attainable with such a design would be, character for word, low in comparison to the average typing speed in the English language on a Qwerty keyboard, the average number of strokes per character being about 10, and the average number of keystrokes per English word being about 5.
Caldwell reduced the number of keystrokes per character by two expedients. The first was termed "minimum spelling", whereby the length of the code word (that is, the sequence of code elements corresponding to the stroke types) for a character was truncated so as to just distinguish the character from other characters comprising the vocabulary list, whilst avoiding redundancy. For example, when an operator keyed in the code word BGD EGV BDP BDP BGE GE, the keyboard would lock after the seventh key had been hit, as the further information was not required to distinguish the character from the remaining characters comprising the vocabulary list. In an expanded vocabulary list containing the code word BDG EGV BDP BDP BGE GF, which differs from the above example in the last code element only, all of the code elements are required to avoid degeneracy. It is apparent that the applicability of "minimum spelling" in reducing the length of code words is very much dependent upon vocabulary size. The second expedient was the addition of "entity keys", which keys generate a signal corresponding to a sequence of strokes as opposed to one stroke. Some 20 different "entities" were described, each representing several strokes in a specific spatial arrangement, often that of a radical or having other syntactical significance. For his relatively small vocabulary of 2,333 characters, Caldwell reported a reduction of the median value of 10.2 strokes per character to 6.7 using the "minimum spelling" method. When using a small sample drawn from the aforementioned vocabulary, Caldwell estimated that the average number of keystrokes necessary to enter a character, making full use of the "entity keys", was 4.7, which corresponds quite closely to the average word length in the English language. However, after a suitable period of training, the typing speed on such a keyboard was reported by Caldwell to be only 14 characters per minute.
Such typing speed is, of course, much less than is considered average for typing English words.
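The dependence of "minimum spelling" upon vocabulary size may be illustrated by the following minimal sketch. It is not Caldwell's mechanism; it merely assumes that each code word is a string in which one letter stands for one stroke key, and it finds for each character the shortest leading portion of its code word that no other code word shares. The function name, the character labels and the short code words are all hypothetical.

```python
def minimum_spellings(vocabulary):
    """Map each character to the shortest prefix of its code word that is
    shared by no other code word in the vocabulary.

    If no prefix is unique (for example, when the code word is itself a
    prefix of another), the full code word is kept.
    """
    result = {}
    for char, code in vocabulary.items():
        others = [c for k, c in vocabulary.items() if k != char]
        spelling = code  # fall back to the full code word
        for n in range(1, len(code) + 1):
            prefix = code[:n]
            if not any(other[:n] == prefix for other in others):
                spelling = prefix
                break
        result[char] = spelling
    return result


# Hypothetical code words; one letter per stroke key.
small_vocabulary = {"char_1": "BGDEGV", "char_3": "BDP", "char_4": "GE"}

# The expanded list adds a code word differing from char_1 in its last
# element only, so char_1 can no longer be truncated.
expanded_vocabulary = dict(small_vocabulary, char_2="BGDEGF")

print(minimum_spellings(small_vocabulary))
print(minimum_spellings(expanded_vocabulary))
```

In the smaller list the first character is distinguished after two keys, whereas in the expanded list all six of its code elements are required, which is the effect described above.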
We consider that stroke coding systems for Chinese type characters which employ highly discriminating "basic" strokes have inherent disadvantages which tend to limit the attainment of good typing speeds. For example, certain strokes have a close resemblance to other strokes; this may be conducive to error in coding, and considerable effort must be expended on the part of the typist to distinguish between the types. Further, whilst there is no theoretical limit to the number of word encoding keys which may be located on a keyboard, there would appear to be a practical limit beyond which touch typing becomes increasingly difficult. As a first approximation it is not believed to be desirable to exceed the 26 letter keys of a Qwerty keyboard. Thus, whilst Caldwell identified some 20 different "entities", only 6 were assigned a key position, together with the 21 "basic" stroke keys. This restriction on the number of keys severely limits the applicability of the entity keys, since the percentage of characters of an expanded vocabulary list which may be encoded using the assigned "entity" keys is necessarily limited. Still further, in accordance with Information Theory, an optimal coding system should have a set of code elements each of which is used an approximately equal number of times when coding an average text. A 21 basic stroke code system is far from optimal since, as stated by Caldwell, "90% of all Chinese writing is accounted for by only 9 of the 21 basic strokes" (op. cit.). The shortest uniform length binary signals that could be assigned to each of these stroke types would be 5 bits long, and such signals would be highly redundant. Hence, Caldwell employed Huffman's method of constructing minimum redundancy codes of non-uniform lengths (D. A. Huffman, "A method for the construction of Minimum-Redundancy Codes", Proc. I.R.E., 40, pp. 1098-1101, 1952).
Such non-uniform length signals for code elements are used in serial transmission of information, and pose no problems for large computers with large accumulators. However, in smaller information processing systems where the accumulators commonly have 8 to 16 bits, additional circuits and components are required before such code signals can be processed.
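The contrast between uniform and non-uniform length signals may be illustrated by the following minimal sketch of Huffman's construction. The stroke frequencies are assumptions made purely for illustration: 9 "common" strokes are taken to share 90% of usage equally and 12 "rare" strokes to share the remaining 10% equally. Caldwell's actual frequencies and code assignments are not reproduced here, and all names are hypothetical.

```python
import heapq
import math

# Assumed relative frequencies (not Caldwell's data): 9 common strokes with
# weight 12 each (108 of 120 = 90%) and 12 rare strokes with weight 1 each.
weights = {f"common_{i}": 12 for i in range(9)}
weights.update({f"rare_{i}": 1 for i in range(12)})


def huffman_code_lengths(weights):
    """Return the Huffman code length, in bits, of each symbol."""
    # Heap entries are (weight, tie_breaker, symbols in this subtree); the
    # unique tie_breaker ensures the symbol lists are never compared.
    heap = [(w, i, [s]) for i, (s, w) in enumerate(weights.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in weights}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol beneath them.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths


lengths = huffman_code_lengths(weights)
total_weight = sum(weights.values())
total_bits = sum(weights[s] * lengths[s] for s in weights)
average_bits = total_bits / total_weight

# Shortest uniform-length binary signal able to distinguish 21 stroke types.
fixed_bits = math.ceil(math.log2(len(weights)))

print(f"uniform length: {fixed_bits} bits per stroke")
print(f"Huffman average: {average_bits:.2f} bits per stroke")
print(f"Huffman lengths range from {min(lengths.values())} "
      f"to {max(lengths.values())} bits")
```

Under these assumed frequencies the non-uniform code averages somewhat under 4 bits per stroke as against 5 bits for the uniform code, but the individual signals are of differing lengths, which is the property that complicates processing in systems with small accumulators.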