A glyph is the basic regular printed unit of a writing system. In English, a glyph is a letter or a punctuation mark. A word in English is any segment of written or printed discourse that ordinarily appears between spaces or between a space and a punctuation mark, and a word has unique semantics. A glyph in Chinese is shaped as a square unit. In Chinese, a glyph with unique semantics in written or printed discourse is a character, sometimes referred to as an ideographic symbol. A stroke is a glyph without semantics that consists of a single mark or dash made by one movement of a writing device. A radical is a partial character. Some glyphs of the same stroke pattern are both a character and a radical, differing only in their overall circumference; these are referred to as stand-alone radicals. A radical may contain other radicals; the contained radicals are sub-radicals.
In English, the basic building block of the writing system is the letter, and there is only one way to lay it down in a discourse. In Chinese, the basic building block of the writing system is the character. Because the number of characters runs well into the thousands, typing a discourse with several thousand symbols is far more difficult than typing in English. The problem, therefore, is how to represent characters linearly with a small number of symbols in a fixed order, so that a keyboard with about 55 keys can be used to produce a Chinese discourse. Since the American invention of the typewriter, there have been countless attempts to design a better linear form for Chinese characters, as in the paper "A Solution to the Ideographic Character Identification Problem" by George K. Kostopoulos and "The PINXXIEE Formula" by Wen Tien.
In this document, a few conventions are used to describe a character. A pair of round parentheses, ( ), follows each ideographic symbol and contains its pronunciation, with a digit in the pronunciation to indicate its tone. For example, , (ma3), uses the standard Pinyin pronunciation system with the third (dipping) tone.
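The tone-digit convention can be illustrated with a minimal sketch that splits a syllable such as "ma3" into its letters and its tone. The function name `parse_pinyin` and the accepted tone range of 1 to 5 (with 5 standing for the neutral tone) are assumptions made for this illustration and are not part of the convention described above.

```python
import re
from typing import Tuple

# Assumed pattern: lowercase Pinyin letters followed by one tone digit.
# Accepting 1-5 (5 = neutral tone) is a common convention, assumed here.
_SYLLABLE = re.compile(r"([a-zü]+)([1-5])")

def parse_pinyin(syllable: str) -> Tuple[str, int]:
    """Split a tone-numbered syllable into (letters, tone)."""
    match = _SYLLABLE.fullmatch(syllable.strip().lower())
    if match is None:
        raise ValueError(f"not a tone-numbered syllable: {syllable!r}")
    return match.group(1), int(match.group(2))

print(parse_pinyin("ma3"))  # ('ma', 3)
```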
Three basic approaches are used in such a linear representation: phonetic based, stroke based and radical based. The phonetic approaches are mostly based on an existing phonetic standard, such as the Pinyin system of Mainland China, the Zhuyin system of Taiwan, or the Katakana of Japan. Although there are objections to imposing such a standard pronunciation on a language, some computer software packages are able to support certain dialects on top of the standard systems.
The stroke approach to Chinese character search is rooted in history. A basic stroke set contains 4 to 10 different stroke patterns, where each pattern is composed of one to three basic strokes. For example, U.S. Pat. No. 4,684,926 uses 5 single-stroke patterns, PINXXIEE defines 10 single-stroke patterns, and U.S. Pat. No. 4,500,872 uses the Four Corner Code definition, which includes 4 single-stroke, 4 bi-stroke and 3 tri-stroke patterns. The standards are converging on the 5 Stroke code and the Four Corner Code.
Whatever representation method is used, one fundamental question cannot be avoided: how to break characters down into a manageable radical set. The stroke based methods have to deal with this issue to make it possible to organize the many combinations of strokes. The phonetic based methods have to deal with it to discriminate among homophones. Four problems are encountered in attempting to define a radical set. The first problem is how to decompose a character, that is, to decide which part of the glyph to consider as a building block, or root symbol. The second problem is to determine in what order these root symbols should be listed. The third problem is how to represent a root symbol, and the fourth problem is deciding which parts of the representation to use in the encoding.
A typical radical set of an ordinary dictionary may include a single-stroke pattern set of 5, a multiple-stroke pattern set of about 50-60 and a stand-alone radical set of 200-250, for a total of about 250-300 symbols. The majority of stand-alone radicals contain 4 or more single strokes. U.S. Pat. No. 4,684,926 has devised two levels of radicals, the basic 5 stroke level and a root level. In the root level, four classes of radicals are defined: the key class has 25 stand-alone radicals, the stroke-root class has 44 members, the main-root class has 97 members and the derivative-root class includes 70 members. The total of the non-stand-alone radicals is about 120, which is twice as many as that of a typical dictionary. This is the reason for the shortened retention time and prolonged learning time for persons using the encoding system.
Conventionally, three objectives have directed efforts to achieve a satisfactory coding scheme. The first is to minimize the number of key strokes needed to express each ideographic symbol. The number of key strokes for each character ranges between 2 and 9. Two key strokes per character is readily attainable by trained operators. The second is to assure that no code sequence represents more than one ideographic symbol. The resulting encoding methods are often such that the shorter the average code length, the more cryptic the code, so that retention in human memory over time is poorer, as in the case of U.S. Pat. Nos. 4,379,288, 4,531,119, 4,684,926 and The Natural Code.
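The first two objectives can be checked mechanically for any candidate code table. The sketch below is illustrative only: `evaluate_code_table`, the placeholder symbol names and the toy codes are all hypothetical and do not come from any of the cited encoding methods. It reports the average code length (first objective) and any code sequence shared by more than one symbol (a violation of the second objective).

```python
from collections import defaultdict

def evaluate_code_table(table):
    """Return (average code length, codes shared by several symbols)."""
    if not table:
        raise ValueError("code table is empty")
    by_code = defaultdict(list)
    for symbol, code in table.items():
        by_code[code].append(symbol)
    # A code that maps to more than one symbol violates uniqueness.
    collisions = {c: s for c, s in by_code.items() if len(s) > 1}
    average = sum(len(code) for code in table.values()) / len(table)
    return average, collisions

# Toy table with placeholder symbols and made-up codes (assumptions).
toy = {"symbol_a": "ab", "symbol_b": "abc", "symbol_c": "ab", "symbol_d": "xyzw"}
average, collisions = evaluate_code_table(toy)
print(average)     # 2.75
print(collisions)  # {'ab': ['symbol_a', 'symbol_c']}
```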
Since the first two objectives produce encoding schemes with shorter retention times, the third objective has been emphasized: make the coding rules simple. The simpler the encoding rules are, the longer the retention time will be. The more easily learned methods can often be retained longer in human memory, as in U.S. Pat. No. 4,872,196 and other phonetic based encodings, but each of the frequent occurrences of homophones requires the operator to stop typing and look for the correct character or word on the screen. After finding the desired symbols, the operator may either type the next key or use a pointing device to select the correct entry from the screen. These incidents of typing, visually searching, and selecting from the screen, either with a pointing device or by typing the code, are called session switches. The operator often has to type and select to choose the right symbols, as is the case in U.S. Pat. No. 4,531,119. Another way to gain longer retention time is to design the key layout in a logical fashion, such as the effort made in U.S. Pat. No. 4,684,92.
Observing that frequent session switches reduce the speed of typing, the inventor introduces a fourth objective: minimize session switches in an encoding system. This objective has been used for Chinese dictionary indexing. For example, stroke number based indexing and phonetic based indexing are used in regular dictionaries. Stroke based indexing requires the operator to count the number of strokes. It is slow, and it is inconsistent because a character can be written in different ways. Phonetic based indexing incurs very frequent session switches due to homophones, which U.S. Pat. No. 4,531,119 resolves by visual selection from the screen.
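The link between homophones and session switches can be made concrete with a small sketch. The candidate table below is a toy example assumed for illustration, and `count_session_switches` is a hypothetical helper; it counts a session switch whenever a typed phonetic code matches more than one character, so that the operator would have to stop and select from the screen.

```python
# Toy candidate table (an assumption for illustration): each tone-numbered
# Pinyin code maps to characters that share that pronunciation.
CANDIDATES = {
    "ma3": ["马", "码", "玛"],
    "shi4": ["是", "事", "市", "试"],
    "wo3": ["我"],
}

def count_session_switches(codes, candidates=CANDIDATES):
    """Count typed codes that force a visual search and selection."""
    switches = 0
    for code in codes:
        options = candidates.get(code, [])
        if len(options) > 1:  # homophones: typing alone cannot decide
            switches += 1
    return switches

typed = ["wo3", "shi4", "ma3"]
print(count_session_switches(typed))  # 2
```

In this toy input, two of the three codes are ambiguous, so a purely phonetic entry would interrupt the operator twice in three characters.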
It is also noted that all the prior encoding methods are based on coding one character at a time. U.S. Pat. No. 4,684,926 claims to include phrasal encoding, but its primary coding method is based on individual characters, and its phrasal encoding is derived from its character encoding and is limited to a small number of preselected, frequently used phrases.