Several modern languages are written with scripts that utilize symbols known as Chinese characters. These symbols, which are also known as Han characters or ideographs, originated in China several thousand years ago. In the modern languages which use the symbols, the literal designations for “Chinese characters” vary, such as Hanzi (Chinese), Kanji (Japanese), and Hanja (Korean). The “modern” forms of the Chinese characters have been in continuous use for more than 15 centuries. Earlier forms were in use more than 30 centuries ago.
The major languages that today utilize Chinese characters in their writing systems are Japanese, Korean and the numerous dialects of the Chinese language family (such as the better-known Mandarin and Cantonese). The Japanese and Korean languages share no common linguistic roots with the various Chinese languages, but Chinese characters were borrowed from the Chinese writing system and adapted to the Japanese and Korean systems when these were developed several centuries after the development of the Chinese system.
Virtually all of the spoken Chinese languages have evolved in the presence of the well refined Chinese writing system that has been in continuous existence for several millennia. The writing system and the spoken languages evolved together, each constraining in certain ways the evolution of the other. In the case of Japanese and Korean, however, the spoken languages evolved to something close to their present form without a writing system and completely independent from the evolution of Chinese languages.
All languages have fundamental units called words although what precisely constitutes a word in a particular language is often a subject of debate. In most writing systems, space is used to separate words. This practice of separating words with spaces makes the word boundaries very clear. In the Korean writing system that uses Chinese characters, this practice of spacing words is also used. However, in the Chinese writing system there is no space between words and the distinction between words and phrases is less clear.
Orthographies (i.e., writing systems) generally incorporate a combination of the following elements: (1) a symbology for writing the spoken words of the language, (2) a symbology for the punctuation of language elements, (3) a symbology for writing foreign words, and (4) a symbology for non-word symbols such as currency signs, trademarks, etc. The writing systems of English, Chinese, Japanese and Korean all contain these elements.
In the Chinese writing system, each Chinese character corresponds with a syllable of the spoken language. Words, however, may be comprised of one, two, three or more characters, and each character represents a separate syllable in the spoken form. Chinese words are often referred to as character compounds or phrases because of the sometimes blurry distinction between words and phrases.
Most of the spoken Chinese languages have evolved through the millennia to the extent that they are mutually unintelligible. Monolingual speakers of the Cantonese dialect, for example, cannot understand the spoken Mandarin dialect any more than they can understand English. The dialects, in essence, are completely different languages that merely share common roots. All of these Chinese languages, however, have evolved through the centuries in coexistence with a common system of writing based on the principle of a correspondence between syllable and character. While literate Cantonese and Mandarin speakers may read characters with different pronunciations, they can achieve a common understanding of each other's writings because Chinese characters symbolize a meaning independent of their phonetic enunciations. Of course, the common understanding is tempered by differences in grammar and literary style which influence comprehension.
In addition to the method of writing Chinese words exclusively in characters, Chinese writings also contain several punctuation elements and many characters which act as modifiers. Foreign words are generally written in characters whose Chinese readings sound similar to the foreign word. Because these sounds are different in different dialects, written words formed in this way will not usually have the same properties as ordinary Chinese words.
In the Japanese and Korean languages, both of which adapted Chinese characters for their writing systems, the correspondence between syllables and characters is not universally present. A single Chinese character may be read as a multiple syllable Japanese or Korean word. Both written Japanese and Korean are mixed systems that use both Chinese characters and phonetic symbols developed uniquely and independently by the Koreans and Japanese. For example, a Chinese reader who does not understand Japanese may recognize a considerable fraction of the Chinese characters in written Japanese and have some clues to the meaning of the Japanese text, but not much beyond that. A similar situation exists with Korean with respect to both Chinese and Japanese.
The phonetic symbols in Japanese are a syllabary of sounds of spoken Japanese and are called Kana. Each symbol in the Kana is a complete syllable. This is possible because of the relatively small number of different syllables in Japanese. Kana may be used alone as words, in conjunction with Kanji (the Japanese use of Chinese characters; see page 1, supra), or as modifiers to other words written in Kana or Kanji. It is also possible to “romanize” the Kana by writing the Kana themselves with Latin letters. Thus, written Japanese is a mix of Kana and Kanji with various words written in one, the other, or both.
The phonetic symbols in Korean are called Hangul. Hangul represents syllables of spoken Korean written as a composite symbol built from several phonetic components assembled inside an imaginary square block. Rather than writing a syllable as a linear sequence of letters, Hangul elements are combined into one composite symbol confined in the square block that represents a syllable of spoken Korean. Like Japanese, Korean writings can be a mix of phonetic Korean symbols and Chinese characters called Hanja (see page 1, supra). Unlike Japanese, much of ordinary Korean text avoids the use of Hanja, and Korean script is usually entirely phonetic.
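The block-composition principle described above can be illustrated with a short Python sketch. It relies on the arithmetic that the Unicode Standard defines for precomposed Hangul syllables, which is an assumption brought in purely for illustration and is not part of the systems discussed in this background. The function name compose_hangul is hypothetical.

```python
# Sketch only: models Hangul block composition using the Unicode Hangul
# syllable arithmetic (an illustrative assumption, not from the source text).

S_BASE = 0xAC00   # code point of the first precomposed Hangul syllable
V_COUNT = 21      # number of medial vowels
T_COUNT = 28      # number of finals, counting "no final" as index 0

# Phonetic components in Unicode order.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
         "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
         "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]


def compose_hangul(lead, vowel, tail=""):
    """Combine phonetic components into one square-block syllable."""
    l_index = LEADS.index(lead)
    v_index = VOWELS.index(vowel)
    t_index = TAILS.index(tail)
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


if __name__ == "__main__":
    # Three components each, assembled into a single composite symbol.
    print(compose_hangul("ㅎ", "ㅏ", "ㄴ") + compose_hangul("ㄱ", "ㅡ", "ㄹ"))
```

Each call takes a linear list of components and yields a single composite symbol, which mirrors the contrast drawn above between a linear sequence of letters and a syllable confined to one square block.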
While Chinese is always written with characters, phonetic systems have been developed to aid in the pronunciation and teaching of Chinese characters. Among these systems are the Chinese phonetic alphabet (also known as BoPoMoFo) which has become the standard phonetic system in Taiwan, and the Pinyin romanization which has become the standard phonetic system in the People's Republic of China. Both of these systems have been widely used for decades as an adjunct to teaching Chinese language and writing, but neither system functions as a writing system by itself. The Chinese phonetic systems have, however, been adapted as means of inputting Chinese characters into computers. Representative examples are described in U.S. Pat. Nos. 5,212,638 and 5,360,343.
Chinese Character Properties
Chinese Characters are orthographic symbols of several basic types which include pictographs, indicatives and various compound forms. Pictographs are essentially pictures that are often abstracted. Indicatives are form directions suggestive of meaning. The various compound forms include combinations of at least two pictographs or indicatives that together suggest a meaning. Other compound forms include those with elements that relate to the pronunciation and sound associated with the character. Such characters with phonetic elements are by far the most numerous.
There are many thousands of Chinese characters. The 2nd century dictionary by Xu Shen listed approximately 10,000 characters. Approximately 50,000 Chinese characters were cataloged in the seminal 18th century “Kang Xi” dictionary compiled by Kang Xi and his associates. Today, the majority of “fully” literate Chinese know a few thousand characters. These several thousand characters are used to write the tens of thousands of words used in modern Chinese writing.
Chinese characters are drawn by brush, pencil or pen from a repertoire of about 30 basic strokes. The complete character is drawn within an imaginary square box. Characters can vary from a single stroke to more than 30 individual strokes. From the 30 basic strokes, there are many variations according to size and position.
The more complex characters (which are the majority of all characters) are normally comprised of several sub-units where each sub-unit is a smaller or abstracted version of other characters. These sub-unit structures allow the Chinese to realistically deal with the thousands of characters available for writing. Most characters consist of 2, 3 or 4 sub-units from a set of only a couple of hundred basic sub-units. The 18th century Kang Xi organized characters using 214 of such sub-units which are referred to as “radicals” in the West. Characters are often related to each other through these sub-units some of which may indicate meaning (quite universally) or sound (in some dialects which may no longer be spoken). The radicals of a particular character are typically drawn as individual units. There are cases, however, where the drawing sequence is interrupted, for example, when a radical is drawn within another enclosing subunit.
In typical Chinese text, the frequency of occurrence of a particular character has an exponential distribution as shown in Table 1.
TABLE 1
Most Common Characters     Accumulated Frequency
The first one              4.0%
Top 100 characters         39.99%
Top 500 characters         75.86%
Top 1,000 characters       89.12%
Top 2,500 characters       98.49%
Top 5,000 characters       99.89%
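The accumulated figures in Table 1 can be differenced to see how much each additional band of characters contributes. The following Python sketch performs only that arithmetic on the values of Table 1. The function name marginal_coverage is hypothetical, and the per-character averages are derived quantities, not figures from the source.

```python
# Accumulated frequencies copied from Table 1: (rank cutoff, percent).
TABLE_1 = [
    (1, 4.0),
    (100, 39.99),
    (500, 75.86),
    (1000, 89.12),
    (2500, 98.49),
    (5000, 99.89),
]


def marginal_coverage(table):
    """Return (first_rank, last_rank, added_percent, avg_percent_per_char)
    for each band between consecutive rows of the table."""
    bands = []
    prev_rank, prev_cum = 0, 0.0
    for rank, cum in table:
        added = cum - prev_cum
        per_char = added / (rank - prev_rank)
        bands.append((prev_rank + 1, rank, round(added, 2), per_char))
        prev_rank, prev_cum = rank, cum
    return bands


if __name__ == "__main__":
    for first, last, added, per_char in marginal_coverage(TABLE_1):
        print(f"ranks {first:>4}-{last:<4}  adds {added:5.2f}%  "
              f"(about {per_char:.4f}% per character)")
```

Under these figures, the characters ranked 2,501 through 5,000 together add only 1.40 percentage points, which is why input methods tend to concentrate on a few thousand common characters.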
Machine Input for Chinese Characters
The earliest known systems for the machine input of Chinese characters relied on arbitrary codes. For example, the Chinese telegraph code defines a selection of 9999 characters using a 4-digit numeric code. A highly trained individual would essentially memorize the entire coding of characters and key in the corresponding code. A distinct advantage of code-based systems is that they readily allow blind operation, i.e., an operator who has learned the code can enter characters without removing his eyes from a source document, much like an accountant with an add-punch machine. The problem, of course, is the difficulty of remembering such a massive set of code numbers. When the code for a character is not immediately known to the operator, a dramatic reduction in throughput results because of the need to consult some sort of reference.
The characteristics of Latin-based writing systems are such that it was relatively easy to create a typewriter key system with one key for each of the 26 letters. The transition from manual typewriting devices to keyboards for computer input was a simple adaptation. In the case of Chinese characters, however, the need to accommodate the many thousands of characters has been problematic. For instance, U.S. Pat. Nos. 2,950,800, 4,379,288 and 4,951,202 describe specially designed machines and keyboards in attempts to establish a comparable means for encoding Chinese characters.
Another approach to entering Chinese characters is to use an intermediate system based on the sounds of the characters in the local language. In the case of Putonghua, the standard dialect of Mandarin Chinese within the People's Republic of China, there are about 400 distinct syllables if one ignores the tones. There are, thus, many characters with essentially the same sound, and there are also difficulties in distinguishing many of the sounds for those with an active dialect that is different from Mandarin. There exist also many cases in which the forms of rarer characters are known to an individual, but the pronunciation is not. Despite these difficulties, phonetic systems are presently the most popular forms of input and retrieval of Chinese characters for computer users. Representative examples are described in U.S. Pat. Nos. 4,500,872, 4,937,745, 5,255,189 and 5,319,552.
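The homophone problem described above can be sketched in a few lines of Python. The candidate table below is a toy with two toneless Pinyin syllables and a handful of common characters for each. The names CANDIDATES, lookup and select are hypothetical and do not reproduce the method of any of the cited patents.

```python
# Toy table: toneless Pinyin syllable -> some characters read with that sound.
# Only a few entries are shown; a practical table would list far more
# characters for each of the roughly 400 syllables.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "shi": ["是", "十", "时", "事", "市"],
}


def lookup(syllable):
    """Return every candidate character for a toneless syllable."""
    return CANDIDATES.get(syllable.lower(), [])


def select(syllable, choice):
    """Pick one candidate by its 1-based position in the list.

    This stands in for the extra keystroke an operator needs in order to
    disambiguate characters that share the same sound.
    """
    options = lookup(syllable)
    if not 1 <= choice <= len(options):
        raise ValueError("no such candidate")
    return options[choice - 1]


if __name__ == "__main__":
    print(lookup("shi"))      # several characters share one syllable
    print(select("shi", 2))   # a second step is needed to pick one
```

The sketch also shows the limitation noted above: an operator who knows the form of a character but not its pronunciation has no syllable to type in the first place.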
Attempts to adapt the western “QWERTY” keyboard to implement coding of Chinese characters are also known, such as those described in U.S. Pat. Nos. 4,684,926 and 5,187,480. A practical system for entering the thousands of possible Chinese characters would be a substantial benefit to those who require the use of written Chinese characters, provided it does not resort to massive keyboards and new machines, to complex and intricate systems that adapt western keyboards to Chinese character input by providing printed legends to replace the 26 Latin letters, or to phonetics and intricate analog codes.
U.S. Pat. No. 5,109,352 describes call-up of characters based on (1) a classification of the basic strokes into a relatively smaller number of basic categories and (2) sequential entry of the stroke categories in the conventional order in which they are written. According to the teachings of the '352 patent, the number of strokes required to produce the desired character can be large. Although the order for writing strokes is generally consistent, significant differences do exist. For characters with a large number of strokes, the probability of a particular operator getting all of the strokes correct can be quite small. As the '352 patent teaches, storing alternative codings of the strokes for characters can mitigate these errors. This approach, however, if applied too frequently, can reduce the effectiveness of the system because the codes become less unique.
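The general idea of calling up characters by sequential stroke categories can be sketched as follows. This Python example is not the coding of the '352 patent. The five-way stroke classification, the six-character table, and the alternative coding stored for one character are all assumptions made for illustration.

```python
# Hypothetical stroke categories (an assumption, not the '352 classification):
#   H = horizontal, S = vertical, P = left-falling,
#   N = right-falling or dot, Z = turning
# Each character maps to one or more category sequences in writing order.
STROKE_CODES = {
    "十": ["HS", "SH"],   # "SH" is a made-up alternative order, stored
                          # to tolerate an operator's mistake
    "土": ["HSH"],
    "大": ["HPN"],
    "天": ["HHPN"],
    "木": ["HSPN"],
    "人": ["PN"],
}


def candidates(prefix, table=STROKE_CODES):
    """Return the characters having any stored coding that starts with
    the stroke categories entered so far."""
    prefix = prefix.upper()
    return [
        ch for ch, codes in table.items()
        if any(code.startswith(prefix) for code in codes)
    ]


if __name__ == "__main__":
    for entered in ("H", "HS", "HSP", "SH"):
        print(entered, candidates(entered))
```

Each additional stroke narrows the candidate list, while each stored alternative coding widens the set of prefixes that reach a given character. That widening is the loss of uniqueness noted above when alternatives are stored too freely.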
Additionally, an operator who does not know the proper order and classification for a particular character must resort to the awkward and time consuming process of trial and error. This can dramatically slow the overall average rate at which characters are inputted. This is particularly true for many situations where there is uncertainty in more than one stroke resulting in several possible permutations and combinations. In this case the operator may be “stuck” and be forced to consult a reference.
Another approach to the problem of entering Chinese characters is the use of systems based on radicals (as defined on page 7, supra). Two such systems are described in U.S. Pat. Nos. 5,119,296 and 5,197,810. These systems are based on decomposition of characters into their constituent structures, classification of radicals according to some rules or relations, and assignment of fixed locations on the keyboard for each radical, typically on multiple pages.
Such systems using radicals all have relatively complicated coding systems, rigid rules and inflexible keyboard assignments, often organized into the multiple pages noted above. The radicals are normally drawn from the original 214 radicals of the Kang Xi dictionary, suitably modified to account for the simplification of certain characters that has occurred in the People's Republic of China. These radicals, designed as they were for the purpose of classification of characters, do not include all of the significant sets of forms drawn normally as a group. These factors make such systems difficult to learn and awkward to use.
None of the systems described in the prior art possesses a completely satisfactory combination of ease of use, ease of learning the system and overall speed of text entry. A need exists in the art for a simple way to reduce the many complexities of constructing Chinese characters and inputting them into modern machines for today's users. The following objects address these unresolved problems in the art.