This invention relates to a method and apparatus for selecting, storing and displaying Chinese script characters such as are used not only in the Chinese language but also extensively in the Japanese and Korean languages. The references hereafter to the Chinese language should be taken as including all other languages wherein Chinese script characters are employed.
The invention relates particularly to a word processing system for Chinese script characters wherein each character is selected from a large pool of characters according to predetermined rules which are natural to those familiar with the phonetic and graphical structure of Chinese. Moreover, the characters are stored in a manner adapted to economize substantially on computer memory.
There are several systems currently available for selecting Chinese script characters for input into a word processing system. In order to understand these methods, and more particularly their shortcomings, it is necessary to understand first the particular features of Chinese, as distinct from Western phonetic languages, which have tended to limit the access of Chinese-speaking people to typewriters, computers and word processing systems in general.
The conventional keyboard, with fewer than 100 keys, is designed for languages with phonetic scripts, such languages having a small set of graphic characters, i.e. letters. If such a keyboard were to be used in a corresponding manner for the direct input of Chinese script, it would require many thousands of keys since, unlike Western phonetic languages, Chinese has many thousands of characters. Thus, the conventional keyboard is impractical for Chinese character input.
In order to understand the prior art methods for selecting Chinese script characters, the structure of the Chinese language must be understood. Chinese has a constrained phonetic structure. In spoken Chinese there are only 412 basic phonetic units, each having a monosyllabic sound. Four tonal patterns can potentially be applied to each phonetic unit, resulting in slightly more than one thousand distinct sounds. In comparison, phonetic languages, such as English, may have many thousands of distinct sounds.
Each distinct sound in spoken Chinese is a morpheme, i.e. a semantically meaningful expression such as GO, SIT, MOON in English. In general, spoken Chinese does not have meaningless syllables, such as SEN, MIN, GA in English, except for a handful of suffixes and affixes.
Whilst each distinct sound in Chinese is itself a meaningful semantic unit, in fact most phonetic units have several, sometimes even dozens of, different semantic meanings. The potential confusion in spoken Chinese, which could have resulted from the fact that nearly all such sounds each have many meanings, is solved in a unique way. The majority of Chinese words are expressed by a combination of two sounds, with each of the two sounds having its own meaning, and the double sound having a meaning which may be related or unrelated to that of the constituent parts. Although most Chinese morphemes can have many different meanings, when two such morphemes are combined, the resulting dimorphemic word is most often unambiguous.
The disyllabic structure of Chinese words has also influenced the structure of more complex expressions. Spoken and written Chinese has accumulated throughout its history a large number of phrases and idioms, which in many cases are vocalized by means of four distinct sounds and are written by means of four characters. In many cases these compound expressions are combinations of two disyllabic units, and often represent a complex semantic idea.
Whilst each phonetic unit and tone combination in Chinese can typically have many meanings, Chinese script characters mostly have a single or principal meaning. This is in marked contrast to Western phonetic languages, wherein ideas are usually communicated by means of a distinct vocal expression which is, in effect, character-encoded when those ideas are expressed in written form. In Chinese and associated languages, the idea is itself directly represented within the script and may therefore be interpreted regardless of the phonetic dialect of the reader.
Although most Chinese script is expressed vocally in a unique manner, several hundreds of the many thousands of Chinese script characters can be expressed vocally in more than one way. Additionally, some semantic units are sounded differently in colloquial speech than in literary expressions. Most such variations relate to the tonal pattern, although some relate to the phonetic unit itself.
Chinese script emerged from pictures depicting concrete objects. During the evolution of the Chinese script, new characters were formed by borrowing complete or partial images from existing characters, as well as by modifying the form of existing characters. Therefore, there are components and parts which appear in more than one character, although they do not always appear in identical form or location in each of the different characters. Moreover, the representation of characters in terms of their components does not follow systematic rules owing to the varied historical development of Chinese script.
Those components which appear in several characters are generally of two kinds:
(1) Phonetic indicators that were mostly borrowed from a formerly known character with a specific sound, and were used to express how the new character is sounded vocally, and
(2) Semantic indicators, or radicals, that were mostly borrowed from a known character with a certain meaning, and were designed to express the ideological source of the new character.
However, the exact meaning of a character cannot always be derived from the radical, nor do the phonetic elements give a clear visual indication as to how the character is sounded. Each character must be separately learned and memorized as a whole unit, comprising a combined shape, sound and meaning.
It will be understood from the foregoing, that a Chinese word is read not as a combination of images, as in Western phonetic languages, but rather as a unique visual template, whose complete form indicates a specific meaning. Some templates might share graphic components with other templates, but the combination of such parts in no way resembles the process in which letters are combined so as to form words in a phonetic language. Also, whilst most templates share a phonetic association with other templates, each template is uniquely associated with a distinct meaning.
Conversely, when writing phonetic scripts, letters are combined into phonetic strings in order to produce words. Each word will usually have a unique sound. In Chinese, a word is expressed by means of a unique, complete picture which has an associated sound, even though this sound will be common to many other characters.
In most languages, Chinese included, the frequency of usage of different words varies widely. Some words are used very frequently, whereas others are encountered only rarely. Typically, a rather limited set of fewer than one thousand words constitutes nearly 90% of the word count in spoken and written vocabulary, with some words each accounting for as much as 5% of the total word usage. This is true in Chinese for both monomorphemic and dimorphemic words constituted, respectively, by one or two characters.
Since Chinese script characters are used directly to represent ideas, different groups of people will utilize different characters in rather the same way that they will employ different vocabularies in phonetic languages. Thus, since an engineer uses a different vocabulary to that of a lawyer or doctor, he will also be familiar with a different set of characters over and above the basic general set. This situation is not encountered in phonetic scripts, wherein all words share the same limited set of letters and, at least phonetically, any word can be read.
The various prior art methods for selecting Chinese script characters in word processing systems are based on one or more of the properties of the structure of written Chinese explained above. Thus, for example, in the recognition and matching method, a character is selected from a huge static display in which all characters are shown simultaneously. A character is selected directly in the same way that letters are selected directly in phonetic scripts. The drawback of such a method is that it is difficult to identify a required character from such a large character display; the device is physically large; and even for highly trained operators the method is tedious and relatively slow.
In an alternative system, each character is assigned a numeric or Latin alphabetic code which is typed on a conventional keyboard. The code is then translated so as to select the corresponding Chinese script character. This method demands that the code for each character be memorized, and its use is therefore limited to highly skilled personnel.
In the reconstruction method, characters are recombined from their component parts which, as was explained previously, may be common to more than one character as a result of the evolution and development of the Chinese script. The drawback with such a method is that a large number of components (214 radical elements and 858 phonetic elements) is required to generate all Chinese script characters. Moreover, the components vary in shape and location within different characters, even further increasing the total number of graphic elements requiring representation. In one practical embodiment of such a system, a keyboard is provided having several hundred keys corresponding to every possible component. A character is generated by typing several strokes in sequence.
An adaptation of this method is the Chan Jie method which is used in personal computers. In this method, several dozen key components are assigned to the keys of a standard computer keyboard. A character is selected by entering a corresponding combination and sequence of these components. This process demands considerable training, since each manufacturer utilizes a different strategy for correlating the components with the small number of keys available.
Another variation of this method, developed in Taiwan, is the three-corner method wherein the components are assigned numeric as opposed to alphabetic codes. Each character is expressed by means of three 2-digit codes.
Underlying the Chan Jie, the three-corner and all other reconstruction methods is an attempt to use standard alpha-numeric keyboards to construct a character as a series of predetermined components. This is analogous to the construction of words in a phonetic script wherein the components are constituted by the letters spelling the word. However, this method of construction is unsuitable for Chinese, whose character set contains hundreds of components, each character being constructed from a small number of these components but not in accordance with a defined set of rules. Therefore, it is virtually impossible to devise a logically consistent method, which is also easy to learn, for constructing Chinese characters from their components, and this drawback is reflected in all of the reconstruction methods.
An alternative method for generating Chinese characters is by specifying the strokes from which each character is built. There is a limited number of basic strokes, each character being composed from between 1 and 33 such strokes, according to strict rules regarding the order of stroke entry. Therefore, it is possible, by specifying a small number of basic strokes, to display a relatively small group of characters in which the same basic strokes appear in the specified order. The desired character is then selected from this display. In one practical application of this method, only the first and last strokes of the desired character are input, all characters sharing the same first-last stroke combination being displayed for final selection. It is both unnatural and demands concentration to select a character by specifying its first and last strokes, particularly for those characters having a larger than average number of strokes. Thus, although this method of character generation is attractive in theory, being based on well-defined rules, hitherto proposed systems based on this method have been unsatisfactory.
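The first-and-last-stroke selection described above can be sketched as a simple index lookup. The following Python fragment is only an illustration under stated assumptions: the four stroke classes (`h`, `s`, `p`, `n`), the nine-character sample table and the names `STROKES`, `build_index` and `stroke_candidates` are hypothetical and are not taken from any of the prior art systems.

```python
from collections import defaultdict

# Hypothetical stroke classes, assumed for this sketch only:
# h = horizontal, s = vertical, p = left-falling, n = right-falling.
# Each entry lists the strokes of a character in writing order.
STROKES = {
    "十": "hs",
    "人": "pn",
    "大": "hpn",
    "木": "hspn",
    "天": "hhpn",
    "土": "hsh",
    "工": "hsh",
    "王": "hhsh",
    "干": "hhs",
}


def build_index(strokes):
    """Group characters by their (first stroke, last stroke) pair."""
    index = defaultdict(list)
    for char, sequence in strokes.items():
        index[(sequence[0], sequence[-1])].append(char)
    return index


INDEX = build_index(STROKES)


def stroke_candidates(first, last):
    """Return every character sharing the given first and last strokes."""
    return list(INDEX.get((first, last), []))


# The operator enters two strokes and then picks from the displayed group.
print(stroke_candidates("h", "n"))
```

Even in this tiny table several characters share each first-last pair, so a final visual selection is still needed, and the operator must mentally skip all the intermediate strokes, which is the difficulty noted above for characters having many strokes.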
In another method for selecting Chinese characters, the 412 phonetic units of spoken Chinese are represented by phonetic symbols. The Pin-Yin system, commonly used in the People's Republic of China, utilizes Latin letters to express Chinese sounds. By entering the phonetic sound in Latin code, a series of characters sharing the same phonetic structure is presented for final selection. One drawback of this method is that translation of the Chinese character into a secondary script is required prior to the final selection, and therefore exact knowledge of the translation codes and procedures is mandatory. Since there are many dialectic and cultural variations in expressions, it is not always possible easily to find the proper sequence of phonetic symbols needed to express the Chinese characters correctly. This problem is particularly manifest for the many half sounds in spoken Chinese which are difficult to express unambiguously in Latin code.
The method of selecting a character by specifying its constituent phonetic unit is considered by many to be the most attractive selection method, although the systems proposed so far have been unsatisfactory.
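Phonetic selection of this kind amounts to a lookup from a Latin-letter syllable to a list of homophonous characters. The Python sketch below assumes a tiny hand-written table; the names `HOMOPHONES` and `phonetic_candidates`, and the optional tone filter, are hypothetical illustrations rather than a description of any particular Pin-Yin input system.

```python
# Hypothetical sample table: syllable -> list of (character, tone number).
# Only two syllables are shown; a real table would cover all phonetic units.
HOMOPHONES = {
    "ma": [("妈", 1), ("麻", 2), ("马", 3), ("骂", 4)],
    "shi": [
        ("诗", 1), ("十", 2), ("时", 2), ("史", 3),
        ("是", 4), ("事", 4), ("市", 4),
    ],
}


def phonetic_candidates(syllable, tone=None):
    """Return characters matching a syllable, optionally narrowed by tone."""
    entries = HOMOPHONES.get(syllable.lower(), [])
    return [char for char, t in entries if tone is None or t == tone]


# Without a tone the operator must choose among all homophones;
# supplying the tone shortens the list but rarely makes it unique.
print(phonetic_candidates("shi"))
print(phonetic_candidates("shi", tone=4))
```

The lookup works only if the operator already knows the exact Latin spelling of the syllable, which is the dependence on translation codes identified above as a drawback.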
Hitherto proposed methods for selecting Chinese characters mostly require an indirect character representation in the form of numbers, components, strokes or phonetic codes which must themselves be specified rather than the character itself. This requires that the large number of indirect character representations be memorized exactly and, moreover, practical implementations of these methods are inflexible owing to the strictly defined codes which are entered by means of a keyboard containing a fixed set of keys. The keys are normally limited in number to that of the standard alpha-numeric (QWERTY) keyboard. In order to map this small set of keys onto the large number of Chinese script characters, several keys must be typed in order to generate the code corresponding to a single Chinese script character. For example, in the three-corner method, six numeric keys are used for each character, whilst in the Chan Jie and the Pin-Yin methods, the number of key strokes per character is approximately four or five.
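The need for several key strokes per character follows from simple counting: with a fixed-length code typed on k keys, n strokes can distinguish at most k to the power n characters. The Python sketch below computes this theoretical minimum; the function name `min_keystrokes` is hypothetical, and the bound is an illustration added here, not a figure stated for any of the methods discussed.

```python
def min_keystrokes(num_keys, num_chars):
    """Smallest fixed code length able to give every character a unique code."""
    strokes, capacity = 0, 1
    while capacity < num_chars:
        capacity *= num_keys   # each extra key stroke multiplies the code space
        strokes += 1
    return strokes


# Ten numeric keys and a pool of 10,000 characters.
print(min_keystrokes(10, 10_000))
# Twenty-six alphabetic keys and the same pool.
print(min_keystrokes(26, 10_000))
```

The practical methods use more strokes than this minimum (six digits in the three-corner method, four or five keys in the others) because their codes must follow the structure or sound of the character rather than being assigned arbitrarily.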
Furthermore, the same set of keys and codes are used both for very common characters, which are encountered frequently, and for those characters which are encountered very rarely. This results in an inefficient utilization of the keys, a problem which is exacerbated by the different character distributions of various professions and activities.
A major drawback associated with hitherto proposed Chinese word-processing systems relates not only to the method of character selection, but also to the method of character storage within the system itself. The nature of this problem will best be understood with reference to the standard method of storing conventional phonetic characters which, in the English language for example, are limited in number to 26 lower and upper case letters. As has already been explained, the number of Chinese script characters which must be stored in a usable word processing system, is of the order of several thousand, and this requires a memory several hundred times greater than that required to store conventional phonetic alphabets.
The requirement for a large memory is due not only to the very large number of Chinese script characters, but also to the display format of the characters themselves. Chinese script characters are detailed and complex and require larger display formats than alphabetic characters. Thus, whilst alphabetic characters can be displayed on grids of 5×7 or 5×10 pixels, Chinese script characters require a grid of at least 24×24 pixels to be clearly legible. High resolution display is achieved only on grids of 48×48 pixels. Conventional methods of character storage in computer memories utilize bit-maps wherein each pixel is represented by a bit. The number of bits required to store a Chinese script character in bit-map formation is thus approximately 16 times greater than the number of bits required to store an alphabetic character (i.e., 576 compared with 35). Conventionally, many computerized Chinese word processing devices contain over 10,000 characters in their memory. But even a storage of only 5,000 common Chinese script characters in 24×24 bit-map formation requires 2,880,000 bits of memory, whilst the full Latin alphabet can be stored in less than 1,000 bits. If the grid be increased to 48×48 in order to achieve high resolution, a total of 12 megabits (or nearly 1.5 MBytes) will be required.
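The storage figures quoted above can be reproduced with a few lines of arithmetic. The Python sketch below assumes one bit per pixel, as stated for conventional bit-maps; the helper name `bitmap_bits` is hypothetical, and the Latin alphabet figure assumes 26 glyphs in a single case on the smallest grid.

```python
def bitmap_bits(grid_width, grid_height, num_chars=1):
    """Bits needed to store num_chars glyphs at one bit per pixel."""
    return grid_width * grid_height * num_chars


latin_glyph = bitmap_bits(5, 7)              # one alphabetic character
chinese_glyph = bitmap_bits(24, 24)          # one Chinese script character
ratio = chinese_glyph / latin_glyph          # roughly 16 times larger

latin_alphabet = bitmap_bits(5, 7, 26)       # assumes 26 glyphs, one case
set_24 = bitmap_bits(24, 24, 5_000)          # 5,000 characters, legible grid
set_48 = bitmap_bits(48, 48, 5_000)          # 5,000 characters, high resolution

print(latin_glyph, chinese_glyph, round(ratio, 1))
print(latin_alphabet, set_24, set_48)
print(set_48 / 8 / 1_000_000, "MBytes")
```

Doubling the grid dimension quadruples the storage, so the 48×48 set needs exactly 11,520,000 bits (1.44 MBytes), which corresponds to the rounded figures of 12 megabits and nearly 1.5 MBytes given above.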
The resolution with which Chinese script characters may be displayed affects not only the memory requirements but also effectively limits the number of characters which may be displayed simultaneously. This is an important consideration for those word processing systems which employ methods of character selection based on the visual display of a predetermined sub-set of characters in memory, such as in the recognition and matching method, and in those methods wherein characters are selected according to their strokes or components.
Thus, a method for storing Chinese script characters at high resolution in a relatively compact format is not merely advantageous in terms of memory requirements but, more importantly, directly affects the effectiveness of the selection method itself. Such effectiveness is also dependent upon the use of a high-resolution display monitor.
A further drawback associated with hitherto proposed Chinese word-processing systems relates to the ease of copy typing. In an ideal system, a copy typist concentrates on the document to be copied, only periodically scanning the input text. In prior art Chinese word-processing systems, this approach is feasible only for those systems which employ mechanical keyboards with a limited number of keys for selecting Chinese script characters by entering a sequence of codes. For those systems which present information on a computer display monitor for selection by the operator, it is impossible to concentrate both on the input document and the display monitor simultaneously.