The present invention is directed toward a method and apparatus for data processing and word processing in the Chinese language, and more particularly toward the use of a defined Phonetic Chinese Language which avoids ambiguities resulting from homonyms and homotones.
Modern Chinese is primarily polysyllabic. Traditionally, each written Chinese word is made up of one or more ideograms, which are pictorial representations of a concept or thing. Each ideogram has a monosyllabic pronunciation. The use of monosyllabic words is insufficient, however, in the spoken language, since Chinese includes a large number of homonyms, i.e., words (ideograms in this case) that are written differently or have different meanings, but have the same sound. That is, a single Chinese spoken syllable can represent a large number of different ideograms and therefore a large number of different meanings. This makes it impractical to use monosyllabic words for oral communications.
To overcome this problem, an oral language has evolved which is primarily polysyllabic, wherein a plurality of ideograms are strung together to form a single polysyllabic word, which significantly narrows down the possible meanings of such word. As a result of the foregoing, oral Chinese is approximately 80% polysyllabic (75% bisyllabic). Modern written Chinese has followed the oral language, with the result that written Chinese uses many polysyllabic ideogram compounds.
Approximately 8,000 ideograms are used in the modern Chinese language. While the total number of ideograms is somewhat greater than 50,000, most are rarely used and do not occur in the everyday language. In 1981, the People's Republic of China set up a standard set of 6,763 ideograms which are to be used for telecommunications systems in China. As a result, a base of about 8,000 ideograms will handle most practical applications of Chinese language.
The use of ideograms enjoys a strong cultural bias in China and serves as a unifying force within the nation. For this reason, it is necessary that any word processing or data processing system be capable of generating Chinese ideograms as an output. The use of ideograms as a direct input medium is, however, impractical because of the large number of ideograms (about 8,000) that would be required on a keyboard. Also, since ideograms are not alphabetical, the task of processing and ordering ideograms is difficult and cumbersome. While it is important for data and word processing systems to output ideograms, and while such an output is sufficient for word processing purposes, it is insufficient for data processing purposes. Since ideograms cannot be alphabetized, it is impossible to place the ideogram output of any data processing system into alphabetical form. This hinders the creation of efficient dictionaries, telephone directories, personnel directories and other sorted or alphabetical listings. Thus, there is a need for a non-ideographic representation of Chinese that can be sorted, listed alphabetically, and so forth.
In an effort to overcome the foregoing problems, the Chinese government has developed an alphabetic representation of the Chinese ideographic language. This language, known as Hanyu Pinyin, is representative of the pronunciation of Mandarin (Peking Dialect). The Peking Dialect has about 400 distinct monosyllabic sounds. Pinyin relies on 25 letters of the English alphabet (v is not used) to phonetically represent all 406 of these sounds. Pinyin is successful in achieving this result on a purely phonetic basis. There are 21 consonant sounds and 16 vowel sounds (the sounds "i", "u" and "ü" may be added to the other vowel sounds to achieve an additional 18 compound vowel sounds) in the Chinese language. Each of these sounds can be uniquely represented by a combination of one or more Pinyin letters. Thus, systems employing Pinyin for both input and output have led to improvements in word processing efficiency and convenience.
However, for generating ideogram output, a primary drawback of this system stems from the need to differentiate the large number of homonyms in the Chinese ideographic language. Assuming a base dictionary of some 8,000 ideograms, every Chinese syllable (corresponding to a single ideogram) has an average of 20 homonyms (since there are about 400 distinct sound syllables in Chinese), with the result that, on average, one Pinyin syllable identifies 20 different ideograms. In some cases, the number of homonyms for a given sound exceeds 150.
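The average quoted above follows from a simple division. A minimal sketch in Python, using only the approximate figures stated in the text (the constant and function names are illustrative assumptions):

```python
# Approximate figures taken from the text above.
IDEOGRAMS = 8000        # size of the assumed base dictionary
SOUND_SYLLABLES = 400   # distinct toneless sound syllables (approximate)

def average_homonyms(ideograms: int, syllables: int) -> float:
    """Average number of ideograms that share one toneless Pinyin syllable."""
    return ideograms / syllables

avg = average_homonyms(IDEOGRAMS, SOUND_SYLLABLES)
print(avg)  # 20.0
```

This is an average only; as noted, individual syllables can have far more homonyms.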
Since the Chinese language is about 80 percent polysyllabic, and since only a limited number of combinations of ideograms are employed to form polysyllabic words, this problem can partially be overcome in computer applications by storing a polysyllabic Pinyin dictionary in computer memory. When a polysyllabic Pinyin word is entered, a limited number of possible corresponding combinations of ideograms are identified, and often a single combination of ideograms can be uniquely identified by the polysyllabic word. However, the use of a polysyllabic dictionary requires a substantially larger storage capacity than if a purely monosyllabic (ideogram) dictionary were utilized and also significantly increases the processing time of converting from the Pinyin input to the ideograph output. Even with the storage of a large polysyllabic dictionary, the predominance of homonyms in Chinese (approximately 40% of bisyllabic words have homonyms) prevents unique and unambiguous mapping between Pinyin and ideograms.
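The effect of a polysyllabic dictionary can be sketched as follows. This is an illustration only: the two small dictionaries are toy data chosen for the example, not the contents of any actual system, and the function names are hypothetical.

```python
# Toy data for illustration; a real dictionary would hold thousands of entries.
MONO = {  # toneless Pinyin syllable -> ideograms pronounced with that sound
    "shi": ["是", "事", "十", "时", "实"],
    "jian": ["间", "件", "见", "践"],
}
POLY = {  # toneless polysyllabic Pinyin word -> ideogram compounds
    "shijian": ["时间", "事件", "实践"],  # time, event, practice
    "pengyou": ["朋友"],                  # friend
}

def candidates(pinyin_word):
    """Look up the ideogram compounds stored for a polysyllabic Pinyin word."""
    return POLY.get(pinyin_word, [])

def monosyllabic_combinations(syllables):
    """Count the ideogram sequences possible if each syllable is resolved alone."""
    count = 1
    for syllable in syllables:
        count *= len(MONO[syllable])
    return count

print(monosyllabic_combinations(["shi", "jian"]))  # 20
print(candidates("shijian"))   # ['时间', '事件', '实践']
print(candidates("pengyou"))   # ['朋友']
```

In this toy data the polysyllabic lookup narrows 20 syllable-by-syllable combinations to three compounds for "shijian", yet three candidates remain, which illustrates the residual homonym problem described above.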
Since many ideographic words have the same pronunciation, and hence are mapped into a given phonetic Pinyin word, written Pinyin also has a large number of homonyms. Systems utilizing Pinyin as an input language generally require special forms of spelling, or require that a character be added at the end of a bisyllabic word to distinguish between homonyms. Other phonetic conversion systems require the operator to make manual selections from among a choice of displayed homonyms of individual ideograms or compound words.
Pinyin has additional major drawbacks, since it disregards the most fundamental characteristic of the Chinese language: the tone. Pinyin specifies only distinct vowel or consonant sounds, i.e., phonemes. Every Chinese syllable also has a tone, i.e., an inflection or pitch pattern. The tone can have any one of the four pitch patterns illustrated in FIG. 1. As shown therein, the four tones are the first tone (1) which starts high and stays high, the second tone (2) which starts at an intermediate level and rises high, the third tone (3) which starts at a medium level, dips low and then rises high, and the fourth tone (4) which starts high and dips low.
The combination of a sound syllable and the tone associated therewith will be referred to hereafter as a tone-syllable. Every ideogram of the Chinese language, and therefore every syllable of the Chinese language, is pronounced as a tone-syllable.
Therefore, a tone-based system would have major advantages. Providing sound information alone is not sufficient, because it does not provide the complete information required to properly pronounce an ideogram. Further, as explained above, a sound-based system must deal with the full set of homonyms for a given Chinese sound syllable, and can do so only unsatisfactorily, while a tone-based system need deal only with homotones (syllables which have the same tone as well as the same sound). By resolving at the homotone level, rather than the homonym level, the average number of ambiguities caused by more than one ideogram being represented by a given tone-syllable is reduced significantly. The reduction is about three-fold (only about three-fourths of the possible tone-syllables are used by the Chinese language).
Recognizing the problem of homonyms, some prior art publications have suggested that a meaning-indicating letter be added to each Pinyin syllable to identify the specific ideogram desired. Since there are 25 characters in the Pinyin alphabet, 26 different ideograms can be identified by adding one of the 25 characters (or by not adding any character) to the end of a given syllable. This system has not come into significant use, since in the proposed systems the added letters have had no rational connection to the particular ideogram to be represented, and it is difficult, if not impossible, to remember which specific letter corresponds to each specific ideogram.
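The three-fold figure can be checked from the approximate numbers given in the text. The sketch below is illustrative arithmetic only; the constant names are assumptions, and the inputs are the rounded values quoted above.

```python
# Approximate figures taken from the text.
IDEOGRAMS = 8000        # base dictionary size
SOUND_SYLLABLES = 400   # distinct toneless sound syllables
TONES = 4               # the four tones of FIG. 1
FRACTION_USED = 0.75    # share of possible tone-syllables actually used

possible_tone_syllables = SOUND_SYLLABLES * TONES               # 1600
used_tone_syllables = possible_tone_syllables * FRACTION_USED   # 1200.0

avg_homonyms = IDEOGRAMS / SOUND_SYLLABLES       # ideograms per sound syllable
avg_homotones = IDEOGRAMS / used_tone_syllables  # ideograms per tone-syllable
reduction = used_tone_syllables / SOUND_SYLLABLES

print(avg_homonyms)             # 20.0
print(round(avg_homotones, 1))  # 6.7
print(reduction)                # 3.0
```

Under these assumptions, resolving by tone-syllable leaves an average of roughly 7 candidate ideograms instead of 20.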
The deficiencies of a sound-based language were recognized in 1928 by Y. R. Chao, who proposed a phonetic system using the Roman alphabet. This system used a tone-indicating letter which was inserted in each sound syllable to indicate the tone of the syllable. The primary problem with this system is that the extraneous tone-indicating letter prevents the establishment of a meaningful alphabetical listing of the resulting words. It is also much more difficult to read, and does not permit a unique identification between its phonetic words and individual ideograms.
Summarizing the foregoing, Pinyin is deficient in two major respects: (1) it does not take tone into consideration, and (2) it cannot distinguish between homonyms. While modifying Pinyin or other prior art systems to include tone and meaning-indicating letters would alleviate these problems to some degree, this would create problems of its own since it would destroy the alphabetical nature of the language and make it very difficult to create a proper dictionary or other sorted listing. Yet another problem with the modifications to Chinese proposed by the prior art is that the number of letters required to identify a particular ideogram would be significantly increased, thereby reducing the readability of the language and making it very difficult to learn.
In any practical alphabetical system, each Chinese word (consisting of one or more ideograms) must be typed as a single string of letters. Words are separated by spaces. In the prior art systems, there is no method for dividing single polysyllabic words into their individual components, with the result that a polysyllabic dictionary must be stored, thereby increasing the memory requirements and processing time of the data processing or word processing system. Even if means were provided for separating the polysyllabic words into their individual component syllables, the prior art alphabetical systems do not achieve a one-to-one correspondence between the phonetic representations of ideograms and the respective individual Chinese ideograms themselves. Thus, the alphabetical representation will often identify a plurality of ideograms which must further be distinguished manually by the operator of the system.
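The difficulty of dividing an unbroken letter string into syllables can be illustrated with a short sketch. It is an assumption-laden example, not a description of any prior art system: the syllable set is a toy subset of valid Pinyin syllables, the function name is hypothetical, and no separator marks are considered.

```python
# Toy subset of valid toneless Pinyin syllables, for illustration only.
SYLLABLES = {"xi", "an", "xian", "fan", "fang", "gan"}

def segmentations(word, syllables=SYLLABLES):
    """Return every way to split `word` into a sequence of known syllables."""
    if not word:
        return [[]]  # one way to split the empty string: no syllables
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in syllables:
            for tail in segmentations(word[end:], syllables):
                results.append([head] + tail)
    return results

print(segmentations("xian"))    # [['xi', 'an'], ['xian']]
print(segmentations("fangan"))  # [['fan', 'gan'], ['fang', 'an']]
```

Each of these strings admits more than one syllable division, so the letter string alone does not determine the component syllables.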