Sino-Tibetan languages, such as Chinese, are vastly different from Latin-based languages such as English. The Chinese language does not use an alphabet. Instead, it comprises more than 60,000 individual characters, each with a different meaning. Knowledge of about 1,200 characters is sufficient to read a Chinese newspaper. Chinese college graduates know about 3,000 characters.
Chinese also differs from Latin-based languages in the concept of a word. In Chinese, strings of characters do not contain spaces, and where one word ends and another starts is determined entirely by context. Chinese characters are very precise in meaning, pronunciation, and the way they are written. If a Chinese character has characters added to it in a string, the meaning of the first character is enhanced, but normally it is not changed.
Chinese characters are always pronounced as a single syllable. There are no two-syllable Chinese characters. Each Chinese character has one of five fundamental sounds. These five fundamental sounds give a singing quality to Chinese because some characters are pronounced with high tones, some with low tones, and some with tones that are rising or falling. Tone is fundamental to the language, and Chinese would not be readily understood without the tones. For example, the character “ma” can mean “mother,” “horse,” or a “question” depending on the tone.
In China many dialects are spoken. Spoken words are almost unintelligible from one dialect to the next. However, there is only one written Chinese, which is understood by speakers of all dialects. Other Sino-Tibetan languages such as Japanese, Korean, and Vietnamese use several characters common to Chinese. However, these languages have no common written or spoken meaning, similar to the manner in which English, Spanish, and French use a common alphabet but are not otherwise interchangeable.
Following the Chinese Communist revolution in 1949, the Communist party made several changes to the Chinese language. First, the traditional method of writing Chinese from “top to bottom” and “right to left” was abandoned. The People's Republic of China (PRC or mainland China) now follows Western languages, and Chinese there is written from “left to right” and then “top to bottom.” Second, a single dialect, Mandarin, was chosen and is now taught in all schools as the primary Chinese language. Third, the PRC altered about one quarter of the characters to reduce them to around seven lines or strokes. This form of Chinese is called “Simplified Chinese.” In the PRC, Simplified Chinese is now widely used, but the Republic of China (ROC or Taiwan) and Hong Kong still use the more elaborate form of Chinese called “Traditional Chinese.” The PRC also adopted the Hindu-Arabic numbering system used by most Western countries. In addition, the advent of the Internet is causing English to appear in many Chinese sentences.
The PRC also introduced “Pin Yin,” a phonetic version of Chinese to help young children learn the language. Pin Yin uses the 26 letters of the English alphabet plus 4 accents over certain vowels to indicate how the character should be pronounced. Pin Yin is normally used from about 4 years of age until around 7 years of age, when the students are taught to use Chinese characters. Pin Yin is also very helpful for tourists and businessmen who speak Chinese from phrase books. Additionally, Pin Yin is popular with computer users as it is the easiest way to enter Chinese characters from a keyboard.
Inside a computer, all Sino-Tibetan languages are represented by 16-bit characters, while English and the other Latin languages are normally represented by 8-bit characters. Traditionally, separate encodings were produced for each of the languages. English uses a 7-bit encoding called ASCII. ASCII is included as the first seven bits of all the other encodings. European languages normally use 8-bit encodings and make use of the eighth bit for their special characters. Simplified Chinese uses GB2312 encoding and Traditional Chinese uses Big 5 encoding. A computer using Big 5 encoding cannot read text encoded in GB2312. This multiplicity of encodings is confusing, and there is no standardization between the different encodings. The Unicode consortium has developed a single encoding that incorporates all the major languages of the world. There is a strong movement to use Unicode and replace all the other encodings in computer applications. Unicode uses 16 bits for each character inside the computer. These 16 bits provide about 65,000 different character codes, and each of the major languages is mapped into a different section of this Unicode range. Consequently, Unicode can be used as a single encoding scheme for all of the world's languages.
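The incompatibility between GB2312 and Big 5 described above can be illustrated with a minimal Python sketch. It relies on the `gb2312` and `big5` codecs in Python's standard library; the sample character and the variable names are chosen here for illustration only and do not come from the original text.

```python
# A character that is written the same way in Simplified and Traditional Chinese.
char = "中"

gb_bytes = char.encode("gb2312")    # bytes as stored under the Simplified Chinese encoding
big5_bytes = char.encode("big5")    # bytes as stored under the Traditional Chinese encoding

# Reading GB2312 bytes as if they were Big 5 yields a different character
# (or a replacement marker), which is the incompatibility described above.
misread = gb_bytes.decode("big5", errors="replace")

# Unicode assigns the character one code point, independent of the legacy encodings.
code_point = ord(char)

print("GB2312 bytes:", gb_bytes.hex())
print("Big 5 bytes: ", big5_bytes.hex())
print("GB2312 read as Big 5:", repr(misread))
print("Unicode code point:", f"U+{code_point:04X}")
```

Because both legacy byte sequences decode to the same Unicode code point, a program that works in Unicode internally can hold text from either source side by side.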
Chinese characters are encoded entries which can be displayed in different font sizes. In other words, a computer may display the Chinese characters in different sizes similar to the method by which a computer displays English characters and words in different font sizes using ASCII. Changing the font size is very beneficial to students studying Chinese because the students may see the Chinese characters in greater detail.
Individual characters, letters, or symbols can be represented using different schemes within Unicode. Two of the most popular encoding schemes are UTF-8 and UCS-2. UTF-8 is a byte-based Unicode encoding scheme which represents each character, letter, or symbol as one, two, or three bytes, each byte being eight bits. In contrast, UCS-2 is a 16-bit encoding scheme which represents each character, letter, or symbol as 16 bits or four hexadecimal digits. One hexadecimal digit is equivalent to 4 bits, and 1 byte can be expressed by two hexadecimal digits. Table 1 below displays the difference between UTF-8 and UCS-2.
TABLE 1
UCS-2 (Hexadecimal)    UTF-8 (Binary)                 Description
0000 to 007F           0xxxxxxx                       ASCII
0080 to 07FF           110xxxxx 10xxxxxx              Up to U+07FF
0800 to FFFF           1110xxxx 10xxxxxx 10xxxxxx     Other UCS-2
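The bit patterns in Table 1 can be applied by hand. The following Python sketch is one possible way to do so; the function name `utf8_encode_bmp` is hypothetical, and the sketch deliberately handles only 16-bit code points and does not reject the surrogate range, which a production encoder would.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a 16-bit code point (0x0000 to 0xFFFF) using the Table 1 patterns."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("this sketch handles only 16-bit code points")
    if code_point <= 0x007F:
        # 0xxxxxxx: the ASCII range is stored unchanged in one byte.
        return bytes([code_point])
    if code_point <= 0x07FF:
        # 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits.
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6 bits, low 6 bits.
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# One sample character from each row of Table 1.
for ch in ("A", "é", "中"):
    encoded = utf8_encode_bmp(ord(ch))
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X} -> {bits}")
```

For the sample characters, the result can be checked against Python's built-in `str.encode("utf-8")`, which produces the same bytes.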
A user may choose to encode using the UCS-2 scheme or the UTF-8 scheme depending on the user's expected needs. For example, when transmitting data from one location to another, or when storing data in a database, UTF-8 is the preferred encoding scheme due to the transmission efficiency and the storage efficiency inherent in its variable byte stream length (i.e., one to three bytes, as shown in Table 1). However, when holding the same information in memory, UCS-2 is the preferred encoding scheme. Conversion functions between UCS-2 and UTF-8 are available, as evidenced by United States Patent Application Publication 2003/0078921 entitled “Table-Level Unicode Handling in a Database Engine,” incorporated herein by reference.
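A conversion between the two schemes can be sketched in Python as follows. This is not the method of the cited publication; it is an illustration that assumes Python's `utf-16-be` codec as a stand-in for UCS-2, which is equivalent only for characters that fit in 16 bits. The function names and the sample string are hypothetical.

```python
def ucs2_to_utf8(data: bytes) -> bytes:
    """Convert big-endian UCS-2 bytes to UTF-8 (16-bit characters only)."""
    return data.decode("utf-16-be").encode("utf-8")


def utf8_to_ucs2(data: bytes) -> bytes:
    """Convert UTF-8 bytes to big-endian UCS-2 (16-bit characters only)."""
    return data.decode("utf-8").encode("utf-16-be")


# Mixed text: accented Pin Yin, a Simplified character, a Traditional character.
sample = "guó 国 國"

in_memory = sample.encode("utf-16-be")   # fixed two bytes per character
for_storage = ucs2_to_utf8(in_memory)    # one to three bytes per character

print("UCS-2 bytes:", len(in_memory))
print("UTF-8 bytes:", len(for_storage))
print("Round trip intact:", utf8_to_ucs2(for_storage) == in_memory)
```

The size advantage of UTF-8 depends on the mix of characters: ASCII letters shrink from two bytes to one, while each Chinese character grows from two bytes to three.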
Prior to the development of Unicode, a computerized character translator between Simplified Chinese and Traditional Chinese within the same encoding was impossible because of the inability of GB2312 code to understand Big 5 code, and vice-versa. If the user desired a computer-implemented translation, multiple encodings had to be used which did not permit simultaneous display of both forms of data.
Similarly, the prior art translation programs have been unable to display Pin Yin with the proper accents. Typically, these programs would use pictures in the form of GIF or JPEG images to represent the characters. The accented vowels indicate the proper tone and are essential to proper pronunciation of Pin Yin. One technique that uses only the ASCII characters is based on adding a number after the Pin Yin word to indicate the accent, as illustrated in Table 2.
TABLE 2
Number    Accent    Description                       Examples
1         ¯         Level Tone                        ā ē ī ō ū
2         ´         Rising Tone                       á é í ó ú
3         ˇ         Falling Tone, then Rising Tone    ǎ ě ǐ ǒ ǔ
4         `         Falling Tone                      à è ì ò ù
5         (None)    No Change in Tone                 a e i o u
Thus, the prior art would display the word guó as guo2, the word mā as ma1, and so forth. The prior art hybrid version of Pin Yin is difficult for the beginning reader to understand because the reader must make a cognitive leap between the number and the proper type and location of the accent. Therefore, a need exists for an automated method for translating between Simplified Chinese, Traditional Chinese, Pin Yin, and English. The need extends to a method for displaying the Pin Yin with the proper accent marks.
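The relationship between the numbered notation and the accented form in Table 2 can be sketched in Python. The sketch below is illustrative only: the function name `accent_syllable` is hypothetical, and the rule for choosing which vowel carries the accent (a or e if present, the o in “ou,” otherwise the last vowel) is a commonly taught convention assumed here, not something stated in the text above. The vowel ü is not handled.

```python
import re
import unicodedata

# Combining accents for tones 1 to 4 of Table 2; tone 5 carries no accent.
COMBINING_ACCENT = {1: "\u0304", 2: "\u0301", 3: "\u030C", 4: "\u0300"}
VOWELS = "aeiou"


def accent_syllable(numbered: str) -> str:
    """Turn a numbered syllable such as 'guo2' into its accented form."""
    match = re.fullmatch(r"([a-z]+)([1-5])", numbered)
    if match is None:
        return numbered  # not in numbered form; leave unchanged
    letters, tone = match.group(1), int(match.group(2))
    if tone == 5:
        return letters

    # Assumed placement convention: 'a' or 'e' first, then the 'o' of 'ou',
    # otherwise the last vowel in the syllable.
    if "a" in letters:
        index = letters.index("a")
    elif "e" in letters:
        index = letters.index("e")
    elif "ou" in letters:
        index = letters.index("o")
    else:
        positions = [i for i, c in enumerate(letters) if c in VOWELS]
        if not positions:
            return letters
        index = positions[-1]

    # Compose the vowel and the accent into a single Unicode character.
    accented = unicodedata.normalize("NFC", letters[index] + COMBINING_ACCENT[tone])
    return letters[:index] + accented + letters[index + 1:]


for word in ("guo2", "ma1", "hao3", "ma5"):
    print(word, "->", accent_syllable(word))
```

Because the accented vowels are ordinary Unicode characters, the result can be displayed as text in any font that covers them, without resorting to images.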
Moreover, a need exists for assisting students with learning Chinese vocabulary. Chinese textbooks typically contain a plurality of chapters covering different subjects. Each subject presents twenty to thirty Chinese vocabulary words which are related to the subject. The student first uses the vocabulary words by themselves, and then in conjunction with vocabulary words from previous chapters. Because of the encoding limitations, a computer-implemented process for assisting in the development of both Simplified and Traditional Chinese vocabulary has not previously been developed. Therefore, a need exists in the art for a computer-implemented method for helping a student learn Simplified Chinese, Traditional Chinese, accented Pin Yin, and English.