Various spelling systems or indexing systems have been attempted to represent ideographic symbols, such as Chinese, Japanese or Korean characters, Greek alphabets, mathematics symbols, and the like. For example, users of the Chinese language have been using the Pinyin system for more than 50 years and the Four Corner Numerical Index system for more than 70 years. The Pinyin system is a phonetic spelling system to both spell the sound and indicate the tone of each Chinese character. The Pinyin system can specify the pronunciation of every Chinese character. On the other hand, the Four Corner Numerical Index system has been used to index Chinese characters with one digit assigned to each of the four corners of the Chinese character based on the shape of the Chinese character. The rules for assigning a digit to each of the four corners are available in many Chinese dictionaries. A simple mnemonic song is also available in such dictionaries to help users to remember those rules.
Unlike the English language, where a unique relationship exists between each spelling and its corresponding word, ideographic symbols do not always correlate to a unique spelling, if such a spelling exists at all. For example, in the Chinese language, there is no unique relationship between a Pinyin spelling and a specific Chinese character. This is known as the homotone problem: many Chinese characters have exactly the same Pinyin spelling even after both the sound and the tone are specified. For example, the Pinyin spelling for the Chinese character “易” (meaning “easy”) is “yi4”, where “yi” represents the sound and the numeral “4” denotes the fourth tone. Among a set of 13,000 commonly used Chinese characters, there are 123 other Chinese characters with different meanings that are all spelled exactly as “yi4”. About 98.7% of Chinese characters have the homotone problem under the Pinyin system. Similarly, the Four Corner Numerical Index system is unable to specify each Chinese character uniquely. For example, among the 13,000 commonly used Chinese characters, there are 73 different Chinese characters with the same Four Corner Numerical Index of “4422”. About 91.4% of Chinese characters have the non-uniqueness problem under the Four Corner Numerical Index system. Such non-unique relationships can lead to many serious problems when using the Chinese language in computers or e-mails, as described below.
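By way of illustration only, the homotone problem can be sketched in a few lines of Python. The table below is a small hand-picked sample and is an assumption made for this example; it is not the 13,000-character set referred to above, and the helper names `split_pinyin` and `group_by_spelling` are hypothetical.

```python
from collections import defaultdict

# Illustrative sample only (an assumption): a handful of characters
# and their numbered-tone Pinyin spellings.
SAMPLE_PINYIN = {
    "易": "yi4", "意": "yi4", "义": "yi4", "亿": "yi4", "艺": "yi4",
    "一": "yi1", "人": "ren2",
}

def split_pinyin(code):
    """Split a spelling such as 'yi4' into its sound and its tone number."""
    return code[:-1], int(code[-1])

def group_by_spelling(table):
    """Map each Pinyin spelling to every character that shares it."""
    groups = defaultdict(list)
    for char, code in table.items():
        groups[code].append(char)
    return dict(groups)

groups = group_by_spelling(SAMPLE_PINYIN)
# A spelling is ambiguous when more than one character maps to it.
homotones = {code: chars for code, chars in groups.items() if len(chars) > 1}

print(split_pinyin("yi4"))
print(homotones)
```

Even in this tiny sample, the spelling “yi4” alone cannot identify which of five characters is meant, which is the non-uniqueness described above.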
When inputting Chinese characters into a computer using the Pinyin spelling method, a user may encounter the following problems. To input a single Chinese character, the user must (1) stop at 98.7% of Chinese characters, (2) scroll through and stare very hard at several lists of many homotones on the monitor screen, and (3) select the particular character that the user wants. In the worst case, to input a Chinese character spelled as “yi4”, the user has to examine each of the 124 homotones displayed on the screen to find and select the desired one before moving on to enter the next character. The user must stare very hard at these lists of homotones to pick the desired character because many Chinese characters are very complicated, packing a large number (e.g., more than 50) of strokes inside a tiny space on the screen. This is obviously a very slow and painful process for inputting Chinese-language information into computers.
Strong demand and a large market have led many software companies to develop and sell more than 60 different methods and techniques designed to speed up this slow and painful process of Chinese character input. The input speed achieved with these special and tricky methods is proportional to the amount of effort and special training spent memorizing many illogical rules. Learning and remembering such methods is obviously a heavy burden on users.
Moreover, neither the Pinyin code nor the Four Corner Numerical Index alone is adequate to represent a Chinese character in computer usage because computer processing requires a unique relationship between the code and the word or character represented. This deficiency has forced the existing Chinese-language computer interface systems to encode many thousands (e.g., 13,000) of Chinese characters directly. The direct encoding system makes it difficult to manage Chinese-language information in computers because these many thousands of Chinese characters do not have any logical order. Information management functions for Chinese-language information, such as indexing, sorting, listing, organizing, searching and retrieving, have been difficult and inefficient both inside and outside the computer. For example, if the user sorts the names of the provinces in China by the current GB internal code, the sorting result does not place the province names in a logical order.
For example, more than 90% of Chinese-language books have no index to help readers find information in the book quickly. Some Chinese dictionaries and libraries provide an index system using (a) the number of strokes of Chinese characters followed by (b) the radicals (i.e., the building blocks or roots) of Chinese characters. However, the number of strokes in a complicated Chinese character can be more than 50, and there are 217 radicals of Chinese characters. There is often a very large number (e.g., more than 400) of Chinese characters with the same number of strokes. Such large groups of Chinese characters have to be further divided into smaller groups according to the 217 radicals. The logical sequence of these 217 radicals is nearly impossible for users to remember, so the index is very cumbersome and inefficient in practical use. Furthermore, for many complicated Chinese characters with 10 strokes or more, the number of strokes in each character is not easy to count. It is therefore burdensome for the user to figure out the correct number of strokes in such a complicated character. Chinese-language users have been struggling with the existing poor and inefficient index systems for many years.
English-language computer interface systems use the 26 English alphabetic letters, which are encoded by the 7-bit ASCII (American Standard Code for Information Interchange) code. The 128 possible combinations in the 7-bit ASCII code can accommodate encoding of all 26 upper case and 26 lower case English alphabetic letters, the 10 Arabic numerals, the commonly used punctuation marks and the necessary control characters. In the English-language computer encoding system, one overhead bit is added to the 7-bit ASCII encoded English information content to form an 8-bit byte. The leading bit in the 8-bit byte is set to the value of “0” to signal to computers that this 8-bit byte represents an alphanumeric character in the remaining 7 bits.
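The role of the leading bit can be checked directly. The following is a minimal sketch using Python's built-in `ascii` codec; the sample string is an arbitrary choice made for this example.

```python
# Encode a short English string as ASCII, one 8-bit byte per character.
text = "Hello, 123!"
data = text.encode("ascii")

# The leading (most significant) bit of each byte is bit 7.
leading_bits = [b >> 7 for b in data]

# Every ASCII byte fits in 7 bits, so the leading bit is always 0.
all_clear = all(bit == 0 for bit in leading_bits)

# Show the first two bytes as 8-bit binary strings for inspection.
print([format(b, "08b") for b in data[:2]])
print(all_clear)
```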
In contrast, the 128 possible combinations of the 7-bit ASCII code are not enough to encode many thousands (e.g., 13,000) of Chinese characters plus the necessary control characters. Therefore, the existing Chinese computer encoding systems use a 2-byte, 16-bit encoding system to provide enough coding space to encode all the Chinese characters. The leading bit of the first byte of a 2-byte pair is set to the value of “1” to tell computers that each such pair of two consecutive 8-bit bytes represents a single Chinese character. Consequently, the leading bit of the second 8-bit byte in each pair is no longer an overhead bit but is a significant bit carrying Chinese-language information. The different encoding systems for the English language and the Chinese language can cause various problems, as described below.
Most e-mail systems were originally designed for the 1-byte encoded English language, and many (but not all) e-mail systems strip off the leading overhead bit of each 8-bit byte in their various e-mail processing functions. Stripping off the leading overhead bit is acceptable for English-language e-mails because the real information content is in the remaining 7 bits. However, stripping off the leading bit of each 8-bit byte in Chinese-language e-mails causes the following two levels of fatal destruction of the Chinese-language information content in such e-mails: (1) each pair of 8-bit bytes representing a single Chinese character is cut into two halves, and the e-mail system misinterprets each half as an English alphabetic letter, and (2) the leading bit of the second byte in each pair, which carries Chinese-language information, is stripped off and thrown away by the e-mail system. The e-mail system presents a question mark on the recipient's computer screen for each destroyed Chinese character. Consequently, the entire Chinese-language e-mail becomes meaningless (e.g., all question marks instead of Chinese characters) for the recipient. The recipient will not be able to recover or reconstruct the Chinese-language information content because these two levels of destruction are fatal.
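The effect of stripping the leading bit can be reproduced in a short sketch. It assumes the GB2312 encoding via Python's `gb2312` codec and models the e-mail system's behavior with a simple bit mask; it is not taken from any particular e-mail system.

```python
# One Chinese character encoded as a 2-byte pair (GB2312 is an assumption).
original = "易".encode("gb2312")

# Model a 7-bit mail path: clear the leading bit of every byte.
stripped = bytes(b & 0x7F for b in original)

# The result is valid ASCII, so the pair is now read as two separate
# one-byte characters instead of one Chinese character.
as_seen_by_mailer = stripped.decode("ascii")

# The stripped bytes are identical to genuine ASCII text with the same
# characters, so nothing marks them as a damaged Chinese character.
genuine_ascii = as_seen_by_mailer.encode("ascii")

print(original.hex(), stripped.hex(), repr(as_seen_by_mailer))
```

Because the stripped bytes cannot be told apart from ordinary English text, the receiving side has no reliable way to tell which bytes once formed Chinese characters.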
The 2-byte 16-bit encoding problem exists despite the effort of upgrading various computers and Internet processors to the new international Unicode standard with 16-bit 2-byte encoding. Theoretically this is a simple upgrade operation. But practically the upgrade is not easy to complete because of the large number of computers, servers and processors used. Software engineers must search to find all 1-byte operations in large and complex software systems originally developed for 1-byte English-language operation. In a large and complex e-mail system having many different functions and branches where issues of 2-byte vs. 1-byte processing can be buried in many different places, it is not a trivial matter to find and upgrade all of the 1-byte operations. Thus, even in some supposedly upgraded e-mail systems, Chinese-language information can still get clobbered and destroyed. For example, although a Chinese-language e-mail may appear acceptable upon receipt, it may become illegible when the recipient presses the “Reply” or “Forward” button. This is because some 1-byte operations are still hidden in the large and complex software system and are triggered by pressing the “Reply” or “Forward” button. Even though the 2-byte international Unicode standard has been established and used for quite a few years now, such destruction problems of Chinese-language information still persist today. Furthermore, although the newer 4-byte 32-bit encoding system is considered to be able to accommodate all major languages, all Internet processors and e-mail systems in many servers and computers will have to go through another round of very long transition from the yet uncompleted worldwide 2-byte systems to the newer 4-byte systems.
At present, several different and incompatible encoding systems are being used to encode Chinese characters. If the Chinese-language encoding system in the recipient's computer is different from that in the sender's computer, the Chinese characters in the received computer file often become blank square boxes, strange symbols (e.g., Greek alphabetic letters), or wrong Chinese characters that appear normal on the surface but whose real Chinese information content is unreadable. Although such incompatibility problems do not destroy the Chinese-language information, they are very disturbing to users and can greatly reduce the user's efficiency. Moreover, recovering the Chinese-language information requires advanced knowledge and skill in Chinese-language computer processing, as well as special procedures. For example, the user must change and cycle through many different sets of Chinese encoding systems in the computer to find the set that matches the encoding system used by the sender. Further, the special procedures vary depending on the application program being used, such as different e-mail systems (e.g., Microsoft Outlook, AOL, Yahoo, etc.), web browsers, Microsoft Word, PowerPoint, Excel, etc. It is nearly impossible to learn all the skills needed to deal with the variations of special procedures for finding the correct match of the encoding system.
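The mismatch described above, and the trial-and-error search for the correct encoding, can be sketched as follows. The choice of GB2312 for the sender and the list of candidate encodings are assumptions made for illustration; the codec names are those of Python's standard library.

```python
# The sender encodes the text with GB2312 (an assumption for this example).
message = "中文信息"
sent = message.encode("gb2312")

# The recipient's system assumes a different encoding. Undecodable bytes
# are replaced rather than raising an error, as a display layer might do.
wrong = sent.decode("big5", errors="replace")

# Decoding with the sender's encoding recovers the text, because the
# bytes themselves were never damaged.
right = sent.decode("gb2312")

# Cycling through candidate encodings, as a user would do by hand.
candidates = ["big5", "shift_jis", "gb2312"]
for encoding in candidates:
    print(encoding, sent.decode(encoding, errors="replace"))
```

Unlike the bit-stripping case, the original bytes survive here, so the text is recoverable once the matching encoding is found.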
There are other problems in processing Chinese-language e-mails or computer files in English-language operating systems. For example, if a computer file name contains 2-byte encoded Chinese characters, an English-language operating system cannot process the file because the operating system does not recognize the file name and consequently cannot find the file. Special procedures are required to remove the Chinese characters from the file name before the file can be processed properly.
Moreover, many printer drivers are designed to process only 1-byte encoded English-language information in English-language operating systems. Such printers cannot process the 2-byte encoded Chinese characters but print them as blank squares. A Chinese software platform must be used on the English-language operating systems before the printers can print Chinese characters properly. Further, if a Chinese-language computer file contains tables or figures, the printed Chinese characters may not line up properly but appear in a chaotic fashion even when a Chinese software platform is used.
Moreover, some e-mail systems may convert the Chinese-language e-mail text improperly and display many pages of strings of computer internal codes that look like “&#65367;&#65353;&#65363;”. Advanced knowledge and special procedures are required to convert such computer internal codes to meaningful Chinese texts.
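Strings of this form are decimal numeric character references, in which each number is a character code point. A minimal sketch of the conversion, using `html.unescape` from Python's standard library, is given below; the second input string is an additional example chosen for illustration and does not come from any particular e-mail.

```python
import html

# The string quoted above, as it would appear in the raw e-mail text.
raw = "&#65367;&#65353;&#65363;"

# html.unescape replaces each numeric reference with its character.
decoded = html.unescape(raw)
code_points = [ord(ch) for ch in decoded]

# A reference to a Chinese character converts the same way
# (26131 is the decimal form of code point U+6613).
chinese = html.unescape("&#26131;")

print(decoded, code_points, chinese)
```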
Additionally, if an English-language operating system is not equipped with Chinese-language support package software, the user must go through a special procedure to download the Chinese-language support package software from the relevant website or install it from a suitable CD. Otherwise, the user will not be able to use the English-language operating system to process Chinese-language e-mails or files or to surf Chinese-language websites.
The above problems have caused various inconveniences for users of the Chinese language for many years. Much work has been done in an attempt to solve these problems, but no satisfactory solution has been found that is easy to use, readily available, and accepted by the majority of users of the Chinese language.
The present invention can overcome the above problems. The present invention provides a spelling system for various ideographic symbols. Moreover, the present invention provides a spelling system capable of uniquely spelling various ideographic symbols and a method for managing information represented by the ideographic symbols. Further, the present invention provides a spelling system capable of uniquely identifying an ideographic symbol. Furthermore, the present invention provides an encoding system for encoding various alphanumerical representations of ideographic symbols.