The present invention relates to character conversion methods, and in particular to computer-implemented character conversion methods and systems for multilingual data.
Conventionally, different regions have their own regional character encoding standards. For example, Big-5 is adopted as a character encoding standard in Taiwan, GB-2312 in China, and Shift_JIS in Japan. With the internationalization of commercial activities, an international enterprise may establish websites in different regions adopting different character encoding standards.
Due to the increase in data communication among different regions, character encoding globalization has become an inevitable trend. For example, Unicode provides an international encoding standard by assigning a unique number to every character. Unicode, officially called the Unicode Worldwide Character Standard, is a system for the interchange, processing, and display of written data in diverse languages. Unicode also supports classical and historical written data in multiple languages.
When source data encoded in a Unicode encoding, such as UTF-8, is transmitted to a destination medium, e.g. a database, a file, or an application program, that uses a regional character set, such as Big-5, data conversion problems occur.
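As a minimal sketch of why conversion is needed, the following Python fragment encodes the same text in UTF-8 and in Big-5 using Python's built-in codecs. The sample text and variable names are illustrative assumptions and are not part of the systems described here.

```python
# Illustrative only: the sample text is an assumption chosen for the demo.
text = "中文"

utf8_bytes = text.encode("utf-8")  # bytes as held by a UTF-8 source
big5_bytes = text.encode("big5")   # bytes as expected by a Big-5 destination

# The same text is represented by different byte sequences of different lengths.
print(len(utf8_bytes), len(big5_bytes))

# A Big-5 destination that reads the UTF-8 bytes without conversion
# does not recover the original text.
misread = utf8_bytes.decode("big5", errors="replace")
print(misread == text)
```

Because the byte sequences differ, the destination cannot use the source bytes as they are; some conversion step has to sit between the two sides.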
FIG. 1 shows an embodiment of a character conversion system for different character sets. A source database 100 stores data encoded in UTF-8 and a destination database 102 stores data encoded in Big-5. If data 106 encoded in UTF-8 is transmitted from the source database 100 to the destination database 102, a converter 112 must be established in the destination database 102 for character conversion. If an application server 104 is provided on the destination side, a converter 114 must also be established in the application server 104 for character conversion. Only then can the data 106 from the source database 100 be stored correctly in the destination database 102 or displayed correctly by the application server 104.
One problem with the character conversion system in FIG. 1 is that a character converter must be established in every component on the destination side, even if only a small amount of data is to be converted.
FIG. 2 shows an embodiment of another character conversion system for different character sets. A source database 200 stores data encoded in UTF-8 and a destination database 202 stores data encoded in Big-5. An agent 210 is established for character conversion between UTF-8 and Big-5. Data 206 encoded in UTF-8 is transmitted from the source database 200, converted by the agent 210 to data 208 encoded in Big-5, and then stored in the destination database 202. If an application server 204 is provided on the destination side, it can use the converted data 208 or the data stored in the destination database 202 directly.
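The role of an agent such as the agent 210 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `convert`, its parameters, and the sample data are hypothetical, and Python's built-in codecs stand in for whatever conversion logic an actual agent would use.

```python
def convert(data: bytes, source: str = "utf-8", target: str = "big5") -> bytes:
    """Decode bytes from the source encoding and re-encode them for the target.

    Raises UnicodeEncodeError if a character has no mapping in the target set.
    """
    return data.decode(source).encode(target)


# Hypothetical stand-ins for data 206 (UTF-8) and data 208 (Big-5).
data_206 = "資料".encode("utf-8")
data_208 = convert(data_206)

# Components on the destination side can read the converted bytes directly.
print(data_208.decode("big5"))
```

In this sketch the conversion happens once, between the two sides, so neither the destination database nor the application server needs its own converter.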
One problem with the character conversion system in FIG. 2 is that a separate agent must be established for every pair of character sets to be converted, for example, UTF-8 to Big-5, UTF-8 to Shift_JIS, or UTF-8 to GB-2312, which consumes considerable resources. Another problem is that the UTF-8 and Big-5 character sets do not map one-to-one: some characters that can be encoded in UTF-8 have no counterpart in Big-5, so such characters may be lost when a conversion agent is used.
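The loss of characters can be illustrated with a short Python sketch that reports, for each regional character set named above, which characters of a text have no mapping. The helper name `unconvertible`, the `TARGETS` tuple, and the sample text are hypothetical, and Python's built-in codecs are assumed as stand-ins for the individual conversion agents.

```python
# Python codec names assumed to correspond to Big-5, Shift_JIS, and GB-2312.
TARGETS = ("big5", "shift_jis", "gb2312")


def unconvertible(text: str, target: str) -> list[str]:
    """Return the characters of text that cannot be encoded in target."""
    lost = []
    for ch in text:
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


# Sample text (an assumption): a Latin letter, a Han character, a Hangul syllable.
sample = "A中한"
for target in TARGETS:
    print(target, unconvertible(sample, target))

# A lenient agent that substitutes unmapped characters silently loses them.
lossy = sample.encode("big5", errors="replace").decode("big5")
print(lossy)
```

Each entry in `TARGETS` corresponds to one agent in the arrangement of FIG. 2, which illustrates both drawbacks at once: the number of agents grows with the number of target character sets, and each target may drop a different subset of characters.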