1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to Unicode character handling in relational database management systems.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
In RDBMS software all data is externally structured into tables. The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages, such as C, COBOL, etc. Operators are provided in SQL that allow the user to manipulate the data, wherein each operator operates on either one or two tables and produces a new table as a result. The power of SQL lies in its ability to link information from multiple tables or views together to perform complex sets of procedures with a single statement.
Inside a computer, text must be represented in digital form, and thus a character set encoding routine must be used to encode each text character with a unique digital representation. In the USA the American Standard Code for Information Interchange (ASCII) 7-bit sequence is used for encoding text characters. In Europe it is the International Organization for Standardization (ISO) standard that is followed. In Japan, the dominant character encoding standard is the Japanese Standards Association (JSA) standard. With the globalization of economies throughout the world, it became necessary to implement computer systems which support and handle multiple languages' characters and encode them to digital form.
The Unicode standard was developed to provide an international character encoding standard to support the internationalization of software. Unfortunately, compared with ASCII characters, Unicode characters take more storage space in a file system supporting a DBMS database engine. In large database systems, as a database grows, a lot more space will be consumed in the Unicode format. This problem is very noticeable in pervasive devices, such as handheld computers running Windows CE, where it is very important to save space due to memory restriction.
The Unicode character encoding standard is a fixed-length character encoding scheme that includes characters from almost all existing languages of the world. Unicode characters are usually shown as a character string "U+xxxx" where xxxx is the hexadecimal code of the character. Each Unicode character is 16 bits (2 bytes) long, regardless of the language of the character. The International Organization for Standardization (ISO) and International Electrotechnical Commission (IEC) 10646 standard (ISO/IEC 10646) specifies a 2-byte version of the Universal Multiple-Octet Coded Character Set (UCS-2), often used for Unicode encoding.
Also in use is the UTF-8 format (UCS Transformation Format 8), which is an algorithmic transformation that transforms fixed-length UCS-2 characters into variable-length byte strings. In UTF-8, ASCII characters are represented by their usual single-byte codes, but non-ASCII characters become two or three bytes long. Thus, UTF-8 transforms UCS-2 characters to a multi-byte codeset, for which ASCII is invariant.
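The one-, two-, and three-byte cases described above can be illustrated with a minimal Python sketch. The helper name encoded_lengths is hypothetical, and the sketch assumes that big-endian UTF-16 without a byte order mark stands in for UCS-2, which holds for characters in the Basic Multilingual Plane.

```python
# Illustrative sketch only; encoded_lengths is a hypothetical helper name.
def encoded_lengths(ch: str) -> tuple:
    """Return (UCS-2 length, UTF-8 length) in bytes for one BMP character."""
    # Assumption: for Basic Multilingual Plane characters, big-endian UTF-16
    # without a byte order mark produces the same two bytes as UCS-2.
    ucs2 = ch.encode("utf-16-be")
    utf8 = ch.encode("utf-8")
    return len(ucs2), len(utf8)


# An ASCII letter, an accented Latin letter, and a Chinese character.
for ch in ("A", "\u00e9", "\u4e2d"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The UCS-2 length stays at two bytes for every character, while the UTF-8 length grows from one byte to three as the code point increases.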
Many conventional data storage products, such as the product sold under the trademark IBM DB2 UDB, can take either UNICODE (UCS-2) or ASCII user input strings and save them in storage devices in the UTF-8 format. The reasons for this conversion are several. One reason is to save space, because for an English-only system, each UTF-8 character takes only one byte, where a UCS-2 character would take two bytes. Another reason is consistency. Saving all different input string formats into one common format, such as UTF-8, makes data comparison or "join" possible without additional data transformation. Moreover, some systems, like DB2 Everyplace, have difficulties in handling strings internally when the strings contain the value '0x00', which is common with UCS-2 strings. The UTF-8 strings do not contain the value '0x00'.
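The '0x00' issue can be seen in a short Python sketch. The function name contains_zero_byte is hypothetical, big-endian UTF-16 is assumed to stand in for UCS-2, and the sample text is assumed not to include the NUL character (U+0000) itself.

```python
# Illustrative sketch only; contains_zero_byte is a hypothetical helper name.
def contains_zero_byte(text: str, encoding: str) -> bool:
    """Report whether the encoded form of text includes a 0x00 byte."""
    return b"\x00" in text.encode(encoding)


sample = "DB2"  # ASCII-only text with no NUL character
# Assumption: "utf-16-be" matches UCS-2 for Basic Multilingual Plane text.
print(contains_zero_byte(sample, "utf-16-be"))
print(contains_zero_byte(sample, "utf-8"))
```

In the 2-byte form every ASCII character carries a zero high byte, so code that treats 0x00 as a string terminator would stop after the first character.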
However, the conventional systems have two main drawbacks. For Asian languages, such as Chinese or Japanese, most characters will take three bytes in the UTF-8 format and only two bytes in the UCS-2 format. Therefore, the UTF-8 format requires more storage space than the UCS-2 format, which is a fixed 2-byte format. Thus, it is obvious that in a Chinese-only system, the UCS-2 format should be the format of choice instead of the UTF-8 format. Moreover, since each Chinese character may take two or three bytes in the UTF-8 format, it is impossible for a designer/developer to predict how much space is needed to save a number of Chinese characters. This results in either not enough or too much allocated space.
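The space-prediction problem can be made concrete with a rough Python sketch. The helper names ucs2_size and utf8_bounds are hypothetical, and the sketch assumes all characters lie in the Basic Multilingual Plane, so each takes one to three bytes in UTF-8 and exactly two bytes in UCS-2.

```python
# Illustrative sketch only; both helper names are hypothetical.
def ucs2_size(n_chars: int) -> int:
    """Exact UCS-2 size in bytes: always two bytes per character."""
    return 2 * n_chars


def utf8_bounds(n_chars: int) -> tuple:
    """Smallest and largest UTF-8 size in bytes for n_chars BMP characters."""
    return n_chars, 3 * n_chars


# A Chinese-only string and a mixed ASCII/Chinese string.
for text in ("\u6570\u636e\u5e93", "DB2\u6570\u636e\u5e93"):
    n = len(text)
    actual_utf8 = len(text.encode("utf-8"))
    print(n, ucs2_size(n), utf8_bounds(n), actual_utf8)
```

For a column declared to hold n characters, the UCS-2 requirement is a single number, whereas the UTF-8 requirement is only known as a range until the actual data arrives.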
Therefore, there is a need to provide a method and a system that can improve Unicode character string storage usage and string length calculation in database management systems.
The foregoing and other objects, features, and advantages of the present invention will be apparent from the following detailed description of the preferred embodiments which makes reference to several drawing figures.
One preferred embodiment of the present invention includes a software method for efficient handling of multiple Unicode formats in the same database on a table level. The routines of the method are used to create a plurality of database tables and specify each table's data storage format, including a first table for storing data in a first Unicode format and a second table for storing data in a second Unicode format. The method inputs characters which are encoded in the first Unicode format. When the data should be stored in the second Unicode format, the method uses a conversion routine for transforming some inputted characters into the second Unicode format and stores them in the second table, and then stores unconverted inputted characters in the first table. The first Unicode format is preferably the UCS-2 format and the second Unicode format is the UTF-8 format.
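The table-level handling summarized above might be sketched as follows in Python. This is not the embodiment's actual code: the Table class, the store function, and the "ucs2"/"utf8" format labels are hypothetical names chosen for illustration, and big-endian UTF-16 is assumed to stand in for UCS-2 input.

```python
# Illustrative sketch only; all names here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    storage_format: str  # "ucs2" (first format) or "utf8" (second format)
    rows: list = field(default_factory=list)


def store(table: Table, ucs2_input: bytes) -> None:
    """Store UCS-2 input, converting only when the table is declared UTF-8."""
    if table.storage_format == "utf8":
        # Conversion routine: decode the 2-byte form, re-encode as UTF-8.
        table.rows.append(ucs2_input.decode("utf-16-be").encode("utf-8"))
    else:
        # The table keeps the first format, so the input is stored unconverted.
        table.rows.append(ucs2_input)


chinese_table = Table("orders_cn", "ucs2")
english_table = Table("orders_en", "utf8")
store(chinese_table, "\u6570\u636e".encode("utf-16-be"))
store(english_table, "data".encode("utf-16-be"))
print(len(chinese_table.rows[0]), len(english_table.rows[0]))
```

Under these assumptions each table stores its data in the format that is smaller for its expected content: two bytes per Chinese character in the first table and one byte per ASCII character in the second.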
Another preferred embodiment of the present invention is a system implementing the above-mentioned method embodiment of the present invention.
Yet another preferred embodiment of the present invention is a program storage device readable by a computer tangibly embodying a program of instructions executable by the computer to perform method steps of the above-mentioned method embodiment of the present invention.