A. Field of the Invention
Methods consistent with this invention generally relate to computer systems and, more particularly, to methods for transmitting multibyte characters in a network.
B. Description of the Related Art
The Internet is a composite network of networks that facilitates electronic communications between subscribers in virtually every corner of the globe. The World Wide Web (the “Web”) is a sub-network of the Internet organized to enable users to locate stored information. In general, the Web enables two computers, one called a “client” and the other a “server,” to communicate through Internet connections using the Hypertext Transfer Protocol (HTTP). The client executes a “Web browser,” or specialized software program, that allows the user to obtain information in the form of “Web pages” from the server. Developers utilize a software language referred to as the Hypertext Markup Language (HTML) to create these Web pages.
Many existing application programs allow users to take advantage of information on the Internet. HotJava Views™, for example, is a suite of application programs that provides users with e-mail, calendaring, name directory access, and Internet browsing capabilities, all written in the Java™ programming language from Sun Microsystems, Inc. NameView™ is an application program that enables users to view a name directory provided by an application within HotJava Views or downloaded from an existing directory database. The Java programming language is an object-oriented programming language that is described, for example, in a text entitled “The Java Language Specification” by James Gosling, Bill Joy, and Guy Steele, Addison-Wesley, 1996. Sun, Sun Microsystems, the Sun Logo, NameView, HotJava Views, and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
NameView users can search directory databases of information on a local server or other web servers connected to the Internet. To access a web server and obtain information using applications like NameView, a user enters information into an input form called a “request.” A Common Gateway Interface (CGI) script is an application that receives information from the user and puts the requested information into HTTP format for transmission across the Internet. CGI is a standard protocol for exchanging information between servers and applications external to the server, such as those on a client. When the client passes the CGI script and the HTTP request to a web server, the receiving server executes the CGI script and sends the information specified by the request back to the client.
In some cases, the HTTP request is first encapsulated using a protocol such as Multipurpose Internet Mail Extensions (“MIME”), which is a standard protocol for multi-media e-mail messages. The MIME protocol encapsulates the request in a file for transport and appends a header to an encoded form of the file. The header specifies certain information, such as the encoding method used, and requests the server to run a CGI script.
The globalization of the Internet has created a need for application programs that can operate in any location and in a wide variety of languages. Users in the United States may want to use applications such as NameView to search for the e-mail address of a friend in Japan. The directory database containing this information may be stored using the Japanese language and digital representations of Japanese characters. Languages with many different characters may be encoded using Unicode character encoding. Unicode is a 16-bit character coding system established by the Unicode Consortium. In Unicode, each character is represented by two bytes of digital information. In the American Standard Code for Information Interchange (ASCII) format, however, each character is represented by seven bits of digital code. English and other languages with a limited character set typically use the ASCII encoding format, with each character stored in one byte, or eight bits of data.
Although an increasing number of software and hardware devices are manufactured for use with many different languages, many existing computer systems and application programs still support only 8-bit characters. As a result, transformation formats have been developed that translate characters into an 8-bit format. UTF-8 is an example of a variable-width or “multibyte” encoding format developed to support multilingual text. In UTF-8, standard ASCII characters are represented using only one byte, whose high-order bit is “0”. Non-ASCII characters, however, require two or even three bytes. The first byte of a UTF-8 multibyte character indicates the total number of bytes in the character. For example, the first byte of a two-byte character has high-order bits “110” and the first byte of a three-byte character has high-order bits “1110”. All other bytes of a multibyte character have high-order bits “10”.
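By way of illustration only, the lead-byte convention described above can be sketched in the Python programming language. The function names utf8_sequence_length and bit_patterns are hypothetical and chosen for this sketch; the sketch covers only the one-, two-, and three-byte forms discussed above and is not a complete UTF-8 decoder.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the number of bytes announced by a UTF-8 lead byte.

    Hypothetical helper: handles only the 1-, 2-, and 3-byte forms.
    """
    if (lead_byte & 0x80) == 0x00:   # 0xxxxxxx: single-byte ASCII character
        return 1
    if (lead_byte & 0xE0) == 0xC0:   # 110xxxxx: first byte of a two-byte character
        return 2
    if (lead_byte & 0xF0) == 0xE0:   # 1110xxxx: first byte of a three-byte character
        return 3
    # 10xxxxxx is a continuation byte; longer forms are outside this sketch.
    raise ValueError("not a lead byte of a 1-, 2-, or 3-byte sequence")


def bit_patterns(text: str) -> list[str]:
    """Return each UTF-8 byte of `text` as an 8-character bit string."""
    return [format(b, "08b") for b in text.encode("utf-8")]


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u65e5"):
        encoded = ch.encode("utf-8")
        print(repr(ch), bit_patterns(ch), utf8_sequence_length(encoded[0]))
```

In this sketch, the ASCII letter “A” yields a single byte beginning with “0”, the character U+00E9 yields two bytes beginning with “110” and “10”, and the Japanese character U+65E5 yields three bytes beginning with “1110”, “10”, and “10”.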
Multibyte character encoding systems, like UTF-8, require fewer bits to store and transport predominantly ASCII text than a fixed two-byte encoding does, but present difficulties for some applications. One reason is that web servers use the length of the data string in processing the HTTP request. Applications that formulate HTTP requests typically call a standard function to determine that length. Such functions, however, typically assume that the string contains only ASCII-encoded information, so that the data string length in bytes equals the message length in characters, which holds only for strings composed of standard ASCII characters. With variable-length UTF-8 encoded strings, the number of bytes in the transmitted data string often differs from the message length. Standard functions, therefore, return an incorrect length, which creates processing errors at the web server.
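The discrepancy described above can be illustrated with a minimal Python sketch. The function name char_and_byte_lengths and the sample request strings are hypothetical and serve only to contrast a character count with the byte count of the same string after UTF-8 encoding; the sketch does not represent any particular application's length function.

```python
def char_and_byte_lengths(text: str) -> tuple[int, int]:
    """Return (character count, UTF-8 byte count) for `text`.

    Hypothetical helper: the first value is what a character-counting
    length function reports; the second is the size of the data string
    actually transmitted when the text is UTF-8 encoded.
    """
    return len(text), len(text.encode("utf-8"))


if __name__ == "__main__":
    # ASCII-only request data: both lengths agree.
    print(char_and_byte_lengths("name=Smith"))
    # Request data containing two Japanese characters: the lengths differ,
    # because each of those characters occupies three bytes in UTF-8.
    print(char_and_byte_lengths("name=\u65e5\u672c"))
```

For the ASCII-only string the two values are equal. For the string containing two Japanese characters, the character count is 7 while the encoded data string occupies 11 bytes, so a length computed by counting characters would understate the size of the transmitted data.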
Therefore, a need exists for a method of transmitting multibyte characters in a network that communicates to the server an accurate data string length even when using variable-length encoding schemes like UTF-8.