A. Field of the Invention
Methods consistent with this invention generally relate to computer systems and, more particularly, to methods for transmitting multibyte characters in a network.
B. Description of the Related Art
The Internet is a composite network of networks that facilitates electronic communications between subscribers in virtually every corner of the globe. The World Wide Web (the "Web") is a sub-network of the Internet organized to enable users to locate stored information. In general, the Web enables two computers, one called a "client" and the other a "server," to communicate through Internet connections using a hypertext transfer protocol (HTTP). The client executes a "Web browser," or specialized software program, that allows the user to obtain information in the form of "Web pages" from the server. Developers utilize a software language referred to as the hypertext mark-up language (HTML) to create these web pages.
Many existing application programs allow users to take advantage of information on the Internet. HotJava Views™, for example, is a suite of application programs that provides users with e-mail, calendaring, name directory access, and Internet browsing capabilities, all written in the Java™ programming language from Sun Microsystems, Inc. NameView™ is an application program that enables users to view a name directory provided by an application within HotJava Views or downloaded from an existing directory database. The Java programming language is an object-oriented programming language that is described, for example, in a text entitled "The Java Language Specification" by James Gosling, Bill Joy, and Guy Steele, Addison-Wesley, 1996. Sun, Sun Microsystems, the Sun Logo, NameView, HotJava Views, and Java are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
NameView users can search directory databases of information on a local server or other web servers connected to the Internet. To access a web server and obtain information using applications like NameView, a user enters information into an input form called a "request." A Common Gateway Interface (CGI) script is an application that receives information from the user and puts the requested information into HTTP format for transmission across the Internet. CGI is a standard protocol for exchanging information between servers and applications external to the server, such as those on a client. When the client passes the CGI script and HTTP request, the receiving web server executes the CGI script and sends the information specified by the request back to the client.
In some cases, the HTTP request is first encapsulated using a protocol such as "MIME," which is a standard protocol for multi-media e-mail messages. The MIME protocol encapsulates the request in a file for transport and appends a header to an encoded form of the file. The header specifies certain information, such as the encoding method used, and requests the server to run a CGI script.
The globalization of the Internet has created a need for application programs that can operate in any location and in a wide variety of languages. Users in the United States may want to use applications such as NameView to search for the e-mail address of a friend in Japan. The directory database containing this information may be stored using the Japanese language and digital representations of Japanese characters. Languages with many different characters may be encoded using Unicode character encoding. Unicode is a 16-bit character coding system established by the Unicode Consortium. In Unicode, each character is represented by two bytes of digital information. In the American Standard Code for Information Interchange (ASCII) format, however, each character is represented by seven bits of digital code. English and other languages with a limited character set typically use the ASCII encoding format, with each character stored in one byte, or eight bits of data.
Although an increasing number of software and hardware devices are manufactured for use with many different languages, many existing computer systems and application programs still support only 8-bit characters. As a result, transformation formats have been developed that translate characters into an 8-bit format. UTF-8 is an example of a variable-width or "multibyte" encoding format developed to support multilingual text. In UTF-8, standard ASCII characters are represented using only one byte that begins with a "0". Non-ASCII characters, however, require two or even three bytes. The first byte of a UTF-8 multibyte character indicates the total number of bytes in the character. For example, the first byte of a two-byte character has high-order bits "110" and the first byte of a three-byte character begins with "1110". All other bytes of a multibyte character begin with "10".
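The bit patterns described above can be illustrated with a short sketch in the Java programming language. The class name `Utf8LeadByte` and the method `sequenceLength` are hypothetical names chosen for illustration only, and the sketch covers just the one-, two-, and three-byte forms mentioned in the text.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LeadByte {
    /**
     * Classifies a single byte by its high-order bits (hypothetical helper).
     * Returns 1, 2, or 3 for a first byte, 0 for a continuation byte,
     * and -1 for any pattern not described in the passage above.
     */
    public static int sequenceLength(int b) {
        b &= 0xFF;                          // treat the value as an unsigned byte
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx: one-byte ASCII character
        if ((b & 0xC0) == 0x80) return 0;   // 10xxxxxx: continuation byte
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx: first byte of a two-byte character
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx: first byte of a three-byte character
        return -1;                          // other patterns are outside this sketch
    }

    public static void main(String[] args) {
        // 'A' (ASCII), U+00E9 (two bytes in UTF-8), U+65E5 (three bytes in UTF-8)
        byte[] bytes = "A\u00e9\u65e5".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            System.out.printf("0x%02X -> %d%n", b & 0xFF, sequenceLength(b));
        }
    }
}
```

Reading only the first byte of each character is enough to know how many bytes follow, which is why a receiver can step through a UTF-8 string without a separate table of character widths.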
Multibyte character encoding systems, like UTF-8, require fewer bits to store and transport, but present difficulties for some applications. One reason is that web servers use the length of the data string in processing the HTTP request. Applications that formulate HTTP requests typically call a standard function to determine this length. Standard functions, however, typically assume that the string contains only ASCII-encoded information and that the data string length is equal to the message length, which holds only for strings encoded using standard ASCII characters. With variable-length UTF-8 encoded strings, the number of bytes in the transmitted data string will often differ from the message length. Standard functions, therefore, return an incorrect length, which creates errors in processing at the web server.
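The mismatch can be seen in a minimal sketch that compares the character count reported by `String.length()` with the number of bytes produced by UTF-8 encoding. The class name `LengthMismatch` and its two helper methods are hypothetical and serve only to illustrate the problem.

```java
import java.nio.charset.StandardCharsets;

public class LengthMismatch {
    /** Number of 16-bit characters in the string (what a naive length function reports). */
    public static int charCount(String s) {
        return s.length();
    }

    /** Number of bytes actually placed on the wire when the string is encoded as UTF-8. */
    public static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "abc";
        String japanese = "\u65e5\u672c";   // two Japanese characters

        // For ASCII text the two counts agree; for the Japanese text they do not.
        System.out.println(charCount(ascii) + " chars, " + utf8ByteCount(ascii) + " bytes");
        System.out.println(charCount(japanese) + " chars, " + utf8ByteCount(japanese) + " bytes");
    }
}
```

For the ASCII string both counts are 3. For the two Japanese characters the character count is 2 while the encoded data string occupies 6 bytes, so a length derived from the character count would understate the data actually sent.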
Therefore, a need exists for a method of transmitting multibyte characters in a network that communicates to the server an accurate data string length even when using variable-length encoding schemes like UTF-8.
A method for transmitting data in a network consistent with the present invention comprises the steps, performed by a processor, of receiving a set of fixed-length characters; converting each fixed-length character into a multibyte character to determine a length corresponding to the multibyte characters; and transmitting the length and the multibyte characters.
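By way of illustration only, the three steps of this method might be sketched as follows. The class name `MultibyteSender`, the methods `toUtf8` and `buildMessage`, and the use of a `Content-Length` style header line to carry the length are all assumptions made for this example and are not taken from the description above. Characters outside the 16-bit range (surrogate pairs) are not handled.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class MultibyteSender {
    /** Converts one 16-bit fixed-length character into its 1-, 2-, or 3-byte UTF-8 form. */
    public static byte[] toUtf8(char c) {
        if (c < 0x80) {
            return new byte[] { (byte) c };                      // 0xxxxxxx
        }
        if (c < 0x800) {
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),                        // 110xxxxx
                (byte) (0x80 | (c & 0x3F))                       // 10xxxxxx
            };
        }
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),                           // 1110xxxx
            (byte) (0x80 | ((c >> 6) & 0x3F)),                   // 10xxxxxx
            (byte) (0x80 | (c & 0x3F))                           // 10xxxxxx
        };
    }

    /**
     * Receives fixed-length characters, converts each one to determine the total
     * byte length, and returns the length line followed by the multibyte data.
     * The header format is an assumption of this sketch.
     */
    public static byte[] buildMessage(char[] chars) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (char c : chars) {
            byte[] encoded = toUtf8(c);
            body.write(encoded, 0, encoded.length);
        }
        int length = body.size();                                // length in bytes, not characters

        byte[] header = ("Content-Length: " + length + "\r\n\r\n")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] payload = body.toByteArray();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(header, 0, header.length);
        out.write(payload, 0, payload.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] message = buildMessage("a\u00e9\u65e5".toCharArray());
        System.out.println(new String(message, StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is the ordering: the length is taken from the converted bytes rather than from the count of input characters, so the value sent ahead of the data matches what the receiver will actually read.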
In accordance with the present invention, as embodied and broadly described herein, an apparatus for transmitting data in a network comprises a receiver configured to receive a set of fixed-length characters; a converter configured to convert each fixed-length character into a multibyte character to determine a length corresponding to the multibyte characters; and a transmitter configured to transmit the length and the multibyte characters.
In accordance with another aspect of the present invention, as embodied and broadly described herein, a computer program product comprises a computer-usable medium having computer-readable code embodied therein for transmitting data in a network, the computer program product comprising the steps, performed by a processor, of receiving a set of fixed-length characters; converting each fixed-length character into a multibyte character to determine a length corresponding to the multibyte characters; and transmitting the length and the multibyte characters.
In accordance with still another aspect of the present invention, as embodied and broadly described herein, a system for transmitting data in a network comprises means for receiving a set of fixed-length characters; means for converting each fixed-length character into a multibyte character to determine a length corresponding to the multibyte characters; and means for transmitting the length and the multibyte characters.