The present invention relates generally to character encoding systems and methods. More particularly, the present invention relates to systems and methods for determining the appropriate or best fit character encoding scheme for a set of data.
The use of computer networks, particularly the Internet, to store data and provide information to users is becoming increasingly common. The Internet is a loosely organized network of computers spanning the globe. Client computers, such as home computers, can connect to other clients and servers on the Internet through a regional Internet Service Provider (“ISP”) that further connects to larger regional ISPs or directly to one of the Internet's “backbones.” Regional and national backbones are interconnected through long range data transport connections such as satellite relays and undersea cables. Through these layers of interconnectivity, each computer connected to the Internet can connect to every other computer on the Internet, or at least to a large percentage of them.
The Internet is generally arranged on a client-server architecture. In this network model, client computers request information stored on servers and servers find and return the requested information to the client computer. The server computers can store a variety of data types and provide a number of services. For example, servers can provide telnet, ftp (file transfer protocol), gopher, smtp (simple mail transfer protocol) and world wide web services, to name a few. In some cases, any number of these services can be provided by the same physical server over different ports (e.g., world wide web content over port 80, email over port 25, etc.). If a server makes a particular port available, client computers can connect to that port from virtually anywhere on the Internet, leading to global connectivity between computers.
For typical Internet users, the world wide web and email (smtp) have become the predominant services utilized. The world wide web was developed to facilitate the sharing of technical documents, but over the past decade the number of information providers has increased dramatically and now technical, commercial and recreational content is available to a user from around the world. The information provided through world wide web services is typically presented in the form of hypertext documents, known as web pages, that allow the user to “click” on certain words and graphics to retrieve additional web pages.
When a user requests a web page, a program known as a web browser makes a request to the appropriate web server (usually after retrieving the IP address for the web server from a name server). The web server locates the web page and transmits the data corresponding to the web page to the client computer as a series of ones and zeros (e.g., 00000010000001010001000000001100 . . . ). The web browser must transform the bytes received into recognizable characters for display to the user.
Character encoding schemes provide a mechanism for mapping the retrieved bytes to recognizable characters. In a character encoding scheme, a “coded character set” is a mapping from a set of characters to a set of non-negative integers, with a character being defined within the coded character set if the coded character set contains a mapping from the character to an integer. The integer is known as a “code point” and the character as an “encoded character.” A large number of character encoding schemes are defined, many of which are defined by individual vendors, but no standardized character encoding scheme has been adopted universally. The lack of standardization is problematic because an integer that maps to the character “a” in one character encoding scheme may map to “I,” a Chinese character, or no character at all in another character encoding scheme. If a web browser receiving web page data uses an incorrect character encoding scheme to display the web page's contents, the contents may appear as unintelligible or meaningless.
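By way of illustration only, the following minimal Python sketch shows one byte value being interpreted under several character encoding schemes. The helper name `decode_byte`, the byte value 0xE9, and the particular schemes listed are assumptions chosen for the example and are not taken from the text above.

```python
def decode_byte(value, scheme):
    """Return the character that `scheme` assigns to one byte value,
    or None if the scheme defines no character for that value."""
    try:
        return bytes([value]).decode(scheme)
    except UnicodeDecodeError:
        return None


# The same integer (0xE9) looked up in four different schemes.
# `!a` prints an ASCII-safe representation of each result.
for scheme in ("latin-1", "cp437", "koi8-r", "ascii"):
    print(f"{scheme:8} -> {decode_byte(0xE9, scheme)!a}")
```

In this sketch the three 8-bit schemes each return a different character for the same integer, while ASCII, which defines only the values 0 through 127, returns no character at all.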
In order to properly display a web page, a web browser must determine the appropriate character encoding scheme for that web page. This is typically done by reading a “charset” parameter in the Content-Type HTTP header of the web page or in a META declaration contained in the web page. Both these mechanisms, however, require that the character encoding scheme be defined in the content of the web page itself. For web pages that do not provide this character encoding information, the web browser must attempt to determine the appropriate character encoding scheme through other mechanisms.
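The two declaration mechanisms described above can be sketched as follows. This is a simplified illustration and not a description of how any particular web browser is implemented; the function names, the regular expression, and the 1024-byte scan window are assumptions of the sketch.

```python
import re
from email.message import Message

# Simplified pattern for a charset declared inside a META tag (assumption:
# real browsers use a more elaborate pre-scan than this).
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""",
                          re.IGNORECASE)


def charset_from_header(content_type):
    """Read the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None if absent


def charset_from_meta(html_bytes):
    """Look for a charset in a META declaration near the start of the page."""
    match = META_CHARSET.search(html_bytes[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def declared_charset(content_type, html_bytes):
    """Prefer the HTTP header, then fall back to the META declaration."""
    return charset_from_header(content_type) or charset_from_meta(html_bytes)
```

When `declared_charset` returns `None`, neither mechanism supplied a character encoding scheme, which is the case the remainder of this description addresses.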
Existing web browsers such as Microsoft's Internet Explorer and Netscape's Navigator attempt to determine the appropriate character encoding scheme (when the character encoding scheme is not otherwise defined) by defining subsets of character ranges that are unique or special to a given character encoding scheme. For example, the web browser may define 1-3 as corresponding to a first character encoding scheme and 6-9 as corresponding to a second character encoding scheme. If the integers received by the web browser are 4, 5, and 8, more of these integers fit in the defined range 6-9 for the second character encoding scheme. Therefore, the web browser could choose that scheme. The web browser can then display characters based on the second character encoding scheme. This process can be inefficient because the web browser must test a large number of ranges and can be inaccurate as the ranges for various character encoding schemes can overlap. Moreover, many character encoding schemes do not use consecutive integers to encode characters and the character encoding scheme may not use a well-defined range of integers to encode characters, leading to the display of incorrect characters by the web browser.
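The range-based approach just described can be approximated by the following sketch, which uses the illustrative ranges 1-3 and 6-9 and the received integers 4, 5, and 8 from the example. The names `SIGNATURE_RANGES` and `guess_by_ranges` are hypothetical, and the sketch is not intended to reproduce the actual logic of any existing browser.

```python
# Hypothetical signature ranges taken from the numbers in the example above.
SIGNATURE_RANGES = {
    "first_scheme": range(1, 4),    # integers 1-3
    "second_scheme": range(6, 10),  # integers 6-9
}


def guess_by_ranges(values, signature_ranges=SIGNATURE_RANGES):
    """Count how many received integers fall in each scheme's range and
    return the scheme with the most hits, together with all the counts."""
    hits = {
        name: sum(value in rng for value in values)
        for name, rng in signature_ranges.items()
    }
    best = max(hits, key=hits.get)
    return best, hits


guess, hits = guess_by_ranges([4, 5, 8])
print(guess, hits)
```

Here the second scheme is chosen on the strength of a single integer, since 4 and 5 fall in neither range, which illustrates how little evidence such a heuristic may rest on.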
The present invention provides a character encoding detection system and method that eliminates or substantially reduces disadvantages and problems associated with previously developed character encoding detection systems and methods. More particularly, one aspect of the present invention can be characterized as a method for determining an appropriate (or best-fit) character encoding scheme including the steps of (i) generating a set of reference characters based on a reference character encoding scheme and a first set of bytes; (ii) generating a set of test characters based on a test character encoding scheme and said first set of bytes; (iii) generating a set of test bytes based on said test character encoding scheme and said set of test characters; (iv) generating a set of comparison characters based on said reference character encoding scheme and said set of test bytes; and (v) comparing said set of reference characters to said set of comparison characters. In one embodiment of the present invention, the aforementioned steps are implemented as a Java-based software program with Unicode (e.g., UCS-2) as the reference character encoding scheme.
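Steps (i) through (v) can be sketched in Python as shown below. This is a minimal illustration under stated assumptions rather than the described Java implementation: the function names are hypothetical, undecodable input is replaced rather than rejected, and Latin-1 (whose 256 code points coincide with the first 256 Unicode code points) stands in for the reference character encoding scheme so that each byte maps to exactly one reference character.

```python
def round_trip(data, test_scheme, reference_scheme="latin-1"):
    """Carry out steps (i)-(iv) and return the two character sets to compare.

    Assumption: errors="replace" substitutes a placeholder for anything the
    scheme cannot decode or encode, so a poor fit alters the test bytes.
    """
    # (i) reference characters from the original bytes
    reference_chars = data.decode(reference_scheme, errors="replace")
    # (ii) test characters from the same bytes, using the test scheme
    test_chars = data.decode(test_scheme, errors="replace")
    # (iii) test bytes produced by re-encoding the test characters
    test_bytes = test_chars.encode(test_scheme, errors="replace")
    # (iv) comparison characters from the test bytes, via the reference scheme
    comparison_chars = test_bytes.decode(reference_scheme, errors="replace")
    return reference_chars, comparison_chars


def scheme_round_trips(data, test_scheme):
    """(v) Compare the reference characters with the comparison characters."""
    reference_chars, comparison_chars = round_trip(data, test_scheme)
    return reference_chars == comparison_chars


# Sample input: three Japanese characters (U+65E5 U+672C U+8A9E) in Shift_JIS.
sample = "\u65e5\u672c\u8a9e".encode("shift_jis")
print(scheme_round_trips(sample, "shift_jis"))
print(scheme_round_trips(sample, "ascii"))
```

In this sketch a test scheme that can represent every byte sequence in the input reproduces the original bytes, so the comparison succeeds, whereas a scheme that cannot represent them yields substituted bytes and the comparison fails.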
In another embodiment of the present invention, rather than comparing a set of reference characters to a set of comparison characters, the present invention can compare the original set of bytes with the set of test bytes. This embodiment of the present invention can omit generating the reference characters and the comparison characters. Yet another embodiment of the present invention can generate a set of reference integers corresponding to the original set of bytes (and the reference characters) and a set of test integers corresponding to the set of test bytes (and the test characters) and then compare the set of reference integers with the set of test integers. Again, this embodiment of the present invention can optionally omit generating the set of reference characters and the set of comparison characters.
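The two alternative comparisons described above can be sketched as follows. The function names are hypothetical, and representing the reference and test integers as the integer values of the respective bytes is an assumption of this sketch.

```python
def regenerate_bytes(data, test_scheme):
    """Decode with the test scheme, then re-encode to obtain the test bytes."""
    test_chars = data.decode(test_scheme, errors="replace")
    return test_chars.encode(test_scheme, errors="replace")


def bytes_match(data, test_scheme):
    """Compare the original bytes directly with the test bytes."""
    return regenerate_bytes(data, test_scheme) == data


def integers_match(data, test_scheme):
    """Compare integer values instead of bytes (assumed representation)."""
    reference_integers = list(data)
    test_integers = list(regenerate_bytes(data, test_scheme))
    return reference_integers == test_integers


# Sample input: "naive cafe" with two accented letters, encoded as UTF-8.
sample_utf8 = "na\u00efve caf\u00e9".encode("utf-8")
print(bytes_match(sample_utf8, "utf-8"), integers_match(sample_utf8, "ascii"))
```

Neither function generates reference characters or comparison characters, consistent with the option of omitting those steps.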
Regardless of whether the test bytes are compared to the original set of bytes, the test integers are compared to the reference integers or the comparison characters are compared to the reference characters, the present invention can determine the degree of match between the reference character encoding scheme and the test character encoding scheme. If the degree of match is within a defined threshold, that test character encoding scheme can be selected as the best-fit character encoding scheme or can be saved for comparison with other test character encoding schemes; if the degree of match is outside a defined threshold, the test character encoding scheme can be rejected, or can also be saved for comparison with other test character encoding schemes.
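One possible way to express a degree of match and a threshold is sketched below. The position-by-position byte comparison, the 0.95 threshold, and the names `match_ratio` and `best_fit` are all assumptions made for illustration; the text above does not prescribe a particular metric or threshold value.

```python
def match_ratio(data, test_scheme):
    """Fraction of byte positions that survive a decode/encode round trip.

    Assumption: bytes are compared position by position, and any difference
    in length counts as mismatched positions.
    """
    test_bytes = (data.decode(test_scheme, errors="replace")
                      .encode(test_scheme, errors="replace"))
    same = sum(a == b for a, b in zip(data, test_bytes))
    return same / max(len(data), len(test_bytes), 1)


def best_fit(data, candidate_schemes, threshold=0.95):
    """Score every candidate, keep those within the threshold, and return
    the highest-scoring one (or None if every candidate is rejected)."""
    scores = {scheme: match_ratio(data, scheme) for scheme in candidate_schemes}
    accepted = {s: r for s, r in scores.items() if r >= threshold}
    return max(accepted, key=accepted.get) if accepted else None


# Sample input: a German phrase with three non-ASCII letters, in UTF-8.
sample_text = "Gr\u00fc\u00dfe aus M\u00fcnchen".encode("utf-8")
print(match_ratio(sample_text, "ascii"), best_fit(sample_text, ["ascii", "utf-8"]))
```

In this sketch each candidate's score is saved so that it can be compared with the scores of the other test character encoding schemes before one is selected.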
Embodiments of the present invention provide advantages over previously developed systems and methods for determining character encoding schemes because the present invention can more accurately test character encoding schemes, including those that do not use a continuous series of bytes (or integers) to encode characters.
Additionally, embodiments of the present invention provide an advantage by checking the accuracy of characters generated by a test character encoding scheme against the characters generated by a reference character encoding scheme that has a relatively large encoded character set.