This disclosure relates generally to character encoding systems and methods. More particularly, the present disclosure relates to systems and methods of detecting character encoding in electronic documents.
In electronic documents, text is most often represented as a sequence of computer bytes or words of various bit lengths. The mapping that has evolved from text characters to the numeric codes is many to one. Furthermore, the representation of the numeric codes (encoding) is not unique. Mechanisms such as code pages allow the proper decoding of the numeric codes into the correct, corresponding text characters. To do this properly, both the character to numeric code mapping and the numeric code encoding scheme must be known.
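By way of illustration only, the following minimal Python sketch shows why both the character-to-code mapping and the encoding scheme must be known. It takes one character, looks up its numeric code in the Unicode coded character set, and serializes that code under three encoding schemes. The choice of the euro sign and of these three schemes is an assumption made for the example and does not come from the disclosure.

```python
# Illustrative only: one character, one numeric code, several byte encodings.
ch = "\u20ac"          # EURO SIGN, chosen arbitrarily for the example
code_point = ord(ch)   # its numeric code in the Unicode coded character set

encodings = ("utf-8", "utf-16-le", "utf-16-be")
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"U+{code_point:04X} as {name}: {data.hex(' ')}")

# Decoding recovers the character only when the matching scheme is applied.
round_trip = {name: data.decode(name) for name, data in encoded.items()}
```

The same numeric code yields three different byte sequences, so a reader of the bytes cannot recover the character without knowing which scheme produced them.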
The use of computer networks, particularly the Internet, to store data and provide information to users has become increasingly common. Client computers, such as home computers, can connect to other clients and servers on the Internet through a regional Internet Service Provider (“ISP”) that further connects to larger regional ISPs or directly to one of the Internet's “backbones.” Regional and national backbones are interconnected through long range data transport connections such as satellite relays and undersea cables. Through these layers of interconnectivity, each computer connected to the Internet can connect to every other computer on the Internet, or at least to a large percentage of them.
The Internet is generally arranged on a client-server architecture. In this network model, client computers request information stored on servers and servers find and return the requested information to the client computer. The server computers can store a variety of data types and provide a number of services. For example, servers can provide telnet, FTP (file transfer protocol), gopher, SMTP (simple mail transfer protocol) and World Wide Web services, to name a few. In some cases, any number of these services can be provided by the same physical server over different ports. If a server makes a particular port available, client computers can connect to that port from virtually anywhere on the Internet, leading to global connectivity between computers.
For typical Internet users, the World Wide Web and email (SMTP) have become the predominant services utilized. The World Wide Web was developed to facilitate the sharing of technical documents, but over the past decade the number of information providers has increased dramatically and now technical, commercial and recreational content is available to a user from around the world. The information provided through World Wide Web services is typically presented in the form of hypertext documents, known as web pages, that allow the user to “click” on certain words and graphics to retrieve additional web pages.
When a user requests a web page, a program known as a web browser makes a request to the appropriate web server. The web server locates the web page and transmits the data corresponding to the web page to the client computer as a series of ones and zeros. The web browser must transform the bytes received into recognizable characters for display to the user.
Character encoding schemes provide a mechanism for mapping the retrieved bytes to recognizable characters. In a character encoding scheme, a “coded character set” is a mapping from a set of characters to a set of non-negative integers, with a character being defined within the coded character set if the coded character set contains a mapping from the character to an integer. The integer is known as a “code point” and the character as an “encoded character.” A large number of character encoding schemes are defined, many of which are defined by individual vendors, but no standardized character encoding scheme has been adopted universally. The lack of standardization is problematic because an integer that maps to the character “a” in one character encoding scheme may map to “I,” a Chinese character, or no character at all in another character encoding scheme. If a web browser receiving web page data uses an incorrect character encoding scheme to display the web page's contents, the contents may appear unintelligible or meaningless.
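The problem can be sketched in a few lines of Python. The example below decodes the same single byte under three character encoding schemes available in the Python standard library. The byte value and the three schemes are assumptions chosen for illustration and are not drawn from the disclosure.

```python
# Illustrative only: one integer, three character encoding schemes.
raw = bytes([0xE9])  # a single byte with integer value 233

schemes = ("latin-1", "cp1251", "ascii")
# errors="replace" substitutes U+FFFD when the scheme defines no character.
decoded = {name: raw.decode(name, errors="replace") for name in schemes}

for name, text in decoded.items():
    print(f"{name:8} -> {text!r} (U+{ord(text):04X})")
```

Under these assumptions the byte is a Latin letter in one scheme, a Cyrillic letter in another, and undefined in the third, which is the kind of mismatch that produces unintelligible output when the wrong scheme is applied.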
In order to properly display a web page, a web browser must determine the appropriate character encoding scheme for that web page. This is typically done by reading a “charset” parameter in the Content-Type HTTP header of the web page or in a META declaration contained in the web page. Both of these mechanisms, however, require that the character encoding scheme be declared explicitly, either in the HTTP response or in the content of the web page itself. For web pages that do not provide this character encoding information, the web browser must attempt to determine the appropriate character encoding scheme through other mechanisms.
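As a rough sketch of the two declaration mechanisms just described, the following Python function looks for a “charset” parameter first in a Content-Type header value and then in a META declaration near the start of the page. The function name `declared_charset`, the regular expressions, and the 1024-byte scan limit are hypothetical simplifications and do not reproduce the parsing rules of any particular web browser.

```python
import re
from typing import Optional

# Simplified patterns (assumptions for illustration, not a full HTML parser).
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:\-]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:\-]+)', re.IGNORECASE
)


def declared_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the declared charset in lower case, or None if none is declared."""
    if content_type:
        match = _HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # Fall back to a META declaration in the first 1024 bytes of the page.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS"></head></html>'
from_header = declared_charset("text/html; charset=ISO-8859-1", b"")
from_meta = declared_charset("text/html", page)
undeclared = declared_charset("text/html", b"<html><body>hello</body></html>")
```

The last call returns `None`, which corresponds to the case in which the web browser must fall back on some other mechanism.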
Existing web browsers such as Microsoft® Internet Explorer and Netscape® Navigator attempt to determine the appropriate character encoding scheme (when the character encoding scheme is not otherwise defined) by defining subsets of character ranges that are unique or special to a given character encoding scheme. For example, the web browser may define 1–3 as corresponding to a first character encoding scheme and 6–9 as corresponding to a second character encoding scheme. If the integers received by the web browser are 4, 5 and 8, only 8 falls within a defined range, namely the range 6–9 for the second character encoding scheme, so more of the received integers fit the second scheme than the first. Therefore, the web browser could choose the second scheme and display characters based on it. This process can be inefficient because the web browser must test a large number of ranges, and it can be inaccurate because the ranges for various character encoding schemes can overlap. Moreover, many character encoding schemes do not use consecutive integers to encode characters, and a character encoding scheme may not use a well-defined range of integers at all, leading to the display of incorrect characters by the web browser.
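The range-based approach described above can be sketched as follows. This is a minimal illustration using the example ranges and integers from the preceding paragraph, and it is not the actual implementation of any web browser. The function name `guess_by_ranges` and the scheme labels are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def guess_by_ranges(
    values: Iterable[int], ranges: Dict[str, Tuple[int, int]]
) -> Tuple[Optional[str], Dict[str, int]]:
    """Count how many values fall in each scheme's range and pick the largest count."""
    observed = list(values)
    counts = {
        name: sum(1 for v in observed if low <= v <= high)
        for name, (low, high) in ranges.items()
    }
    best = max(counts, key=counts.get)
    # If nothing matched any range, no scheme can be chosen.
    return (best if counts[best] > 0 else None), counts


ranges = {"first scheme": (1, 3), "second scheme": (6, 9)}
choice, counts = guess_by_ranges([4, 5, 8], ranges)
print(choice, counts)
```

In this example the decision rests on a single matching integer out of three, and the integers 4 and 5 contribute nothing, which illustrates how weak the evidence behind a range-based choice can be.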