Computers have long employed a variety of encoding schemes to represent the characters of various writing scripts as computer data. As Internet usage spreads across the globe, there is an acute need to exchange information across language and regional boundaries. However, global information exchange has been hampered by the proliferation of different regional encoding schemes.
When data is exchanged between two applications that support multiple encoding schemes, the receiving application must correctly detect the encoding scheme with which the received data is encoded before the data can be properly utilized and/or displayed. Consider, for example, the situation wherein a computer receives data to be displayed in a web browser. In order to correctly display the data received, the browser initially tries to rely on the encoding information provided by the HTTP server, the web page and/or the end user. The end user may provide this encoding information via a character-encoding menu, for example. Unfortunately, this type of encoding information is often missing from HTTP servers and web pages. Moreover, the typical user is generally not sufficiently technical to always be able to provide the encoding information via a character-encoding menu. Without this encoding information, web pages are sometimes displayed as ‘garbage’ characters, and users are unable to access the desired information and/or functionalities.
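The 'garbage' character problem described above can be illustrated with a minimal Python sketch. The sample header values, the sample text, and the helper name declared_charset are assumptions chosen for illustration; they are not taken from any particular browser or server.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset named in a Content-Type header value, or None if absent.

    Hypothetical helper: it reuses the standard library's MIME header parsing.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical sample text, encoded as it might travel over the network.
text = "日本語"
raw = text.encode("shift_jis")

# A header that names the charset lets the receiver decode correctly.
labeled = declared_charset("text/html; charset=Shift_JIS")
right = raw.decode(labeled)

# A header without a charset gives the receiver nothing to rely on.
unlabeled = declared_charset("text/html")

# Decoding with a wrong scheme raises no error here; it silently yields
# unrelated characters, which is what the user sees as 'garbage'.
wrong = raw.decode("latin-1")
```

In this sketch the wrongly decoded string has one character per byte and shares nothing with the original text, even though the underlying bytes are identical.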
As can be appreciated from the above example, in order to properly display and/or analyze the content (words and/or sentences) of a received document, the encoding scheme of that received document needs to be ascertained so that the content can be decoded using the proper decoding scheme. In situations wherein the encoding scheme information is not explicitly provided, an automatic charset (encoding) detection mechanism that can accurately ascertain the proper encoding scheme for use with the received document is highly useful. With reference to the above-discussed browser example, many Internet browsers have implemented their own versions of automatic charset detection. With such an automatic charset detection mechanism, a web browser can make an educated guess as to the encoding scheme employed when the data was transmitted, and employ that encoding scheme to attempt to display the received information on the browser screen.
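One very simple way to make such an educated guess is trial decoding: attempt each candidate encoding in turn and accept the first that decodes without error. The Python sketch below illustrates only this naive heuristic; it is not the detection method of any particular browser, and the function name guess_charset and the candidate list are assumptions made for the example.

```python
from typing import Optional, Sequence

# Hypothetical candidate list; stricter encodings come first because
# permissive single-byte encodings would accept almost any input.
DEFAULT_CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def guess_charset(raw: bytes,
                  candidates: Sequence[str] = DEFAULT_CANDIDATES) -> Optional[str]:
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


utf8_guess = guess_charset("日本語".encode("utf-8"))
sjis_guess = guess_charset("日本語".encode("shift_jis"))
no_guess = guess_charset(b"\xff\xff", candidates=("utf-8",))
```

Trial decoding only shows that a byte sequence is valid in an encoding, not that the encoding is the one actually used, so the result depends heavily on candidate order. That limitation is one reason the precision of the detection mechanism matters.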
Another useful application of automatic charset (encoding) detection is in the area of anti-spam and content filtering of emails. Spam emails are generally unsolicited bulk electronic messages, which are sent by advertisers and tend to be universally detested by recipients. Spammers also tend to provide no charset information or to provide incorrect charset information. Some users may desire advance filtering of emails based on their contents for the purpose of, for example, properly categorizing or prioritizing the received emails. Content filtering may also be employed to prevent emails that contain offensive and/or malicious content from reaching users. Spam prevention and content filtering are among the more desirable features offered to email users by email systems and providers.
To perform the anti-spam and/or content filtering function on an incoming email, the content of the email (e.g., words or sentences) needs to be analyzed to discern whether the received email is spam. Alternatively or additionally, the content of the received email may also be examined to determine the email's topic category (e.g., sports, social life, economics, etc.) and/or whether its content is offensive/malicious. Automatic charset detection of received emails makes it possible to perform the content-based filtering and/or analysis correctly and precisely.
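The dependence of content filtering on the correct charset can be sketched as follows. The blocked term, the sample message body, and the function name is_flagged are hypothetical, and simple keyword matching stands in here for whatever analysis a real filter would perform.

```python
# Hypothetical term that a content filter might look for.
BLOCKED_TERMS = ("無料",)


def is_flagged(raw: bytes, charset: str) -> bool:
    """Decode the message body with the given charset and search it for blocked terms."""
    text = raw.decode(charset, errors="replace")
    return any(term in text for term in BLOCKED_TERMS)


# Hypothetical message body, transmitted without reliable charset information.
body = "今だけ無料".encode("euc_jp")

flagged_with_correct_charset = is_flagged(body, "euc_jp")
flagged_with_wrong_charset = is_flagged(body, "latin-1")
```

With the correct charset the term is found; with the wrong charset the same bytes decode to unrelated characters and the message passes the filter unnoticed.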
Since the precision of the automatic charset detection mechanism is important, improvements in arrangements and techniques for performing automatic charset detection are highly desirable.