In recent years, use of computers has increased dramatically worldwide. As a consequence, users of computers receive a variety of text and data files from many sources around the world. Such text and data files may be downloaded from any suitable computer readable medium or from a remote server or web site on the now familiar Internet.
In text or data files, the primary data content is characters, which include letters, numbers, punctuation marks, and other symbols or control codes that are represented in a code which is meaningful to the computer. A character is stored in a numeric representation sometimes referred to as its code-point. There exist mappings that assign numeric representations to characters. These mappings are typically known as encodings, or sometimes code pages. For example, in the US-ASCII encoding system, the numeric code-point "97" is "lowercase a" and in Japanese Shift-JIS, the code-point "33,484" (hexadecimal 82CC) is the hiragana character "no."
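The two mappings named above can be checked with a minimal Python sketch. It assumes Python's built-in "ascii" and "shift_jis" codecs as stand-ins for the encoding systems described; the variable names are illustrative only.

```python
# US-ASCII: the code-point 97 maps to lowercase "a".
ascii_a = bytes([97]).decode("ascii")

# Shift-JIS: the two-byte code-point 33,484 (0x82CC) maps to hiragana "no".
sjis_no = (33484).to_bytes(2, "big").decode("shift_jis")

print(ascii_a, sjis_no)
```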
Because different countries often have their own character sets and encoding systems, there are many different encoding systems, each with its own set of rules. If a computer's code reader does not have the proper set of rules, the code reader (and the computer) cannot reliably understand or correctly read the data. Without those rules there is no way to place a byte of data into context, so the data cannot be interpreted meaningfully.
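As a small illustration of reading data with the wrong set of rules, the following sketch decodes the same two bytes under three of Python's built-in codecs. The choice of codecs ("shift_jis", "cp1252", "ascii") is an assumption made for the example, not something the text prescribes.

```python
raw = bytes([0x82, 0xCC])

# Correct rules: the two bytes form a single Shift-JIS character.
as_sjis = raw.decode("shift_jis")

# Wrong rules: a Western European code page reads two unrelated characters.
as_cp1252 = raw.decode("cp1252")

# Wrong rules again: US-ASCII has no code-points above 127, so decoding fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```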
In the case of the Internet, international HTML documents are very common on the World Wide Web. These international documents are often written in languages other than English and are therefore often encoded with encoding systems other than US-ASCII. In order for the documents to be read and displayed correctly by a user's computer, the browsing software (User Agent or UA) must know the encoding system used to encode the data contained in the document. Because this is a common problem, there are mechanisms in HTML for communicating the encoding system of a Web document to the reader of that document. However, these mechanisms have not been commonly used in practice. To counteract this problem, UAs typically include some limited form of encoding system detection.
Past encoding system detection methods typically have focused on Japanese encoding systems. UAs also typically provide a means for their users to choose (and thereby force) a specific encoding system for the documents they browse. This assumes, however, that users are familiar with the encoding systems utilized to encode the data they are attempting to use.
Those encoding system detection methods do not fully attempt to read the data in the supported encoding systems. Instead they focus only on looking for invalid lead-bytes, invalid lead-byte/trail-byte combinations, or in the case of ISO-2022-JP (JIS), the special identifying byte sequences. No use has been made of additional information such as common character sequences or unmapped code-points. No attempts have been made to deal with ambiguous input.
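The byte-level checks described above can be sketched roughly as follows. This is an illustrative reconstruction, not any particular UA's implementation: the function names are hypothetical, and the byte ranges are the commonly documented Shift-JIS lead-byte and trail-byte ranges and ISO-2022-JP escape sequences.

```python
def is_sjis_lead(b: int) -> bool:
    # Commonly documented Shift-JIS lead-byte ranges (assumed here).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def is_sjis_trail(b: int) -> bool:
    # Commonly documented Shift-JIS trail-byte ranges (assumed here).
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFC

def passes_sjis_byte_check(data: bytes) -> bool:
    """Reject data containing an invalid lead byte or lead/trail pair."""
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F or 0xA1 <= b <= 0xDF:
            i += 1  # single-byte character (ASCII range or half-width katakana)
        elif is_sjis_lead(b):
            if i + 1 >= len(data) or not is_sjis_trail(data[i + 1]):
                return False  # missing or invalid trail byte
            i += 2
        else:
            return False  # byte that can never start a character
    return True

# Escape sequences that identify ISO-2022-JP (JIS) data.
JIS_ESCAPES = (b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J")

def has_jis_escape(data: bytes) -> bool:
    return any(esc in data for esc in JIS_ESCAPES)
```

Note that the EUC-JP bytes A4 CE (also the character "no") pass this Shift-JIS check, because each byte is individually valid as a half-width katakana code. A purely byte-level check therefore cannot separate the two encodings for such input, which is the kind of ambiguity these methods leave unresolved.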
Some statistics-based encoding system detection methods have been used which attempt to recognize common patterns of characters in text or data, for example, "es" or "the" in English. To configure such systems, statistics are typically gathered on a large sample of documents. These detection methods compare the data read against the patterns represented by the statistics to determine the likelihood of a match against a given encoding. The success of detecting the encoding system in an arbitrary document, however, depends on how closely that document resembles the data on which the statistics were gathered.
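A toy version of such a statistics-based method might look like the sketch below. Everything in it is an assumption made for illustration: the statistics are reduced to the set of two-character sequences seen in a tiny sample, the score is the fraction of a document's sequences found in that set, and the function names and candidate list are hypothetical.

```python
from collections import Counter

def bigram_counts(text: str) -> Counter:
    # "Statistics" here are simply counts of adjacent character pairs.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def score(text: str, model: Counter) -> float:
    # Fraction of the text's bigrams that also occur in the sample statistics.
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in model) / len(grams)

def guess_encoding(data: bytes, candidates, model: Counter):
    best, best_score = None, -1.0
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue  # data is not even valid in this encoding
        s = score(text, model)
        if s > best_score:
            best, best_score = enc, s
    return best

# Tiny stand-in for the "large sample of documents".
sample = "これは日本語の文章です。日本語のテキストの例です。"
model = bigram_counts(sample)

data = "日本語の例です".encode("shift_jis")
guess = guess_encoding(data, ["shift_jis", "euc_jp", "latin-1"], model)
```

With a sample this small, a document that uses different vocabulary would score poorly even under the correct encoding, which illustrates the dependence on the training data noted above.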
Therefore, there is a need in the art for an efficient method and system for detecting a particular encoding system from a variety of different encoding systems. There is also a need in the art for a method and system for detecting an encoding system by detecting invalid lead-bytes, invalid lead-byte/trail-byte combinations, and special identifying byte sequences in concert with detection of common character sequences or unmapped code-points. There is a further need in the art for a method and system for detecting an encoding system which resolves ambiguities between encoding systems and which uses statistical information about encoded data to augment encoding system detection.