It is common for modern computer systems to provide facilities for storing and processing data representing text. Bodies of data stored by a computer system that represent a textual document are referred to as "digital document representations." Digital document representations are stored in a computer system, like other data, as a series of values called "bytes." Text is converted to these byte values using a "character set"--a mapping between the identities of different characters, herein referred to as "character glyphs," and different byte values. Character sets, also referred to as "code pages," are generally defined by standards organizations, such as the American National Standards Institute ("ANSI") or the International Organization for Standardization ("ISO"). Some character sets, called "multiple-byte character sets," map each character glyph to a value composed of two or more bytes. It is generally possible to correctly display the document represented by a digital document representation only where the character set used to create the digital document representation is known. Likewise, converting a digital document representation from its current character set into a different target character set is typically possible only where the current character set is known.
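By way of illustration only, the following Python sketch shows why the character set must be known before a series of bytes can be displayed correctly. The sample strings and the particular character sets (Latin-1, UTF-8, and Shift JIS) are assumptions chosen for the example and are not drawn from the description above.

```python
# Illustrative only: the sample text and the chosen character sets are
# assumptions made for this sketch.
text = "schön"

# The same characters map to different byte values under different character sets.
latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # "ö" occupies two bytes

# Decoding with the character set that was used for encoding recovers the text.
assert utf8_bytes.decode("utf-8") == text

# Decoding the same bytes under a different character set raises no error here,
# but it yields the wrong characters.
misread = utf8_bytes.decode("latin-1")

# In a multiple-byte character set such as Shift JIS, each of these two
# characters maps to a two-byte value.
sjis_bytes = "日本".encode("shift_jis")

print(latin1_bytes, utf8_bytes, repr(misread), len(sjis_bytes))
```

The misread string is produced silently, which is why a display or conversion routine cannot in general detect on its own that it has been given the wrong character set.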
Text generally comprises a series of words each taken from one or more languages. Natural language processing tools, such as spelling checkers, grammar checkers, and summarizers, may be applied to such documents. In order to correctly process a document, however, these tools must be advised of the language or languages from which the words in the document are taken. For example, when a spell checker tool encounters the word "bitte" in a document known to be in German, it does not regard the word as misspelled. However, when the spell checker tool encounters the same word in a document known to be in English, it regards the word as a misspelling of the word "bitter." Some information retrieval tools, such as word breakers (which identify the boundaries between words) and word stemmers (which remove suffixes in order to match different words having the same root), also must be advised of the language or languages occurring in digital document representations upon which these tools operate. Beyond the needs of automated tools, knowledge of the language in which a document is written is useful to human readers. A reader may be able to read only one or a handful of the large number of languages in which documents are written, and knowing the language of a document allows the reader to determine whether he or she will be able to read it.
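The "bitte" example can be sketched as follows. This is a minimal illustration, assuming tiny hypothetical word lists in place of real dictionaries; the names WORD_LISTS and is_misspelled are introduced here for the example only and do not describe any particular spell checker.

```python
# Hypothetical, tiny word lists standing in for real per-language dictionaries.
WORD_LISTS = {
    "de": {"bitte", "danke", "schön"},
    "en": {"bitter", "please", "thanks"},
}


def is_misspelled(word: str, language: str) -> bool:
    """Return True if the word is absent from the word list for the given language."""
    return word.lower() not in WORD_LISTS[language]


# The verdict on the same word depends entirely on the language supplied.
print(is_misspelled("bitte", "de"))
print(is_misspelled("bitte", "en"))
```

The function has no means of choosing a word list by itself; the language must be supplied by the caller, which is the dependency the paragraph above describes.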
Thus, it is generally desirable for each digital document representation that is stored to be accompanied by an explicit indication of the character set used to generate it and the language or languages from which its words are taken. While such information is stored for many digital document representations, especially those that have been created recently, it is unavailable for many other digital document representations. For example, many of the HTML documents available via the World Wide Web fail to identify their character sets and languages.
In the case of some digital document representations, information identifying the character set and language of the digital document representation has never been associated with the digital document representation. This is often the case where this information was originally implied by the identity of the computer on which the digital document representation was stored. For example, this information is implicit in digital document representations originally created in a single-language, single-character set environment. When such digital document representations are moved to a computer system that uses several languages and character sets, or made available to such computer systems via a network such as the Internet, the character set and language of such digital document representations are unavailable.
For other digital document representations, information identifying the character set and language of the digital document representation was at some point associated with the digital document representation, but is not presently available. For instance, such information may be stored in a separate file that is at some point deleted. On the other hand, the information may still be in existence, but nonetheless be unavailable. For instance, the file containing the information may be inaccessible to the user or program trying to determine the character set and language of the digital document representation. Such information may further be accessible, but be in a format that is unintelligible to the user or program seeking to determine the character set and language of the digital document representation. Thus, for a variety of reasons, the character set and language of a digital document representation may be unavailable.
Because the language and character set needed to display and process digital document representations are frequently unavailable, an automated approach to discerning the character set and language or languages of a digital document representation, especially one that has reasonable storage requirements and is straightforwardly extensible to new character sets and languages, would have significant utility.