Automatic language identification is, in general, the process of examining unlabeled data and determining the language or languages of any linguistic content it may contain. Examples of automatic language identification applied to varied data types, including speech data, images that may contain text, and textual data, can be found in both research and industry. Here we are concerned only with automatic language identification as applied to textual data.
Automatic language identification is commonly used for identifying the language used in an unknown document, for example a web page obtained from the internet. Many text document formats include mechanisms by which they may be manually labeled as to their language, but these mechanisms often are not used or contain unreliable information, so automatic language identification may often be needed. In many cases this is combined with the detection of the text encoding in use, since mechanisms for labeling encodings suffer from the same problems. Automatic language identification is often used in data mining applications, which may need to scan a large collection of heterogeneous documents; for example, Google is known to use automatic language identification as part of its initial processing phase when it reads web pages to be indexed.
Automatic language identification of this sort typically uses a combination of methods, notably methods based on gathering statistics about characters and combinations of characters, and dictionary-based methods using word lists from various languages. These methods are all fairly well known, and there is a significant body of research about them. Apple Inc. shipped an automatic language identification API with Mac OS X starting in 10.5.
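By way of illustration only, the combination of character statistics and word lists described above might be sketched as follows. The sample sentences, word lists, function names, and the equal weighting of the two scores are all assumptions made for this example; they are not taken from any particular product, and a practical system would train on far larger corpora.

```python
from collections import Counter

# Toy training text and word lists. These are illustrative assumptions and are
# far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sits with this and that",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat est dans la maison",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist in dem haus",
}
WORD_LISTS = {
    "en": {"the", "and", "of", "is", "with", "this", "that"},
    "fr": {"le", "la", "et", "est", "dans", "les", "des"},
    "de": {"der", "die", "und", "ist", "den", "dem", "das"},
}


def trigrams(text):
    """Count character trigrams in lowercased, whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# One trigram profile per language, built from the toy samples.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}


def ngram_score(text, profile):
    """Fraction of the text's trigram occurrences that also appear in the profile."""
    grams = trigrams(text)
    total = sum(grams.values())
    if total == 0:
        return 0.0
    return sum(count for gram, count in grams.items() if gram in profile) / total


def dictionary_score(text, words):
    """Fraction of whitespace-separated tokens found in the language's word list."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)


def identify_language(text, ngram_weight=0.5):
    """Return (best_language, scores) using a weighted sum of the two scores."""
    scores = {
        lang: ngram_weight * ngram_score(text, PROFILES[lang])
        + (1 - ngram_weight) * dictionary_score(text, WORD_LISTS[lang])
        for lang in PROFILES
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for phrase in ("the dog and the cat", "le chat est dans la maison"):
        language, _ = identify_language(phrase)
        print(phrase, "->", language)
```

A weighted sum is only one of many possible ways to combine the two kinds of evidence; it is used here because it keeps the sketch short.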
Word processing and other text document applications often provide various features that depend on language, such as spelling and grammar checking, hyphenation, and so forth. However, these applications usually require that documents or portions of documents be manually labeled as to their language in order for these features to work correctly in general. Typically a default language will be chosen based on the user's preference, and text in any other language will need to be manually labeled; in general an arbitrary portion of text, as small as a paragraph, sentence, or single word, can be so marked.
Microsoft Word does not appear to use automatic language identification at all. Arbitrary portions of text may be manually labeled as to their language, and this language is used for spelling and grammar checking, and for various other processes, either immediately as the user types or subsequently when processing is requested. Microsoft Word is typical of most applications in its class in this regard.
Google Docs appears to use automatic language identification for spellchecking, but only on a whole-document basis; users may choose either a single language to be used for spellchecking an entire document, or “Auto”, and in the “Auto” case a single language is chosen automatically for the entire document. Google Docs apparently uses this language information only for spellchecking, and spellchecking is performed only when manually requested, not immediately while the user types.
A text system (“Cocoa Text System”) in a prior version of Mac OS X included a spellchecking feature, referred to as multilingual spellchecking, that is similar in some ways to automatic language identification. When multilingual spellchecking is turned on, a word is identified as correctly spelled if it is correct in any of the languages known to the spellchecker. However, multilingual spellchecking does not use automatic language identification to identify the language of the text from context before spellchecking; it merely assigns a misspelled word the language in which the most recent preceding word was found to be correctly spelled. In addition, this existing multilingual capability applies only to spellchecking and not to any other feature.
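The multilingual spellchecking behavior described above can be approximated by the following sketch. It is not the Cocoa Text System implementation; the function name, the toy dictionaries, the default language, and the rule for choosing among several matching languages are assumptions made purely for illustration.

```python
# Toy dictionaries (illustrative assumptions, not real spellchecker data).
DICTIONARIES = {
    "en": {"the", "cat", "sleeps"},
    "fr": {"le", "chat", "dort"},
}


def multilingual_spellcheck(words, dictionaries, default_language):
    """Return a list of (word, is_correct, language) tuples.

    A word counts as correct if any dictionary contains it. A misspelled word
    is given the language of the most recent correctly spelled word, or the
    default language if no word has been found correct yet. No contextual
    language identification is performed.
    """
    results = []
    last_language = default_language
    for word in words:
        matches = [
            lang for lang, vocabulary in dictionaries.items()
            if word.lower() in vocabulary
        ]
        if matches:
            # Assumption: when several languages accept the word, stay with
            # the previous language if possible, otherwise take the first match.
            if last_language not in matches:
                last_language = matches[0]
            results.append((word, True, last_language))
        else:
            results.append((word, False, last_language))
    return results


if __name__ == "__main__":
    text = ["the", "cat", "slepps", "le", "chat", "dortt"]
    for entry in multilingual_spellcheck(text, DICTIONARIES, "en"):
        print(entry)
```

In this sketch the misspelling “slepps” is attributed to English and “dortt” to French solely because of the correctly spelled words that precede them, which illustrates why such a scheme falls short of identifying language from context.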