It is often useful or necessary to determine which of the several languages present in a document (e.g., a web page) is the primary language. Documents that contain more than one language are referred to as multilingual. The task of an automatic language detection system is to identify the primary language of a document and any additional languages of which it is composed. A search engine uses the language composition of a document as one factor in determining how relevant the document is to a particular query. Some existing systems are designed to output, in addition to the primary language, a list of languages ranked by confidence, but they may not be able to specify which of those languages are actually present in the document.
Such limitations lower the effectiveness of language detection for multilingual documents because they may lead to incorrect word-breaking. A word-breaker identifies the individual words of text in a given language by determining where word boundaries fall according to the linguistic rules of that language. Using a language-specific word-breaker makes the resulting terms more accurate for that language. For a multilingual document, the primary language is typically determined first, and a word-breaker for that language is then applied to the entire document. As a result, substantial non-primary-language portions of the document are broken into words improperly.
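The effect described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy word-breakers, the `WORD_BREAKERS` registry, the function names, and the sample text are hypothetical, and per-character splitting is only a crude stand-in for a real breaker for an unsegmented script. The per-segment function is included solely for contrast with the conventional whole-document approach.

```python
def break_whitespace(text):
    """Toy breaker for space-delimited languages such as English."""
    return text.split()


def break_per_character(text):
    """Toy breaker for an unsegmented script: one term per non-space character."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical registry mapping language codes to word-breakers.
WORD_BREAKERS = {"en": break_whitespace, "zh": break_per_character}


def conventional_break(document, primary_language):
    """Apply the primary language's breaker to the entire document."""
    return WORD_BREAKERS[primary_language](document)


def per_segment_break(segments):
    """For contrast: break each (language, text) segment with its own breaker."""
    terms = []
    for language, text in segments:
        terms.extend(WORD_BREAKERS[language](text))
    return terms


# Made-up sample: a mostly English document with one Chinese phrase.
segments = [("en", "language detection for search"), ("zh", "语言检测")]
document = " ".join(text for _, text in segments)

conventional_terms = conventional_break(document, "en")
segment_terms = per_segment_break(segments)

print(conventional_terms)  # the Chinese phrase survives as one unbroken term
print(segment_terms)       # the Chinese phrase is split by its own breaker
```

With the English breaker applied to the whole document, the Chinese phrase comes out as a single term, which is the kind of improper word-breaking of non-primary-language text that the paragraph describes.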
Conventionally, all portions of a document are treated equally when determining its primary language, which introduces further limitations. In practice, some portions of a document are more important or more informative than others. For example, a copyright statement is generally less informative about the document as a whole than the title is. Giving these different parts of the document the same weight can result in the wrong primary language being assigned, particularly for shorter texts.
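The following Python sketch illustrates how equal weighting can tip the decision in a short document. It assumes that each portion has already been labeled with a language by some detector, and it scores languages by weighted word count. The function name `primary_language`, the portion kinds, the weight values, and the sample text are all hypothetical and chosen only to make the effect visible.

```python
from collections import defaultdict


def primary_language(portions, weights=None):
    """Return the language with the highest total (weighted) word count.

    portions: list of (kind, language, text) tuples; the language label
              is assumed to come from a per-portion detector.
    weights:  optional dict mapping portion kind to a weight; when None,
              every portion counts equally (the conventional approach).
    """
    totals = defaultdict(float)
    for kind, language, text in portions:
        weight = 1.0 if weights is None else weights.get(kind, 1.0)
        totals[language] += weight * len(text.split())
    return max(totals, key=totals.get)


# Made-up short page: French title and body, long English copyright notice.
portions = [
    ("title", "fr", "Recette de crêpes"),
    ("body", "fr", "Mélangez la farine, les œufs et le lait."),
    ("copyright", "en",
     "Copyright 2010 Example Corp. All rights reserved. "
     "Reproduction in whole or in part is prohibited."),
]

# Hypothetical weights: the title counts more, the copyright notice less.
WEIGHTS = {"title": 3.0, "body": 1.0, "copyright": 0.2}

print(primary_language(portions))           # en
print(primary_language(portions, WEIGHTS))  # fr
```

With equal weights, the 15-word English copyright notice outweighs the 11 words of French content, so the page is labeled English. Once the title is weighted up and the copyright notice weighted down, French wins.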