The present invention relates in general to automated language identification techniques and in particular to automated identification of documents as not belonging to any language.
With the proliferation of computing devices and communication networks such as the Internet, an ever increasing amount of information is stored in the form of electronic documents. Such documents might be generated using application software such as word processing programs, e-mail programs, web page development tools, etc. Electronic documents can also be generated by scanning paper documents and employing optical character recognition (“OCR”) or other techniques to create an electronic representation of the content.
It is often necessary to search through a large collection of electronic documents to find information relevant to a particular question. For example, a number of search services provide interfaces via which users can search electronic documents that are accessible via the World Wide Web. In another context, discovery in civil litigation usually involves the production of massive quantities of electronic documents that the receiving party must sift through.
Electronic documents can exist in any human language, and search processes are greatly facilitated if the language of a document is known. For example, in the case of Asian languages, parsing the document into words is non-trivial as most Asian languages do not include a space character between words. Thus, it is helpful to determine which language such documents are in so that they can be correctly parsed into words. As another example, a character string or word might have different meanings in different languages, and search results are generally improved if the language of the documents is known.
A number of automated techniques have been developed to identify the language of a document. Many of these techniques fall into two categories: dictionary-based and n-gram based. In dictionary-based language identification, a “dictionary” is assembled for each of a number of candidate languages, often by analyzing training documents known to be in that language. Each training document is parsed into “words” (e.g., based on word-break indicators such as space characters and/or punctuation characters), and a frequency analysis is performed on the words to develop a frequency profile for that language. The dictionary for each language can be limited to a relatively small number of commonly occurring words (often short words, e.g., 5 characters or fewer) in that language. The language of an unknown document is determined by parsing the unknown document into words and determining a frequency profile for the unknown document. This frequency profile for the unknown document is compared to the profiles for the various candidate languages, and the language with the best match is identified as the language of the document. Dictionary-based techniques can work well for western languages but often fail with Asian languages, since the documents cannot be reliably parsed into words until the language is known.
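The dictionary-based approach described above can be sketched as follows. This is a minimal illustration, not a complete implementation: the per-language dictionaries of common short words and the overlap-based scoring are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical dictionaries of commonly occurring short words for each
# candidate language (in practice, derived from training documents).
DICTIONARIES = {
    "english": {"the", "and", "of", "to", "in", "is", "a"},
    "french": {"le", "la", "et", "de", "un", "une", "est"},
    "german": {"der", "die", "und", "das", "ist", "ein", "zu"},
}

def identify_language(text):
    # Parse the document into words using word-break indicators
    # (spaces and punctuation are treated as separators).
    words = re.findall(r"[a-zà-ÿäöüß]+", text.lower())
    counts = Counter(words)
    total = sum(counts.values()) or 1
    scores = {}
    for lang, dictionary in DICTIONARIES.items():
        # Score each language by the fraction of word occurrences in the
        # document that appear in that language's dictionary.
        matched = sum(c for w, c in counts.items() if w in dictionary)
        scores[lang] = matched / total
    # The language with the best-matching frequency profile wins.
    return max(scores, key=scores.get)
```

As the passage notes, a sketch like this presupposes that the document can be split into words at all, which is why the approach degrades on languages written without inter-word spaces.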
In n-gram based language identification, the document is parsed into n-character units for some integer n, rather than into words. Typically, n is chosen to be a small number such as 2 or 3, and the n-grams overlap; thus, for example, the word “patent” can be parsed into bigrams (i.e., n-grams with n=2) as “_p”, “pa”, “at”, “te”, “en”, “nt”, “t_”, where “_” denotes the space character. Using a set of training documents in each candidate language, an n-gram frequency profile can be developed for each candidate language. The language of an unknown document can be determined by analyzing the frequency of n-grams in the document and comparing it to the frequency profiles of the candidate languages. Using n-grams, particularly bigrams, can significantly reduce the size of the language model, as there are typically fewer possible bigrams than words in a given language. In addition, n-gram analysis does not require prior knowledge of where the word boundaries are, making it particularly suitable for analyzing Asian languages.
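The bigram variant of this approach can be sketched as follows. The training snippets and the cosine-similarity comparison of frequency profiles are illustrative assumptions; other profile-comparison measures could equally be used.

```python
import math
from collections import Counter

def bigrams(text):
    # Pad with a space so word-initial and word-final characters are
    # captured, mirroring the "_p" ... "t_" parsing of "patent".
    padded = " " + text.lower() + " "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))

def cosine(p, q):
    # Cosine similarity between two bigram frequency profiles.
    dot = sum(p[g] * q.get(g, 0) for g in p)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Hypothetical training text per candidate language; in practice the
# profiles would be built from a large set of training documents.
PROFILES = {
    "english": bigrams("the quick brown fox jumps over the lazy dog"),
    "german": bigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def identify_language_ngram(text):
    profile = bigrams(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Note that `bigrams` never needs to locate word boundaries, which is the property that makes the n-gram approach applicable to scripts written without inter-word spaces.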
Both techniques have usually assumed that the unknown document is in a natural language (which means, generally, a language as developed and used by human beings). In reality, some documents are not in any natural language. For example, program source code or computer scripts are generally written in a specialized computer language that may use words from a natural language but does not employ the grammar or syntax of natural language. Likewise, address lists, spreadsheets, and other data-oriented documents may be said to be in no natural language. Documents that are not in a natural language are referred to herein as “junk” documents. It is to be understood that such documents are “junk” only in the sense that they should not be identified as belonging to any natural language; the documents themselves may be of considerable value to particular searchers or reviewers of documents.
In related fields, there has been some interest in detection of unwanted messages (referred to as “spam”) in e-mail and in comments posted by users on interactive websites such as blogs. Detection techniques for e-mail spam generally rely on features such as source IP address, presence of suspect keywords, and the overall distribution of words, rather than language modeling. Some detection techniques for “comment spam” do rely on comparing a language model derived from the content being commented on with a language model derived from the comment itself. Such comparisons, however, may result in the rejection of legitimate comments (e.g., if the comment uses different words from the original content). They also do not consider the possibility of multilingual content and/or comments, which would also result in diverging language models. Further, these techniques would not distinguish between a comment that was in a different natural language from the original content and a comment that was not in a natural language at all.