Although the origins of the Internet trace back to the late 1960s, the more recently developed Worldwide Web (“Web”), together with the long-established Usenet, has revolutionized access by a worldwide audience to untold volumes of information stored in electronic form, including written, spoken (audio) and visual (imagery and video) information, in both archived and real-time formats. The Worldwide Web provides information via interconnected Web pages that can be navigated through embedded hyperlinks. The Usenet provides information in a non-interactive bulletin board format consisting of static news messages posted and retrievable by readers. In short, the Web and Usenet give potentially any connected user desktop access to a virtually unlimited library of information in almost every language worldwide.
Information exchange on both the Web and Usenet operates under a client-server model. For the Web, individual clients typically execute Web browsers to retrieve and display Web pages in a graphical user environment. For the Usenet, individual clients generally execute news readers to retrieve, post and display news messages, usually in a textual user environment. Both Web browsers and news readers interface to centralized content servers, which function as data dissemination, storage and retrieval repositories. Since the Web and Usenet can provide information from sources located worldwide, users often specify language preferences via their Web browsers and news readers, and these preferences are sent to the content servers as part of each information request.
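By way of illustration only, the language preferences that accompany a Web request can be pictured as an HTTP Accept-Language header. The text above does not name a specific header, so the choice of Accept-Language, the function name parse_accept_language, and the sample header value are assumptions made for this sketch.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Return (language_tag, weight) pairs, highest weight first."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        # Each item looks like "fr" or "fr;q=0.9".
        tag, _, params = item.partition(";")
        weight = 1.0  # a tag without a q parameter has the default weight
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # assumption: malformed weights rank last
        preferences.append((tag.strip().lower(), weight))
    # sorted() is stable, so tags with equal weight keep their header order.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical header value sent by a browser.
    print(parse_accept_language("fr-CH, fr;q=0.9, en;q=0.8, *;q=0.5"))
```

A content server or search engine receiving such a header could consult the first entry as the user's preferred language and fall back to later entries when no content in that language is available.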
News messages available via the Usenet are cataloged into specific news groups, and finding relevant content involves straightforward searching of news groups and message lists. Web content, however, is not organized in any structured manner, and search engines have evolved to enable users to find and retrieve relevant Web content, as well as news messages. As the amount and types of Web content have increased, the sophistication and accuracy of search engines have likewise improved. Existing methods used by search engines are based on matching search query terms to terms indexed from Web pages. More advanced methods determine the importance of retrieved Web content using, for example, a hyperlink structure-based analysis, such as described in S. Brin and L. Page, “The Anatomy of a Large-Scale Hypertextual Search Engine,” (1998) and in U.S. Pat. No. 6,285,999, issued Sep. 4, 2001 to Page, the disclosures of which are incorporated by reference.
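The term-matching approach mentioned above can be sketched with a small inverted index. This is a minimal illustration, not the method of any particular search engine; the function names build_index and search, the whitespace tokenization, and the requirement that every query term match are all assumptions of the sketch.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased term to the set of page identifiers containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the pages that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


if __name__ == "__main__":
    # Hypothetical page contents keyed by a short identifier.
    pages = {
        "a": "language identification for web pages",
        "b": "web search engines",
        "c": "usenet news messages",
    }
    index = build_index(pages)
    print(search(index, "web pages"))
```

Term matching of this kind only establishes which pages are candidates; the hyperlink structure-based analysis cited above addresses the separate question of how important each candidate is.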
Search engines operate in two capacities, both of which require identifying the language in which the content is expressed. First, search engines collect information about potentially retrievable Web content and news messages. As information collectors, search engines gather information about Web pages and news messages available from content servers located worldwide. Language identification is necessary to properly catalog the gathered information in a standardized representation to facilitate efficient indexing, storage and retrieval. Language identification information, though, is limited and often unreliable. For example, parameters specifying the top level domain, character set encoding, response message headers, and embedded hypertext tags can provide some indication of the language used in the attached content, but default parameter settings can misidentify the language and cause the content to be cataloged incorrectly.
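To make the unreliability of such parameters concrete, the following sketch gathers the four kinds of hints named above for a single page. The function name language_hints, the tiny TLD_LANGUAGE table, the use of the Content-Language and Content-Type headers, and the sample page are hypothetical choices for illustration.

```python
import re

# Hypothetical, deliberately tiny mapping from top level domain to language.
TLD_LANGUAGE = {"de": "de", "fr": "fr", "jp": "ja"}


def language_hints(host: str, headers: dict[str, str], html: str) -> dict[str, str]:
    """Collect language hints; header names are assumed to be lowercased."""
    hints = {}
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in TLD_LANGUAGE:
        hints["tld"] = TLD_LANGUAGE[tld]
    content_language = headers.get("content-language")
    if content_language:
        # Keep only the first language listed in the header.
        hints["header"] = content_language.split(",")[0].strip().lower()
    match = re.search(r"charset=([\w-]+)", headers.get("content-type", ""), re.I)
    if match:
        hints["charset"] = match.group(1).lower()
    match = re.search(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', html, re.I)
    if match:
        hints["tag"] = match.group(1).lower()
    return hints


if __name__ == "__main__":
    # A German page whose server and template were left at English defaults.
    hints = language_hints(
        "example.de",
        {"content-type": "text/html; charset=ISO-8859-1", "content-language": "en"},
        '<html lang="en"><body>Guten Tag</body></html>',
    )
    print(hints)
```

In this made-up case the top level domain suggests German while the header and tag both report English, and the character set is compatible with either, so no single hint can be trusted on its own.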
Second, search engines disseminate requested Web content and news messages. As information sources, search engines generally strive to provide the highest quality results in response to a search query. Determining quality, though, is difficult, as the relevance of retrieved Web content is inherently subjective and dependent upon the interests, knowledge and attitudes of the user. Quality can be improved in several ways. For instance, a search engine can provide results in a language best suited to the preferences of the requesting user. Similarly, a search engine can translate Web content into a preferred language, where possible.
A typical search query scenario begins with either a natural language question or individual keywords submitted to a search engine. The search engine executes a search against a data repository describing information characteristics of potentially retrievable Web content or news messages and identifies candidate results. Searches can often return thousands or even millions of results, so search engines typically rank or score only a subset of the most promising results. The top results are then presented to the user, usually in the form of Web content or news message titles, hyperlinks, and other descriptive information, such as snippets of text taken from the results.
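The last two steps of this scenario, selecting the top-scoring candidates and presenting them with titles, hyperlinks and snippets, might be sketched as follows. The function name present_top_results, the record fields, the precomputed scores, and the fixed-length snippet are assumptions for illustration; how scores are actually computed is outside the sketch.

```python
import heapq


def present_top_results(candidates: list[dict], k: int = 3, snippet_length: int = 40) -> list[dict]:
    """Pick the k highest-scoring candidates and format them for display."""
    # nlargest avoids fully sorting a potentially very long candidate list.
    best = heapq.nlargest(k, candidates, key=lambda c: c["score"])
    return [
        {
            "title": c["title"],
            "url": c["url"],
            # Assumption: the snippet is simply the start of the page text.
            "snippet": c["text"][:snippet_length],
        }
        for c in best
    ]


if __name__ == "__main__":
    # Hypothetical candidates with scores assigned by an earlier step.
    candidates = [
        {"title": "A", "url": "http://example.com/a", "text": "alpha " * 20, "score": 0.2},
        {"title": "B", "url": "http://example.com/b", "text": "beta " * 20, "score": 0.9},
        {"title": "C", "url": "http://example.com/c", "text": "gamma " * 20, "score": 0.5},
        {"title": "D", "url": "http://example.com/d", "text": "delta " * 20, "score": 0.7},
    ]
    for result in present_top_results(candidates, k=2):
        print(result)
```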
Known text-based approaches to identifying the language in which the content is expressed rely purely on the content itself, typically using machine-based learning methodologies. Language-identifying hints, such as top level domain and tags, are ignored. Moreover, known text-based approaches operate on content encoded using Western language character sets, which assume fixed-length encodings of one byte per character. Multiple-length and variable-length character encodings, such as those used for East Asian languages, including Chinese, Japanese and Korean, are not supported.
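The difference between one-byte-per-character and variable-length encodings can be seen directly by measuring how many bytes each character occupies. The sketch below uses Python's standard codecs; the helper name bytes_per_character and the sample strings are chosen only for illustration.

```python
def bytes_per_character(text: str, encoding: str) -> list[int]:
    """Return the number of bytes each character occupies in the given encoding."""
    return [len(character.encode(encoding)) for character in text]


if __name__ == "__main__":
    # A Western character set: every character is exactly one byte.
    print(bytes_per_character("café", "latin-1"))

    # Shift_JIS mixes one-byte ASCII with two-byte Japanese characters.
    print(bytes_per_character("A日本語", "shift_jis"))

    # UTF-8 is variable length: one, two and three bytes here.
    print(bytes_per_character("aé日", "utf-8"))

    # Reading Shift_JIS bytes under a one-byte-per-character assumption
    # yields twice as many "characters" as the text actually contains.
    raw = "日本語".encode("shift_jis")
    print(len("日本語"), len(raw.decode("latin-1")))
```

An analysis that treats each byte as one character would therefore compute its statistics over fragments of characters when given East Asian content, which is consistent with the limitation described above.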
Accordingly, there is a need for efficiently identifying languages for content, including Web content and news messages, using probabilistic pattern analysis. There is a further need for determining document languages based on arbitrary groups of information elements and text analysis.