This specification relates to digital data processing, and particularly to processing resources to identify languages that are relevant to the resources.
The Internet provides access to a wide variety of resources, for example, video or audio files, web pages for particular subjects, book articles, and news articles. Because the Internet connects all parts of the world, the content of these resources is expressed in many different languages. Consequently, systems have been developed to determine the language in which a particular resource is written. These systems process the content of a resource, e.g., the text of the resource and/or the encoding of the text of the resource, to determine the language of the resource.
Determining a language of a resource is especially useful for search engine processing. A search engine selects resources in response to a user query that includes one or more search terms or phrases, and ranks the resources based on their importance and their relevance to the query. When language data that specifies the language of a resource is available to the search engine, the search engine can use the language data in a relevance calculation. For example, a resource written in French will be more relevant to a query written in French than would a similar resource written in English, and thus the French resource will usually be ranked higher than the English resource.
A resource written in a particular language, however, may be of interest to users of different languages. Thus, systems have also been developed to identify the languages that are relevant to a resource. One example system determines relevant languages for a resource by associating each incoming resource link to the resource with the language of the content of the link's source resource. For example, a web page may be the target of five resource links from five different source web pages. For each resource link pointing to the web page, the language of the content (e.g., text) of the source web page that includes the resource link is determined and associated with the resource link. If the percentage of all incoming resource links associated with a given language exceeds a threshold percentage, then that language is determined to be relevant to the resource.
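The link-based approach described above can be illustrated with a minimal sketch. It assumes that the language of each source page has already been determined by some other means; the function name `relevant_languages`, the language codes, and the 50% threshold are hypothetical choices made for illustration and are not taken from any particular system.

```python
from collections import Counter


def relevant_languages(link_languages, threshold=0.5):
    """Return the languages whose share of incoming links exceeds `threshold`.

    link_languages: one language code per incoming resource link, i.e. the
        language of the source page that contains the link (assumed known).
    threshold: fraction of all incoming links (illustrative value only).
    """
    if not link_languages:
        return set()
    counts = Counter(link_languages)
    total = len(link_languages)
    # A language is relevant when its fraction of all incoming links
    # is strictly greater than the threshold.
    return {lang for lang, n in counts.items() if n / total > threshold}


# Five incoming links: three from French pages, one English, one German.
incoming = ["fr", "fr", "fr", "en", "de"]
print(relevant_languages(incoming, threshold=0.5))  # {'fr'}
```

In this sketch, a target page whose five incoming links each come from a page in a different language would yield an empty result at the same threshold, since no single language accounts for more than 20% of the links.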
Often, however, it is difficult to detect precisely the language of the resource or the languages that are relevant to the resource. Many resources have an insufficient amount of text or other content for conventional content-based language identification methods. For example, a fully framed website may contain no text on the home page, thus hindering conventional content-based language identification methods. Likewise, the home page of the website may be the target of several incoming resource links, and each of the resource links may be associated with a different language. Thus, relevant languages may not be accurately determined from the incoming resource links.
Additionally, a resource can include text in many different languages. The presence of text in many different languages in the resource, however, does not necessarily mean that the resource is relevant to all of those languages.