1. Technical Field
The present disclosure relates to locating bilingual web pages and more specifically to efficiently crawling linked documents to discover bilingual web pages and bilingual document pairs.
2. Introduction
Recently, interest has grown in sources of professional-quality parallel text in two or more languages for tasks such as machine translation and cross-language information retrieval. Although previous work addresses many aspects of this problem, including document pair selection and sentence and word alignment, the problem of efficiently discovering bilingual data sources on large-scale networks, such as the World Wide Web, has not been adequately addressed.
To make the search for parallel text more feasible, previous approaches rely on the assumption that parallel texts mainly occur within web sites. Thus, the search for parallel text can be divided into two steps. The first step is to locate bilingual sites, and the second step is to extract the parallel text from them. Previous approaches mainly focus on the second step and do not adequately address the first. Instead, previous work restricts the crawler to a top-level Internet domain (TLD) expected to contain a high concentration of bilingual sites. For instance, previous approaches confine the crawler to a particular TLD, such as .de, when searching for German/English language pairs.
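As a rough illustration of the TLD restriction described above, the following sketch shows one way a crawler might decide whether a URL is in scope. The function name `in_tld` and the example URLs are assumptions made for illustration and do not come from any particular prior system.

```python
from urllib.parse import urlsplit


def in_tld(url, tld="de"):
    """Return True if the URL's host name falls under the given TLD.

    Hypothetical helper: a crawler restricted to one TLD would skip
    every URL for which this returns False.
    """
    host = urlsplit(url).hostname or ""
    # Drop a trailing dot (fully qualified form) before comparing suffixes.
    return host.lower().rstrip(".").endswith("." + tld.lower())
```

A filter of this kind is cheap to apply, but it excludes every bilingual site hosted outside the chosen TLD, such as a German/English site under .com.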
Previous approaches detect bilingual sites by extracting anchor text and image alt text and matching them against a predefined list of strings in the languages of interest. If a web page contains at least two matched links in different languages, it is classified as bilingual. The main weakness of this approach is low recall, because bilingual sites that use patterns not represented in the predefined list are not detected. Another solution uses a language identifier to verify whether bilingual text appears on pages in the top 3 or 4 levels of the web site. This approach can be very costly in terms of storage, bandwidth, and/or processing because it may need to download a considerable portion of the web site to make its decision.
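The list-matching approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reconstruction of any specific prior system: the pattern list `LANGUAGE_PATTERNS`, the class `LinkTextExtractor`, and the functions `matched_languages` and `looks_bilingual` are hypothetical names, only alt text of images inside links is considered, and matching is exact after lowercasing.

```python
from html.parser import HTMLParser

# Hypothetical predefined list; a real system would carry many more strings.
LANGUAGE_PATTERNS = {
    "en": {"english", "english version"},
    "de": {"deutsch", "deutsche version"},
}


class LinkTextExtractor(HTMLParser):
    """Collects, for each <a> element, its anchor text and image alt text."""

    def __init__(self):
        super().__init__()
        self.link_texts = []
        self._in_anchor = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._parts = []
        elif tag == "img" and self._in_anchor:
            alt = dict(attrs).get("alt")
            if alt:
                self._parts.append(alt)

    def handle_data(self, data):
        if self._in_anchor:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Collapse whitespace and lowercase before matching.
            text = " ".join(" ".join(self._parts).split()).lower()
            self.link_texts.append(text)
            self._in_anchor = False


def matched_languages(html):
    """Return the set of languages whose patterns match some link text."""
    parser = LinkTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        lang
        for lang, patterns in LANGUAGE_PATTERNS.items()
        for text in parser.link_texts
        if text in patterns
    }


def looks_bilingual(html):
    """A page is flagged when links match patterns in two different languages."""
    return len(matched_languages(html)) >= 2
```

Under these assumptions, a page whose language switcher reads "English" and "Deutsch" is flagged, whereas a page that labels the same links "EN" and "DE" is missed because neither string is in the list. This is the recall problem noted above.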
Along the same lines, one existing approach obtains two sets of candidate sites by issuing queries such as anchor: “english version” to a search engine, and then takes the union of the two sets. Another approach discovers document pairs by first selecting the top words in a source-language document, translating these words, and issuing them as a query to a search engine. The main limitation of these approaches is that they rely solely on search engine results to obtain the parallel pages. Because search engines restrict both the total number of results per query and the number of requests, the rate at which sites can be processed in this way is extremely limited.
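The query-construction step of the second approach above might look like the following sketch. Several details are assumptions made for illustration: "top words" is taken to mean the most frequent non-stopwords, translation is a lookup in a toy dictionary (`TOY_DICTIONARY`), and the function names `top_words` and `build_query` are hypothetical. Submitting the query to a search engine is omitted.

```python
import re
from collections import Counter

# Hypothetical resources; real systems would use full stopword lists
# and a proper bilingual lexicon or translation service.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}
TOY_DICTIONARY = {"house": "haus", "garden": "garten", "water": "wasser"}


def top_words(text, k=3):
    """Return the k most frequent non-stopwords in the source text."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return [word for word, _ in Counter(words).most_common(k)]


def build_query(text, k=3):
    """Translate the top words and join them into a search query string."""
    translated = [TOY_DICTIONARY[w] for w in top_words(text, k) if w in TOY_DICTIONARY]
    return " ".join(translated)
```

Each source document produces at least one search engine request under this scheme, which is why per-query result caps and request limits bound the overall throughput.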
Further, some previous approaches rely on hand-picked bilingual web pages, requiring significant amounts of human knowledge, time, and effort. These approaches do not scale well, add cost, and can introduce inaccurate information through human error. These and other problems hinder the identification of bilingual web pages.