Parallel bilingual corpora, as used herein, refers to textual data in a first language that is identified as a translation of textual data in a second language. For the sake of example, the textual data discussed herein is documents, but other textual data can be used as well.
When one document is a translation of another document, the two documents are referred to as parallel, bilingual documents. Therefore, parallel, bilingual corpora refers to a first corpus of data in a first language and a second corpus of data in a second language, wherein the second corpus is a translation of the first corpus.
Within a set of parallel documents, sentences in those documents which are translations of one another are often identified. These are referred to as aligned sentences. Therefore, if a document in a first language coincides with a parallel document in a second language (i.e., they are parallel), and the sentences in the two documents are aligned with one another (in that a sentence in the first language is aligned with its translation in the second language) then the two documents are referred to as parallel, sentence-aligned, bilingual documents.
There is currently a widespread need for parallel, bilingual corpora. For instance, such corpora are often critical resources for training statistical machine translation systems, and for performing cross-lingual information retrieval. Additionally, some such corpora have even been exploited for various monolingual natural language processing tasks, such as word sense disambiguation and paraphrase acquisition.
However, large-scale parallel corpora are currently not readily available for most language pairs. Even for those language pairs where some such corpora are available, the data in those corpora are usually restricted to government documents or newswire texts. Because these types of documents use particular writing styles and domain-specific language, such corpora cannot easily be used to train data-driven machine translation systems, information retrieval systems, or even the monolingual natural language processors discussed above, across a range of domains and language pairs.
There has recently been a sharp increase in the number of bilingual pages available on wide area networks (for example, on websites). Therefore, some web mining systems have been developed to automatically obtain parallel, bilingual corpora from the World Wide Web. These systems use uniform resource locators (URLs), and assume that parallel web pages are named with predefined patterns to facilitate website maintenance. Accordingly, when these systems are given a bilingual website URL, they use the predefined URL patterns in an attempt to discover candidate parallel documents within that website. Content-based features are then used to verify the translational equivalence of the candidate pairs.
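For the sake of illustration only, the URL pattern-based candidate discovery described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular mining system: the pattern list, the sample URLs, and the function name candidate_pairs are all hypothetical, and the content-based verification step is omitted.

```python
# Hypothetical predefined patterns: (source-language marker, target-language marker).
# Real systems would use their own, typically larger, pattern lists.
URL_PATTERNS = [
    ("/en/", "/zh/"),
    ("_en.", "_zh."),
    ("lang=en", "lang=zh"),
]

# Hypothetical URLs, standing in for the result of crawling one bilingual host.
SAMPLE_URLS = [
    "http://example.com/en/about.html",
    "http://example.com/zh/about.html",
    "http://example.com/en/news.html",
    "http://example.com/chinese/news.html",  # parallel page, but named off-pattern
    "http://example.com/products_en.html",
    "http://example.com/products_zh.html",
]


def candidate_pairs(urls, patterns=URL_PATTERNS):
    """Return (source_url, target_url) pairs whose URLs differ by a known pattern."""
    known = set(urls)
    pairs = []
    for url in urls:
        for src, tgt in patterns:
            if src not in url:
                continue
            # Rewrite the first occurrence of the source marker and check
            # whether the resulting URL was also collected from the host.
            guess = url.replace(src, tgt, 1)
            if guess != url and guess in known:
                pairs.append((url, guess))
    return pairs


if __name__ == "__main__":
    for source_url, target_url in candidate_pairs(SAMPLE_URLS):
        print(source_url, "<->", target_url)
```

In this sketch, the two news pages are never proposed as a candidate pair, because their names do not follow any of the assumed patterns.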
These types of systems have met with limited success. For instance, there is a wide diversity of web page styles and website maintenance mechanisms. As a result, bilingual websites often use varied naming schemes for parallel documents, which do not conform to predefined patterns. In particular, these systems cannot mine parallel documents located across websites (i.e., where the document in the source language and the document in the target language are located in different websites).
In addition, these URL pattern-based mining systems can be problematic with respect to bandwidth. These types of mining processes require a full host crawl to collect URLs before the predefined URL patterns can be used to discover possible parallel documents. Therefore, these URL pattern-based systems often require high bandwidth and incur high cost, and result in slow download speeds. Since many bilingual websites have only a very limited number of parallel documents, a significant portion of the network bandwidth is wasted on downloading web pages that do not have translational counterparts.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.