The term parallel bilingual corpora, as used herein, refers to textual data in a first language that is identified as a translation of textual data in a second language. For the sake of example, the textual data discussed herein are documents, but other textual data can be used as well.
When one document is a translation of another document, the two documents are referred to as parallel, bilingual documents. Therefore, a parallel, bilingual corpus refers to a corpus of data in a first language that is a translation of a corpus of data in a second language.
Within a set of parallel documents, sentences in those documents that are translations of one another are often identified. These are referred to as aligned sentences. Therefore, if a document in a first language coincides with a parallel document in a second language, and the sentences in the two documents are aligned with one another (in that a sentence in the first language is aligned with its translation in the second language), then the two documents are referred to as parallel, sentence-aligned, bilingual documents.
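By way of illustration only, the notion of parallel, sentence-aligned, bilingual documents can be sketched as a simple data structure. The following Python sketch is not part of the systems discussed herein; the class names AlignedSentence and ParallelDocument, the language codes, and the example sentences are all hypothetical and are chosen only to make the definition concrete.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlignedSentence:
    """One sentence in the first language paired with its translation."""
    source: str
    target: str


@dataclass
class ParallelDocument:
    """A pair of parallel documents, stored as a list of aligned sentences."""
    source_lang: str
    target_lang: str
    alignments: List[AlignedSentence] = field(default_factory=list)

    def add(self, source: str, target: str) -> None:
        # Record that `source` is aligned with its translation `target`.
        self.alignments.append(AlignedSentence(source, target))


# Hypothetical example: an English document aligned with a French one.
doc = ParallelDocument("en", "fr")
doc.add("Hello.", "Bonjour.")
doc.add("Thank you.", "Merci.")
```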
There is currently a widespread need for parallel, bilingual corpora. For instance, such corpora are often critical resources for training statistical machine translation systems, and for performing cross-lingual information retrieval. Additionally, some such corpora have even been exploited for various monolingual natural language processing tasks, such as word sense disambiguation and paraphrase acquisition.
However, large-scale parallel corpora are currently not readily available for most language pairs. Even for those language pairs where some such corpora are available, the data in those corpora are usually restricted to government documents or newswire texts. Because of the particular writing styles or domain-specific language used in these types of documents, these corpora cannot be easily used in training data-driven machine translation systems or information retrieval systems, or even the monolingual natural language processors discussed above, for a range of domains in different language pairs.
There has recently been a sharp increase in the number of bilingual pages available over wide area networks (for example, on websites). Therefore, some web mining systems have been developed to automatically obtain parallel, bilingual corpora from the World Wide Web. These systems use uniform resource locators (URLs), and assume that parallel web pages are named with predefined patterns to facilitate website maintenance. Therefore, when these systems are given bilingual website URLs, they use the predefined URL patterns in an attempt to discover candidate parallel documents within that website. Content-based features are then used to verify the translational equivalence of the candidate pairs.
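For the sake of illustration, the URL pattern-based discovery step described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation of any particular mining system: the pattern list URL_PATTERNS, the function name find_candidate_pairs, and the example.com URLs are all hypothetical. The content-based verification step is not shown.

```python
# Hypothetical pattern list: each entry maps a first-language marker in a URL
# to the corresponding second-language marker. Actual systems would use their
# own predefined patterns.
URL_PATTERNS = [("/en/", "/zh/"), ("_en.", "_zh.")]


def find_candidate_pairs(urls, patterns=URL_PATTERNS):
    """Return (first-language URL, second-language URL) candidate pairs.

    A pair is proposed when substituting a known pattern in one crawled URL
    yields another URL that was also crawled from the same site.
    """
    known = set(urls)
    pairs = []
    for url in urls:
        for src, tgt in patterns:
            if src in url:
                candidate = url.replace(src, tgt)
                if candidate in known:
                    pairs.append((url, candidate))
                    break  # one counterpart per URL is enough for this sketch
    return pairs


# Hypothetical URLs collected by crawling a bilingual website.
crawled = [
    "http://example.com/en/about.html",
    "http://example.com/zh/about.html",
    "http://example.com/news_en.html",
    "http://example.com/news_zh.html",
    # Named with a scheme that matches none of the predefined patterns,
    # so this pair is not proposed as a candidate.
    "http://example.com/products/english/list.html",
    "http://example.com/products/cn/list.html",
]
pairs = find_candidate_pairs(crawled)
```

In this sketch, the candidate pairs would then be passed to a content-based check before being accepted as parallel documents.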
These types of systems have met with limited success. For instance, there is a wide diversity of web page styles and website maintenance mechanisms. Therefore, bilingual websites often use varied naming schemes for parallel documents, which do not conform to predefined patterns.
In addition, these URL pattern-based mining systems can be problematic with respect to bandwidth. These types of mining processes require a full host crawl to collect URLs before using predefined URL patterns to discover possible parallel documents. Therefore, these URL pattern-based systems often require high bandwidth, incur high cost, and result in slow download speeds. Since even many bilingual websites have only a very limited number of parallel documents, a significant portion of the network bandwidth is wasted on downloading web pages that do not have translational counterparts.
In addition, due to the noisy nature of web documents, parallel web pages may include non-translational content and many out-of-vocabulary words. Both of these reduce sentence alignment accuracy, even after two parallel documents have been identified. Further, conventional sentence aligners operate only on conventional text, without considering other factors, such as layout similarity.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.