The exemplary embodiment relates to the development of parallel corpora for a translation tool. It finds particular application in conjunction with a method for identifying documents in different languages which can be used for training the translation tool.
In the field of data-driven machine translation, it is desirable to obtain as much parallel data as possible about the pair of languages for which the translation system is built. Mutual translations of source and target language texts and text fragments are used as data to feed a learning engine, which builds models that are then used by an actual translation tool. Parallel texts, i.e., texts and text fragments that are mutual translations of each other, are an important resource in these applications.
Statistical translation systems use parallel or comparable corpora in the source and target language for training the system. Parallel text discovery systems have been developed for identifying pairs of sentences or text fragments which are translations of one another starting from collections of parallel or comparable documents. Parallel documents are those which are intended to be mutual, complete translations of each other. Comparable documents are documents which, while not being complete translations, have considerable overlap in their content.
Examples of parallel documents are to be found in specific domains, such as product documentation, and in political and legal documents. The availability of such documents in these specific domains is largely due to the desirability for the documents to be accessible to people from different countries. Building parallel corpora (i.e., sets of matched pairs of parallel documents) in these domains is thus relatively easy. However, using such specialized corpora for training translation tools introduces a bias in the translation. As will be appreciated, the vocabulary in common usage among members of the European Commission, for example, may be inappropriate for training a translation tool designed for translation of children's fairy tales.
Accordingly, it is desirable to identify parallel or comparable documents from other domains to improve the statistical translation tool by reducing the bias introduced from one domain and/or making it more applicable to another domain. Two techniques, known as STRAND (Structural Translation Recognition for Acquiring Natural Data) and BITS (Bilingual Internet Text Search), have been used to find multilingual sites on the Web which offer the same content in different languages. These techniques typically search for URLs which are similar except for information relating to the language or country. For example, two URLs http://. . . /fr/doc.html and http://. . . /de/doc.html may be detected (where “ . . . ” matches and includes a reference to the document). These two URLs could be assumed to contain French and German versions of the same document, respectively. Where the name of the document in the two URLs is not exactly the same but is sufficiently close to raise an expectation that it is the same document in two languages, this may be confirmed by verifying that the associated images are the same or that the lengths of the two documents are approximately equal.
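By way of illustration only, the URL-matching idea described above may be sketched as follows. This is a minimal sketch and not the actual STRAND or BITS implementation: the function names, the short list of language codes, the example.com URLs, and the 20% length tolerance are all assumptions introduced for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Assumed, illustrative set of language markers; a real system would need a fuller list.
LANG_CODES = {"en", "fr", "de", "es", "it"}


def language_neutral_key(url):
    """Return (key, lang): the URL with its first language path segment
    replaced by a placeholder, plus the language found.
    Returns (None, None) when no language segment is present."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    for i, segment in enumerate(segments):
        if segment.lower() in LANG_CODES:
            neutral = segments[:i] + ["{lang}"] + segments[i + 1:]
            return parts.netloc.lower() + "/".join(neutral), segment.lower()
    return None, None


def candidate_pairs(urls):
    """Group URLs that differ only in their language segment and
    return every pair of URLs from the same group."""
    groups = defaultdict(dict)
    for url in urls:
        key, lang = language_neutral_key(url)
        if key is not None:
            # Keep the first URL seen for each language in a group.
            groups[key].setdefault(lang, url)
    pairs = []
    for by_lang in groups.values():
        langs = sorted(by_lang)
        for i, first in enumerate(langs):
            for second in langs[i + 1:]:
                pairs.append((by_lang[first], by_lang[second]))
    return pairs


def similar_length(len_a, len_b, tolerance=0.2):
    """Crude confirmation check: the two document lengths differ by no
    more than `tolerance` (an assumed value) of the longer one."""
    longest = max(len_a, len_b)
    return abs(len_a - len_b) <= tolerance * longest


if __name__ == "__main__":
    # Hypothetical URLs used only to exercise the sketch.
    urls = [
        "http://example.com/site/fr/doc.html",
        "http://example.com/site/de/doc.html",
        "http://example.com/site/en/other.html",
    ]
    for pair in candidate_pairs(urls):
        print(pair)
```

In this sketch, two URLs are paired only when they share the same host and the same path apart from the language segment, so documents whose names differ, or which are posted on different websites, are not paired.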
However, the names of the documents listed in the URLs often do not match sufficiently closely to be identified as being parallel. Further, the URL may lack an easily recognized reference to the document's language. There exist a large number of parallel documents on the Web which are not easily identified as such because the two documents are posted by entirely different websites.
While parallel documents may be identified by following all references in a website in order to find a document in another language and then verifying whether it is a translation of the initial one, such a process is computationally expensive because following links in a blind manner can lead to a huge space of document pairs to consider. Further, documents written in different languages and hosted by different websites that are not explicitly related generally escape detection.
A need exists for an automated method for readily identifying parallel documents on the web which may be used for enriching a translation tool.