1. Field of the Invention
The present invention relates generally to statistical machine translation of multilingual documents and more specifically to systems and methods for identifying parallel segments in multilingual document collections.
2. Description of the Related Art
In the field of statistical machine translation, large collections of training data are required to develop and implement systems and methods for translating documents. Training data comprises parallel segments, which are documents or fragments that are literal, or parallel, translations of each other in two languages. Currently, there is a lack of sufficiently large parallel corpora for most language pairs. A language pair refers to the two languages used within the parallel corpora. Examples of language pairs include English-Romanian and English-Arabic.
Large volumes of material in many languages are produced daily, and in some instances, this material may comprise translational equivalents. For example, a news story posted on the World Wide Web (WWW) on an English-language website may be a translation of the same story posted on a Romanian-language website. The ability to identify these translations is important for generating large collections of parallel training data.
However, news web pages published on a news website typically have the same structure. As such, structural properties, such as HTML structures, cannot be used to identify parallel documents. Further, because websites in different languages are often organized differently and a connection is not always maintained between translated versions of the same story, URLs of articles may be unreliable indicators of parallel documents. In addition, a news website may contain comparable segments of text that relate to the same news story, but the comparable segments or articles should not necessarily be identified as parallel documents. Comparable segments may be referred to as “noisy translations” of the sentences.
However, these comparable segments may include one or more parallel fragments that can be added to the training data even though the entire segment is not a parallel translation of a comparable segment. For example, a quote within a news article may be translated literally even though the rest of the document is merely related to a comparable segment in another language.
Current methods perform computations at a word level and do not distinguish parallel translations of documents from comparable documents. As such, these methods result in many false positives where a comparable document may be erroneously classified as a parallel translation.