In the field of data-driven machine translation, it is desirable to obtain as much parallel data as possible for the language pair for which the translation system is built. Parallel texts, i.e., texts and text fragments that are mutual translations of each other, are fed to a learning engine, which builds models that are then used by the actual translation engine. Parallel texts are therefore an important resource in these applications.
Unfortunately, parallel texts are a scarce resource. Parallel texts are often limited in size, coverage and language. The parallel texts that do exist are usually from one domain, which may be problematic because certain machine translation systems trained in a first domain will not perform well in a second, different domain.
Certain textual resources which are not parallel may still be related in that they contain information about the same subject. Examples of such resources include the multilingual newsfeeds produced by news agencies such as Agence France Presse, Xinhua News, and others. The same or similar news stories are often found in different languages. Therefore, while the texts may not be parallel (an Al Jazeera story about President Bush's visit to Europe may be written independently from a CNN story about the same visit), much information can be obtained from these comparable stories that can be useful in the context of developing translation systems.
A parallel text discovery system attempts to discover pairs of sentences or segments which are translations of one another, starting from collections of non-parallel documents. Previous research efforts have attempted to discover parallel sentences in parallel text. These techniques assume the parallel texts to be mutual, complete translations of each other and attempt to align all the sentences in these related texts.
Zhou et al., “Adaptive parallel sentences mining from Web bilingual news collection,” 2002 IEEE International Conference on Data Mining, use a generative model for discovering parallel Chinese and English sentences. This and other comparable systems define a sentence alignment score and use dynamic programming to find the best sentence alignment between a pair of documents. Performance depends heavily on the degree to which the input consists of true parallel documents. If the method is applied, for example, to two documents of 20 sentences each that share only one sentence or sentence fragment, the technique is unlikely to obtain useful information from the sentences that convey the same meaning.
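The score-plus-dynamic-programming approach described above can be illustrated with a minimal sketch. It is not the method of Zhou et al.: the scoring function `pair_score` (a crude length-ratio similarity), the skip penalty `SKIP`, and the restriction to one-to-one alignments with skips are all assumptions chosen to keep the example short.

```python
from typing import List, Tuple

SKIP = -0.5  # hypothetical penalty for leaving a sentence unaligned


def pair_score(src: str, tgt: str) -> float:
    """Toy alignment score in [0, 1]: shorter length over longer length (an assumption)."""
    a, b = len(src), len(tgt)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def align(src_sents: List[str], tgt_sents: List[str]) -> List[Tuple[int, int]]:
    """Return the monotone 1-1 alignment with the highest total score."""
    n, m = len(src_sents), len(tgt_sents)
    # best[i][j]: best total score for the first i source and first j target sentences
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + SKIP
        back[i][0] = "skip_src"
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + SKIP
        back[0][j] = "skip_tgt"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (best[i - 1][j - 1] + pair_score(src_sents[i - 1], tgt_sents[j - 1]), "match"),
                (best[i - 1][j] + SKIP, "skip_src"),
                (best[i][j - 1] + SKIP, "skip_tgt"),
            ]
            best[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Follow the back-pointers from the end of both documents to recover the pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs
```

Because the search optimizes a single global path through both documents, a sketch of this kind works best when most sentences have a counterpart. On two largely unrelated documents, the path is dominated by skips or by spurious pairings of sentences that merely score well under the toy measure, which is the limitation noted above.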