This specification relates to training statistical machine translation systems.
Nearest neighbor techniques attempt to identify points in one collection that are nearer to points in a second collection than can be accounted for by chance. For example, there can be two collections of points, X0 and X1, where the points of each collection are randomly distributed in a high-dimensional space. Each point of X0 is independent of X1, except that a point x0 in X0 is significantly closer to an unknown point x1 in X1 than would be expected by chance. Nearest neighbor techniques attempt to identify such paired points efficiently.
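The paired-point setting described above can be illustrated with a minimal sketch. It assumes Gaussian-distributed points and a single planted pair, and it uses an exhaustive comparison of every point in X0 against every point in X1, which is the inefficient baseline rather than any particular nearest neighbor technique. The function names and the parameters `n`, `dim`, `noise`, and `seed` are hypothetical and chosen only for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def closest_pair(X0, X1):
    """Exhaustively find (i, j, d2) minimizing the distance from X0[i] to X1[j]."""
    best = None
    for i, a in enumerate(X0):
        for j, b in enumerate(X1):
            d2 = squared_distance(a, b)
            if best is None or d2 < best[2]:
                best = (i, j, d2)
    return best


def make_collections(n=50, dim=64, noise=0.05, seed=0):
    """Build two random collections with one planted close pair (illustrative)."""
    rng = random.Random(seed)
    X0 = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    X1 = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    # Plant the pair: overwrite one point of X1 with a slightly perturbed
    # copy of one point of X0. All other points remain independent.
    i, j = rng.randrange(n), rng.randrange(n)
    X1[j] = [v + rng.gauss(0, noise) for v in X0[i]]
    return X0, X1, (i, j)


X0, X1, planted = make_collections()
i, j, d2 = closest_pair(X0, X1)
print("planted pair:", planted)
print("recovered pair:", (i, j), "squared distance:", d2)
```

The exhaustive search costs on the order of |X0| x |X1| distance computations, which is why more efficient techniques are of interest when the collections are large.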
Statistical machine translation systems are used to automatically translate text in a source language to corresponding text in a target language. In particular, statistical machine translation attempts to identify the most probable translation in a target language given a particular input in a source language. For example, when translating a sentence from Chinese to English, statistical machine translation identifies the most probable English sentence given the Chinese sentence. Statistical machine translation systems are trained on a corpus of parallel text. The parallel text includes text documents in one natural language and corresponding translated text documents in one or more other natural languages. Some machine translation systems use a corpus of documents that are already known to be parallel. For example, United Nations proceedings are available as parallel translations in six languages. However, collections of known parallel texts are limited.
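The "most probable translation" objective described above is commonly written as follows. This is a sketch of the standard noisy-channel formulation, not notation taken from this specification: f denotes the source sentence (for example, Chinese), e a candidate target sentence (for example, English), and \hat{e} the selected translation.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The denominator P(f) is dropped because it does not depend on e. In this formulation, P(f | e) is a translation model, typically estimated from parallel text, and P(e) is a language model of the target language.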