Significant progress has been made in the area of statistical machine translation (SMT) (Brown et al. 1993, Chiang 2005, Koehn et al. 2003), but one bottleneck in building a machine translation system of commercial quality is obtaining enough training data. It is well known that, with statistical methods, more training data generally yields better quality. With more training data, better word alignments can be achieved, and good word alignment quality results in good translation quality because the various models in an SMT system, such as the lexicon model, fertility model, distortion model and phrase table, depend mainly on the quality of word alignment. However, it is not practical to collect millions of parallel sentences for rapid development of SMT systems.
This motivates the present invention, which provides methods for improving word alignments, and accordingly translation quality, with limited training data by employing manual alignment at the phrase level, extracting alignment patterns, and learning word alignment models from a small amount of manually tagged data.