The exemplary embodiment relates to the extraction of bilingual terminology from parallel corpora for a translation tool. It finds particular application as a method for aligning parallel corpora for use in cross-language text retrieval, semi-automatic bilingual thesaurus enhancement, and statistical machine translation systems.
Specialized bilingual terminologies are invaluable to technical translators for ensuring correctness and consistency in large translation projects. Usually, terminologies are built by specialized professionals (terminologists) and by the translators themselves, through a manually intensive process. Tools for automating such a process, partially or completely, would thus be very useful. Several methods for extracting multilingual terminologies from parallel document collections have been proposed.
Parallel text discovery systems have been developed for identifying pairs of sentences or text fragments which are translations of one another, starting from collections of parallel or comparable documents. Parallel documents are those which are intended to be mutual, complete translations of each other. Comparable documents are documents which, while not being complete translations, have considerable overlap in their content.
Once the documents have been identified, they are aligned, first at the paragraph or section level and then at the sentence level. Automated systems have been developed for performing the alignment. Alignment of subsequences of words within the aligned sentences has proved more difficult, since phrases which are formed from a contiguous subsequence of words in one language may be non-contiguous in the other. For example, noun phrases (phrases which include one or more words grouped around a noun) are typically contiguous in English, while prepositional verbs can form discontinuous subsequences.
Probabilistic models, such as Hidden Markov Models (HMMs), have been developed for modeling systems in which some variables are hidden, but which are assumed to be statistically related to observed variables. In its standard first-order form, the HMM makes certain assumptions, including that the value of each hidden variable (state) depends only upon the value of the immediately preceding hidden variable, so that, given that preceding value, it is independent of all earlier hidden variables, and that the value of each observed variable depends only on the current value of the corresponding hidden variable. HMMs have been used to model alignment of sentences at the word level.
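By way of illustration only, the HMM assumptions described above can be sketched as a small Viterbi decoder that recovers the most probable sequence of hidden states for a sequence of observed words. In this sketch, which is not taken from any particular prior system, the hidden states are positions in an English sentence ("the red car"), the observations are the words of a French sentence ("la voiture rouge"), and the function name `viterbi` and all probability values are hypothetical toy assumptions chosen for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its log-probability."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Markov assumption: only the previous state matters
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            # Output assumption: the observation depends only on the current state
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy data (assumed values): hidden states are source positions 0="the", 1="red", 2="car"
states = [0, 1, 2]
target = ["la", "voiture", "rouge"]
start_p = {s: 1 / 3 for s in states}
trans_p = {r: {s: 1 / 3 for s in states} for r in states}  # uniform jumps
emit_p = {
    0: {"la": 0.8, "voiture": 0.1, "rouge": 0.1},
    1: {"la": 0.1, "voiture": 0.1, "rouge": 0.8},
    2: {"la": 0.1, "voiture": 0.8, "rouge": 0.1},
}

alignment, log_prob = viterbi(target, states, start_p, trans_p, emit_p)
```

With these toy values, `alignment` is `[0, 2, 1]`, that is, "la" is aligned with "the", "voiture" with "car", and "rouge" with "red". Practical word-alignment models typically replace the uniform transition table with one that depends on the distance between consecutive source positions.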
Finite-state models have been applied to many aspects of language processing, including parsing and machine translation. Finite-state models are attractive mechanisms for language processing since they provide an efficient data structure for representing weighted ambiguous hypotheses. They also facilitate the composition of models, which allows for straightforward integration of constraints.
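As a minimal sketch of these two properties, and assuming a simple dictionary-based representation that is introduced here purely for illustration, a weighted finite-state acceptor can hold several competing hypotheses in one structure, and a second acceptor encoding a constraint can be combined with it by a product construction. For acceptors, composition reduces to intersection. The class `WFSA`, the functions `intersect` and `paths`, and all weights below are hypothetical; weights are assumed to be probabilities that multiply along a path.

```python
from collections import defaultdict


class WFSA:
    """Weighted acceptor: arcs[state][symbol] is a list of (next_state, weight)."""

    def __init__(self, start, finals=()):
        self.start = start
        self.finals = set(finals)
        self.arcs = defaultdict(lambda: defaultdict(list))

    def add_arc(self, src, symbol, dst, weight):
        self.arcs[src][symbol].append((dst, weight))


def intersect(a, b):
    """Product construction over reachable state pairs; weights are multiplied."""
    start = (a.start, b.start)
    result = WFSA(start)
    stack, seen = [start], {start}
    while stack:
        p, q = stack.pop()
        if p in a.finals and q in b.finals:
            result.finals.add((p, q))
        for symbol, a_arcs in a.arcs[p].items():
            for p2, w1 in a_arcs:
                # Keep only arcs whose symbol both machines accept
                for q2, w2 in b.arcs[q].get(symbol, []):
                    result.add_arc((p, q), symbol, (p2, q2), w1 * w2)
                    if (p2, q2) not in seen:
                        seen.add((p2, q2))
                        stack.append((p2, q2))
    return result


def paths(machine, state=None, prefix=(), weight=1.0):
    """Yield (symbols, weight) for each accepted path; assumes no cycles."""
    state = machine.start if state is None else state
    if state in machine.finals:
        yield prefix, weight
    for symbol, arcs in list(machine.arcs[state].items()):
        for dst, w in arcs:
            yield from paths(machine, dst, prefix + (symbol,), weight * w)


# Two ambiguous translation hypotheses for "bank" followed by "centrale"
hypotheses = WFSA(start=0, finals=[2])
hypotheses.add_arc(0, "banque", 1, 0.6)
hypotheses.add_arc(0, "rive", 1, 0.4)
hypotheses.add_arc(1, "centrale", 2, 1.0)

# A constraint model (assumed weights) that prefers "banque" in this context
constraint = WFSA(start=0, finals=[2])
constraint.add_arc(0, "banque", 1, 0.9)
constraint.add_arc(0, "rive", 1, 0.1)
constraint.add_arc(1, "centrale", 2, 1.0)

combined = intersect(hypotheses, constraint)
scored = dict(paths(combined))
```

In this toy case, `scored` maps `("banque", "centrale")` to a weight of approximately 0.54 and `("rive", "centrale")` to approximately 0.04, so the constraint sharpens the preference already present in the hypothesis lattice without enumerating the hypotheses in advance.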