The present invention relates to automated language translation systems. More specifically, the present invention relates to a scalable machine translation system and architecture.
Machine translation systems receive a textual input in one language, translate it into a second language, and provide a textual output in the second language. Current commercially available machine translation systems rely on hand-coded transfer components that are difficult and expensive to customize for a particular domain and very difficult to scale to a desirable size. These disadvantages have limited their cost effectiveness and overall utility.
A variety of example-based machine translation systems have been created to address these deficiencies. A number of such systems are described in H. Somers, Review Article: Example-Based Machine Translation, Machine Translation 14:113-157, 1999. Some of these example-based machine translation research systems have used an example base built from up to approximately 200 sentences. They have encountered a great deal of difficulty in scaling to a larger example base, and system performance suffers as a result.
Others of the data-driven systems described in Somers parse the inputs from the example base using different parsers, depending on the language of the input text. The dependency structures resulting from such parsing therefore differ according to the language and the particular parsing strategy used. As a result, comparing the dependency structures from one language to the next is difficult, if not impossible.
Such prior systems have also not been easily scalable. For example, increasing the example base beyond approximately 200 sentences has been very difficult. This is because the prior systems have difficulty handling noisy input data. Instead, the input data has been required to be in a precise form, or it has been cleaned up and placed in the proper form by hand. This makes it very difficult to dramatically increase the number of sentences, because of the intensive labor required to clean up the data.