1. Field of the Invention
The invention generally relates to a method and apparatus for generating translations of natural language terms, and in particular to a corpus-based technique for translating unknown terminology in a specific domain.
2. Description of the Related Art
Presently, there are several important applications of terminology translation. The most common applications are Cross-Language Text Retrieval (CLIR), semi-automatic bilingual thesaurus enhancement, and machine-aided human translation.
In cross-language text retrieval, the goal is to retrieve documents in response to a query written in a language different from that of the documents. The standard approach to CLIR is to translate the query into all possible target languages and then apply standard monolingual retrieval techniques. The most significant problem in CLIR is out-of-vocabulary terms, i.e., terms that do not appear in existing bilingual resources.
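The query-translation step and the out-of-vocabulary problem described above can be illustrated with a minimal Python sketch. The function name `translate_query` and the toy German-to-English dictionary are hypothetical and serve only as an illustration; they are not part of any particular CLIR system.

```python
def translate_query(query_terms, bilingual_dict):
    """Translate query terms with a bilingual dictionary.

    Returns the list of target-language terms and the list of
    out-of-vocabulary source terms that could not be translated.
    """
    translated, oov = [], []
    for term in query_terms:
        if term in bilingual_dict:
            # Keep every known translation of the term.
            translated.extend(bilingual_dict[term])
        else:
            # The term is missing from the bilingual resource.
            oov.append(term)
    return translated, oov


# Toy dictionary (an assumption for illustration only).
toy_dict = {"haus": ["house"], "katze": ["cat"]}
translated, oov = translate_query(["katze", "quantencomputer"], toy_dict)
```

In this sketch, any term returned in `oov` contributes nothing to the translated query, which is why incomplete bilingual resources limit retrieval quality.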
Semi-automatic bilingual thesaurus enhancement is needed because complete bilingual thesauri and terminology dictionaries do not exist in practice since new terms and new variants are always being created. Thus, it is a significant problem that bilingual thesauri are usually rather incomplete and not up to date. The same problem arises in performing machine-aided human translation.
Technical terminology is one of the most difficult challenges in translation. The ideal approach to translating such terms is to read extensively in the source language to understand what the new term means, and then to read extensively in similar material in the target language to discover the most appropriate translation. This is an extremely time-consuming process.
Therefore, several techniques have been developed for extracting translation equivalents from comparable corpora. Comparable corpora are sets of documents in different languages that come from the same domain and have similar genre and content.
All of these techniques represent words by term co-occurrence profiles. Term co-occurrence profiles have been used for monolingual applications, such as word sense disambiguation, and the use of term profiles generated by shallow parsers has also been explored for monolingual applications.
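As an illustration of what a term co-occurrence profile can look like, the following Python sketch counts, for each term, the terms that appear within a fixed window around it. The window-based definition, the function name `cooccurrence_profiles`, and the toy corpus are assumptions made for this example; the techniques referred to above may define co-occurrence differently, for example through syntactic relations produced by a shallow parser.

```python
from collections import Counter, defaultdict


def cooccurrence_profiles(sentences, window=2):
    """Map each term to a Counter of terms seen within `window` tokens of it."""
    profiles = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a term is not counted as its own context
                    profiles[term][tokens[j]] += 1
    return profiles


# Toy tokenized corpus (an assumption for illustration only).
corpus = [
    ["the", "patient", "received", "insulin", "therapy"],
    ["insulin", "therapy", "lowers", "blood", "glucose"],
]
profiles = cooccurrence_profiles(corpus, window=2)
```

Here `profiles["insulin"]` records which terms occur near "insulin" and how often, giving a vector-like representation of the term.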
Comparable corpora are an important resource for cross-language text retrieval, and a number of methods for defining the similarity between terms in different languages have been developed, including similarity thesauri, latent semantic indexing, and probabilistic translation models. However, these approaches are all based on comparable corpora which are aligned at the document, paragraph, or sentence level. Aligned documents are translations of one another or are closely linked in some other way.