Identification of multilingual terminology can be seen as a process whereby a unit of text U1 (a word or sequence of words) in a source text T1 is put in correspondence with a related unit U2 in a target text T2 that is the translation of T1, such that U2 is the translation of U1. In the past, this process was a manual operation performed by human terminologists in order to build terminology databases. The automation of such a process is commonly referred to as alignment.
Alignment is usually performed through statistical methods. The article by Brown et al., "Aligning sentences in parallel corpora", Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, Calif., June 1991, discloses a method wherein association scores are computed between text units in different languages, and the optimal combination of multilingual text units is then selected on the basis of these scores.
The drawback of such methods is that they generate noise and silence. Noise refers to multilingual associations that are found but are either wrong or not relevant, such as (dog, aboyer), where "aboyer" (to bark) is indeed related to dogs but is not a translation of the word "dog". Silence refers to relevant multilingual associations that are present in the text but are not found.
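The general idea of scoring associations between units in two languages can be illustrated with a minimal sketch. It is not the method of Brown et al.; it assumes, purely for illustration, a Dice coefficient computed over co-occurrences in sentence pairs that are already aligned. The names PAIRS, dice_scores and best_translation, as well as the toy corpus, are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus: (source sentence, target sentence) pairs
# assumed to be aligned already.
PAIRS = [
    ("the dog barks", "le chien aboie"),
    ("the dog sleeps", "le chien dort"),
    ("the cat sleeps", "le chat dort"),
]

def dice_scores(aligned_pairs):
    """Return a Dice association score for every co-occurring word pair."""
    src_count, tgt_count, co_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        # Count each word at most once per sentence pair.
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        co_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in co_count.items()
    }

def best_translation(scores, source_word):
    """Pick the target word with the highest score for a source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == source_word}
    return max(candidates, key=candidates.get)

scores = dice_scores(PAIRS)
print(best_translation(scores, "dog"))
print(round(scores[("dog", "aboie")], 3))
```

On this toy corpus the pair (dog, chien) receives the highest score, but the related, non-translation pair (dog, aboie) still scores about 0.67. An acceptance threshold set low enough to admit it produces noise, while a threshold set too high discards valid pairs and produces silence.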
Furthermore, alignment can be performed at different levels of the text, depending on the size of the text units that are to be aligned: for example, at the level of files, paragraphs, sentences, phrases, multiword terms or even single words.
Known systems that perform alignment of words or multiword terms generally rely upon the existence of texts that are already aligned at sentence level.
UK Patent Application 2,279,164 discloses a system for processing a bilingual database wherein aligned corpora (i.e. collections of texts) are generated or received from an external source. Each corpus comprises a set of text portions aligned with corresponding portions of the other corpus so that aligned portions are nominally translations of one another in two natural languages. A statistical database is compiled. An evaluation module calculates correlation scores for pairs of words chosen one from each corpus. Given a pair of text portions (one in each language) the evaluation module combines word pair correlation scores to obtain an alignment score for the text portions. These alignment scores can be used to verify a translation and/or to modify the aligned corpora to remove improbable alignments. The invention employs statistical techniques, and in particular embodiments allows a probability-based score to be derived to measure the correlation of bilingual word pairs.
However, this technique is limited to the alignment of single words, one word in the source language with one word in the target language. It also suffers from the aforementioned problems of noise and silence related to the use of certain statistical scores.
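The evaluation step described for UK Patent Application 2,279,164, in which word-pair correlation scores are combined into an alignment score for a pair of text portions, can be sketched as follows. The combination rule used here (averaging, over the source words, the best correlation with any target word) and the threshold filter are assumptions made for illustration; the cited application is not stated to use this particular rule. All names and score values are hypothetical.

```python
# Hypothetical word-pair correlation scores (source word, target word).
WORD_SCORES = {
    ("the", "le"): 0.7,
    ("dog", "chien"): 0.9,
    ("sleeps", "dort"): 0.8,
}

# Hypothetical aligned portions; the second pair is a doubtful alignment.
ALIGNED = [
    ("the dog sleeps", "le chien dort"),
    ("the dog sleeps", "le chat mange"),
]

def alignment_score(src_portion, tgt_portion, word_scores):
    """Average, over source words, of the best score with any target word."""
    src_words = src_portion.split()
    tgt_words = tgt_portion.split()
    if not src_words or not tgt_words:
        return 0.0
    best = [
        max(word_scores.get((s, t), 0.0) for t in tgt_words)
        for s in src_words
    ]
    return sum(best) / len(best)

def filter_alignments(aligned, word_scores, threshold):
    """Keep only the portion pairs whose alignment score reaches the threshold."""
    return [
        (s, t) for s, t in aligned
        if alignment_score(s, t, word_scores) >= threshold
    ]

kept = filter_alignments(ALIGNED, WORD_SCORES, threshold=0.5)
print(kept)
```

With these assumed scores, the first pair obtains an alignment score of 0.8 and is kept, while the second obtains about 0.23 and is removed as an improbable alignment. The sketch also makes the stated limitation visible: only one-word-to-one-word correlations enter the score.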
Different methods have been proposed for alignment at the level of multiword terms. Gaussier et al., in "Some methods for the extraction of bilingual terminology", Proceedings of New Methods in Language Processing, Manchester, 1994, describe several alignment methods based on a monolingual identification of the multiword terms (e.g. by identifying words that have a high likelihood of being associated together), followed by the identification of bilingual correspondences between these multiword terms through statistical scores. However, use of these methods is limited to terms composed of exactly two words in the source and target languages.
Some systems eliminating the aforementioned limitation use simple grammars in order to identify multiword terms in each language. For example, the paper of Gaussier et al. (1994) describes a system using linguistic patterns such as "adjective+noun" or "noun+preposition+noun" that characterize the structure of nominal terms in English and French.
While such systems address the previous limitation, they are not fully efficient and generate noise, because only a small portion of the noun phrases thus identified turn out to be terms, i.e. units which express a concept of the domain. For example, the expression "following page" could be extracted as a term by an "adjective+noun" grammar, although it is clearly a generic phrase that pervades any technical text.
Furthermore, some silence is also generated, since the linguistic patterns cover only a limited number of expressions and ignore certain structures that can yield terms. This occurs either because the terms are nonstandard word combinations (such as "antenne parabolique de reception" in French, where the adjective "parabolique" masks the underlying noun+preposition+noun structure "antenne de reception") or because the grammar fails to identify the part of speech of certain words owing to their ambiguity (for example, "microphone gain" could be missed should the grammar consider "gain" to be a verb instead of a noun).
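The behaviour of such pattern-based grammars can be illustrated with a short sketch. It assumes that the input has already been tagged with parts of speech; the tag names, the hand-tagged examples and the function extract_candidates are hypothetical and are not taken from the cited systems. Only the two patterns mentioned above are used.

```python
# The two patterns named in the text, written over hypothetical tag names.
PATTERNS = [
    ("ADJ", "NOUN"),
    ("NOUN", "PREP", "NOUN"),
]

def extract_candidates(tagged_tokens, patterns=PATTERNS):
    """Return every word sequence whose tag sequence matches a pattern."""
    tags = [tag for _, tag in tagged_tokens]
    found = []
    for pattern in patterns:
        n = len(pattern)
        # Slide a window of the pattern's length over the tag sequence.
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                found.append(" ".join(w for w, _ in tagged_tokens[i:i + n]))
    return found

# Hand-tagged examples (assumed tagging, for illustration only).
english = [("see", "VERB"), ("the", "DET"), ("following", "ADJ"), ("page", "NOUN")]
french = [("antenne", "NOUN"), ("parabolique", "ADJ"), ("de", "PREP"), ("reception", "NOUN")]

print(extract_candidates(english))  # noise: a generic phrase is extracted
print(extract_candidates(french))   # silence: the intervening adjective blocks the match
```

Under these assumptions, "following page" is returned as a candidate although it is not a term, whereas "antenne parabolique de reception" yields no candidate at all, because its tag sequence matches neither pattern.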
Finally, in addition to the problems cited for each method, none of the previous systems allows for the extraction of one-to-many term alignments, in which a single word in one language corresponds to a multiword term in the other, such as the English term "baseband" corresponding to the French term "bande de base".
Accordingly, it would be desirable to provide a new system for automatically extracting multilingual terminology which eliminates the aforementioned problems.