Machine translation is a technology that uses a computer to translate written or spoken text in one language into written or spoken text in another language. That is, drawing on linguistic theory of language form and structural analysis, relying on a mathematically established machine lexicon and machine grammar, and exploiting the large storage capacity and data processing ability of computers, translation from one language into one or more other languages is accomplished automatically, without human intervention. Machine translation is a frontier applied science that draws on many branches of learning, such as linguistics, computational linguistics and computer science. To perform translation, a machine translation system must be capable of word analysis, sentence analysis, grammar analysis and word meaning analysis, must have a dictionary lexicon and a collocation lexicon, and must be able to generate output in the target language. Machine translation systems include several types, such as conversion type, knowledge and word meaning type, but in practice the functions and properties of those types are used in combination.
Machine translation systems currently in use can provide sentence-level translation. For a given article, some systems can select the proper meaning of a word by statically analyzing its context.
With the increasing popularity of the Internet and the World Wide Web (Web), machine translation systems that rely only on static analysis of the context can no longer satisfy the need to select the proper meaning of words on the Web. For example, when a user visits Internet sites with a web browser, he or she reads web pages that are written in HTML (HyperText Markup Language) and may include other files such as GIF, JPEG or the like. The web pages also often contain many hyperlinks, which are objects that connect the page to other pages. Thus, when translation systems translate Web pages, they should not be limited to static analysis within the context of the page. When a word has more than one meaning, the word is ambiguous. The process of determining which of the multiple meanings of an ambiguous word applies is disambiguation. When it is impossible to determine the appropriate meaning of a word on the basis of the context, the appropriate meaning can be selected by dynamically analyzing the associated hyperlinked information. For example, a news web page contains some titles that have hyperlinks. One such hyperlink is “Clinton wins senate support as Kosovo strikes near”. Here, we assume that the source language is English and that the target language is Chinese. For the translation system, it is very difficult to determine which of the several possible Chinese meanings of, for example, the word “strike” is the correct one, “”. It may be “”, “”, or “”. If no other information is available, “strike”, as a noun, is usually translated as “”. The text that is linked to is:
“President Clinton sought and won support from Congress for Military action against Yugoslavia just hours after NATO ordered air strikes that could begin as early as Wednesday”.
The above text contains the multi-word “air strike”, which has only one meaning in Chinese: “”. The meaning of “strike” within the multi-word “air strike” is “”. Thus, from the meaning of the multi-word “air strike”, the meaning of “strike” in the title can be determined. In most cases, a word in the context of one topic has only one meaning. Our invention is based on this assumption.
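The reasoning in this example can be illustrated with a minimal Python sketch. It is not the implementation described here, only one possible reading of the idea under stated assumptions: the lexicons WORD_SENSES and MULTIWORD_SENSES, the English sense labels that stand in for the Chinese equivalents, the crude plural stripping in tokenize, and all function names are hypothetical. The sketch also omits the static context analysis that would normally be tried first.

```python
import re

# Hypothetical single-word lexicon: word -> candidate senses, default first.
# English labels stand in for the target-language (Chinese) equivalents.
WORD_SENSES = {
    "strike": ["work_stoppage", "military_attack", "blow"],
}

# Hypothetical multi-word lexicon: phrase -> sense each component takes in it.
MULTIWORD_SENSES = {
    ("air", "strike"): {"strike": "military_attack"},
    ("general", "strike"): {"strike": "work_stoppage"},
}


def tokenize(text):
    """Lowercase, keep alphabetic runs, strip a trailing plural 's' (crude)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def sense_from_linked_text(word, linked_text):
    """Return the sense implied by an unambiguous multi-word in the linked text."""
    tokens = tokenize(linked_text)
    for phrase, component_senses in MULTIWORD_SENSES.items():
        if word not in component_senses:
            continue
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                return component_senses[word]
    return None


def disambiguate(word, linked_text=None):
    """Pick a sense for `word`, consulting the hyperlinked text if available."""
    senses = WORD_SENSES.get(word, [])
    if len(senses) <= 1:
        return senses[0] if senses else None
    if linked_text:
        sense = sense_from_linked_text(word, linked_text)
        if sense is not None:
            return sense
    # No evidence found: fall back to the default (first) sense.
    return senses[0]


LINKED_TEXT = (
    "President Clinton sought and won support from Congress for Military "
    "action against Yugoslavia just hours after NATO ordered air strikes "
    "that could begin as early as Wednesday"
)

print(disambiguate("strike"))               # default sense, no linked text
print(disambiguate("strike", LINKED_TEXT))  # sense taken from "air strike"
```

Without the linked text, the sketch falls back to the default sense of “strike”. With the linked text, the multi-word “air strike” fixes the sense, which relies on the assumption that a word has only one meaning within one topic.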
Existing machine translation systems can provide sentence-level translation. When determining the meaning of a word, however, they select the proper meaning only by statically analyzing the context of the sentence, and they cannot improve the accuracy of the translation by dynamically analyzing related text. For Internet users, such existing machine translation systems are not sufficient. In the above example, a user is interested in the Chinese translation “” but not “”. If the machine translation system gives a translation of the title with the topic “”, the user may not read further and may thus miss the details about the “air strike”.