The invention relates to automated natural language translation in which a source document having annotations is translated automatically into another language while preserving the annotations in the translation. For example, an HTML document in English can be automatically translated into an equivalent Japanese language HTML document to allow a World Wide Web page to be viewed in Japanese while preserving the formatting and hyperlinks present in the original English language version of the page.
Various schemes for the machine-based translation of natural language have been proposed. Typically, the system used for translation includes a computer which receives input in one language and performs operations on the received input to supply output in another language. This type of translation has been an inexact one, and the resulting output can require significant editing by a skilled operator. The translation operation performed by known systems generally includes a structural conversion operation. The objective of structural conversion is to transform a given parse tree (i.e., a syntactic structure tree) of the source language sentence to the corresponding tree in the target language. Two types of structural conversion have been tried, grammar-rule-based and template-to-template.
In grammar-rule-based structural conversion, the domain of structural conversion is limited to the domain of grammar rules that have been used to obtain the source-language parse tree (i.e., to a set of subnodes that are immediate daughters of a given node). For example, given
VP = VT01 + NP (a VerbPhrase consists of a SingleObject Transitive Verb and a NounPhrase, in that order)
and
Japanese: 1+2 => 2+1 (Reverse the order of VT01 and NP),
each source-language parse tree that involves application of the rule is structurally converted in such a way that the order of the verb and the object is reversed, because the verb appears to the right of its object in Japanese. This method is very efficient in that it is easy to determine where the specified conversion applies; it applies exactly at the location where the rule has been used to obtain the source-language parse tree. On the other hand, it can be a weak conversion mechanism in that its domain, as specified above, may be extremely limited, and in that natural language may require conversion rules that straddle nodes that are not siblings.
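The reordering described above can be illustrated with a minimal Python sketch. The tree representation (nested `(label, children)` tuples), the `REORDER_RULES` table, and the function names are assumptions made for illustration only; they are not taken from any known system. The point shown is that the conversion is looked up by the grammar rule that produced a node and touches only that node's immediate daughters.

```python
# Hypothetical representation: a node is (label, children) and a leaf is
# (label, word). None of these names come from the source text.
REORDER_RULES = {
    # (parent label, daughter labels) -> new order of the daughters
    ("VP", ("VT01", "NP")): (1, 0),   # Japanese: 1+2 => 2+1
}

def convert(node):
    """Apply rule-based reordering at every node where a rule was used."""
    label, body = node
    if isinstance(body, str):          # leaf: nothing to reorder
        return node
    children = [convert(child) for child in body]
    key = (label, tuple(child[0] for child in children))
    order = REORDER_RULES.get(key)
    if order is not None:
        children = [children[i] for i in order]
    return (label, children)

def leaves(node):
    """Collect the words of a tree from left to right."""
    label, body = node
    if isinstance(body, str):
        return [body]
    words = []
    for child in body:
        words.extend(leaves(child))
    return words

tree = ("S", [("NP", "John"),
              ("VP", [("VT01", "reads"), ("NP", "books")])])
print(leaves(convert(tree)))   # ['John', 'books', 'reads']
```

Because the lookup key is built only from a node and its daughters, a rule of this kind cannot express a change that spans nodes at different levels, which is the limitation noted above.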
In template-to-template structural conversion, structural conversion is specified in terms of input/output (I/O) templates or subtrees. If a given input template matches a given structure tree, that portion of the structure tree that is matched by the template is changed as specified by the corresponding output template. This is a very powerful conversion mechanism, but it can be costly in that it can take a long period of time to find out if a given input template matches any portion of a given structure tree.
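A minimal Python sketch of template matching follows. The pattern syntax (strings beginning with `?` as variables), the tree representation, and the function names are hypothetical and chosen only for illustration. The sketch shows why the approach is more expressive (a template may span several levels of the tree) and why it can be slow (every template must be tried against every node).

```python
# Hypothetical representation: a node is (label, children) and a leaf is
# (label, word). A pattern is either a variable such as "?x" or a
# (label, [sub-patterns]) tuple.
def match(pattern, node, bindings):
    """Try to match a pattern at this node, recording variable bindings."""
    if isinstance(pattern, str):            # a variable matches any subtree
        bindings[pattern] = node
        return True
    p_label, p_children = pattern
    label, body = node
    if p_label != label or isinstance(body, str) or len(p_children) != len(body):
        return False
    return all(match(p, c, bindings) for p, c in zip(p_children, body))

def build(template, bindings):
    """Instantiate an output template from the recorded bindings."""
    if isinstance(template, str):
        return bindings[template]
    label, children = template
    return (label, [build(child, bindings) for child in children])

def rewrite(node, in_template, out_template):
    """Visit every node bottom-up and rewrite wherever the input matches."""
    label, body = node
    if not isinstance(body, str):
        node = (label, [rewrite(c, in_template, out_template) for c in body])
    bindings = {}
    if match(in_template, node, bindings):
        return build(out_template, bindings)
    return node

# This template spans two levels of the tree (S and its VP daughter).
IN = ("S", ["?subj", ("VP", ["?verb", "?obj"])])
OUT = ("S", ["?subj", ("VP", ["?obj", "?verb"])])

tree = ("S", [("NP", "John"),
              ("VP", [("VT01", "reads"), ("NP", "books")])])
converted = rewrite(tree, IN, OUT)
```

In this sketch `rewrite` attempts a match at every node for a single template; with many templates the number of attempts grows with the product of template count and tree size, which is the cost referred to above.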
Conventional systems translate annotations in text, such as part-of-speech settings, i.e. <VERB>, <NOUN>, Hypertext Markup Language (HTML) and Standard Generalized Markup Language (SGML). Such systems, however, often do a poor job of preserving, in the translated version of the text, the original intent, meaning, and look of the annotations in the original document. In one such system, HTML and SGML markup is placed in a translated version of the text adjacent to the translated word that corresponds to the word in the original text to which it was adjacent. This manner of insertion often results in inaccuracies in the translated version of the text due to markup that does not properly apply to words in the translated text to which it is adjacent, or due to markup that should not have been carried through to the translated version of the text.
It is therefore an object of the present invention to provide a system and method for translating a source document in a first language to a target document in a second language while preserving the annotations that exist in the source document, and inserting the annotations in appropriate locations in the target document.
The automated natural language translation system according to the invention has many advantages over known machine-based translators. After the system of the invention automatically selects the best possible translation of the input textual information and provides the user with an output (preferably a Japanese language or Spanish language translation of English-language input text), the user can then interface with the system to edit the displayed translation or to obtain alternative translations in an automated fashion. An operator of the automated natural language translation system of the invention can be more productive because the system allows the operator to retain just the portion of the translation that he or she deems acceptable while causing the remaining portion to be retranslated automatically. Since this selective retranslation operation is precisely directed at portions that require retranslation, operators are saved the time and tedium of considering potentially large numbers of incorrect, but highly ranked translations. Furthermore, because the system allows for arbitrary granularity in translation adjustments, more of the final structure of the translation will usually have been generated by the system. The system thus reduces the potential for human (operator) error and saves time in edits that may involve structural, accord, and tense changes. The system efficiently gives operators the full benefit of its extensive and reliable knowledge of grammar and spelling.
The automated natural language translation system's versatile handling of ambiguous sentence boundaries in the source language, and its powerful semantic propagation, provide further accuracy and reduced operator editing of translations. Stored statistical information also improves the accuracy of translations by tailoring the preferred translation to the specific user site. The system's idiom handling method is advantageous in that it allows sentences that happen to include the sequence of words making up the idiom, without intending the meaning of the idiom, to be correctly translated. The system is efficient but still has versatile functions such as long distance feature matching. The system's structural balance expert and coordinate structure expert effectively distinguish between intended parses and unintended parses. A capitalization expert effectively obtains correct interpretations of capitalized words in sentences, and a capitalized sequence procedure effectively deals with multiple-word proper names, without completely ignoring common noun interpretations.
The present invention is directed to an improvement of the automated natural language translation system, wherein the improvement relates to translating input textual information having annotations and being in a source or first natural language, such as English, into output textual information with the annotations preserved and being in a target or second natural language, such as Japanese or Spanish. The annotations in the source document can represent part-of-speech settings, Hypertext Markup Language ("HTML") markup, Standard Generalized Markup Language ("SGML") markup, Rich Text Format ("RTF") markup and Nontypesetting Runoff ("NROFF") markup. In the present invention, annotations can be removed prior to translation, stored in an annotations database and inserted by the system at appropriate locations in the translated version of the source text. The system of the present invention employs a novel process involving creating a token string which includes word tokens representing the text, annotation tokens representing the annotations and ending tokens representing sentence breaks and sentence endings in the source document. As the word tokens are transformed and the annotation tokens are processed or otherwise removed during translation, the ending tokens are the only tokens that remain intact in the token string as the token string passes through the translator. As such, the ending tokens are used by the system to provide information relating to the original word tokens and annotation tokens as they appeared in the source document in the first language. Annotation tokens are stored in a document state database and linked with all other tokens in the document such that the annotations for any word token in the document can be determined. In this manner, the annotations are inserted at appropriate locations in the translated target document.
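As a rough illustration of the token string described above, the following Python sketch splits a small annotated input into word tokens, annotation tokens, and ending tokens, and then removes the annotation tokens while remembering where they were. The `Token` class, the regular expression, the `EOS` text, and the function names are all assumptions introduced here; the actual tokenizer of the invention is not specified in this passage.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    kind: str   # "word", "annotation", or "end" (hypothetical labels)
    text: str

# Assumed tokenizer: a tag, a run of ordinary characters, or one
# sentence-final punctuation mark.
TOKEN_RE = re.compile(r"<[^>]+>|[^\s<>.!?]+|[.!?]")

def tokenize(source):
    """Build a token string in order of appearance in the source."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        piece = m.group()
        if piece.startswith("<"):
            tokens.append(Token("annotation", piece))
        elif piece in {".", "!", "?"}:
            tokens.append(Token("word", piece))
            tokens.append(Token("end", "EOS"))   # marks the sentence ending
        else:
            tokens.append(Token("word", piece))
    return tokens

def strip_annotations(tokens):
    """Remove annotation tokens, storing each with its position."""
    kept, store = [], []
    for tok in tokens:
        if tok.kind == "annotation":
            # position is the index in the stripped token string
            store.append((len(kept), tok.text))
        else:
            kept.append(tok)
    return kept, store

tokens = tokenize("See <b>this</b> page.")
kept, store = strip_annotations(tokens)
```

Here `store` stands in for the annotations database: the stripped token string can be handed to a translator, and the stored positions later indicate which word tokens each annotation surrounded.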
In one aspect, the system receives a source document in a first language comprising a plurality of sentences having text and annotations, and creates a first token string comprising a plurality of first language tokens and a plurality of annotation tokens disposed in the order of appearance in the source document. Additionally inserted into the token string are a plurality of end-of-sentence tokens to represent sentence endings in the source document. In one aspect of the invention, prior to translation, the plurality of annotation tokens are removed from the token string, stored in the storage module and linked to the end-of-sentence tokens in the storage module. The first language tokens are translated and the second language tokens are created in the target natural language. The end-of-sentence tokens are then used to retrieve from memory the annotation tokens and the links between the first language tokens and the second language tokens to recreate the original source document and determine where the annotation tokens should be inserted therein. Upon determining the locations for inserting each of the plurality of annotation tokens, the annotation tokens are inserted into the source document, which can subsequently be stored and used as a reference tool should further processing of the target document or the source document be desired. Additionally, during translation, undefined first language tokens can be stored in the storage module and linked to the end-of-sentence tokens, such that after translation, a list of the undefined first language tokens can be provided to a user of the system.
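The role of the end-of-sentence tokens as keys into the storage module can be sketched as follows. This is an illustrative Python sketch only: the `SentenceRecord` class, the `(kind, text)` token format, the `EOS1`/`EOS2` identifiers, and the word-for-word dictionary lookup are assumptions standing in for the storage module and translation engine, whose internals are not given here.

```python
from dataclasses import dataclass, field

@dataclass
class SentenceRecord:
    """Everything linked to one end-of-sentence token (hypothetical layout)."""
    source_words: list
    annotations: list = field(default_factory=list)   # (position, text)
    undefined: list = field(default_factory=list)
    target_words: list = field(default_factory=list)

def index_by_eos(tokens):
    """Store words and annotations under the end-of-sentence token."""
    storage, eos_ids = {}, []
    words, annotations = [], []
    for kind, text in tokens:
        if kind == "annotation":
            annotations.append((len(words), text))
        elif kind == "word":
            words.append(text)
        else:                      # kind == "end": text is the EOS identifier
            storage[text] = SentenceRecord(words, annotations)
            eos_ids.append(text)
            words, annotations = [], []
    return storage, eos_ids

def translate(storage, eos_ids, dictionary):
    """Toy word-for-word lookup; returns the undefined source words."""
    for eos in eos_ids:
        record = storage[eos]
        for word in record.source_words:
            if word in dictionary:
                record.target_words.append(dictionary[word])
            else:
                record.undefined.append(word)
                record.target_words.append(word)   # pass through unchanged
    return [w for eos in eos_ids for w in storage[eos].undefined]

tokens = [("word", "hello"), ("annotation", "<br>"), ("word", "world"),
          ("end", "EOS1"), ("word", "good"), ("word", "zorp"), ("end", "EOS2")]
storage, eos_ids = index_by_eos(tokens)
undefined = translate(storage, eos_ids,
                      {"hello": "hola", "world": "mundo", "good": "bueno"})
```

Because each record is reachable from its end-of-sentence identifier, the annotations, the source words, the translated words, and the list of undefined words for a sentence can all be retrieved after translation without the annotations ever having passed through the translator.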
In another aspect of the invention, the system comprises a computer means having a receiving module for receiving input textual information in a first language transmitted to the computer means by a computer input device, a processing module, a translation engine, and a storage module. The receiving module receives a source document in a first language comprising text and annotations. The processing module creates a first token string using the source document, where the token string comprises a plurality of first language tokens, a plurality of annotation tokens, and a plurality of end-of-sentence tokens. Each of the end-of-sentence tokens is inserted into the first token string at a location corresponding to a discontinuity in the text. The translation engine removes the plurality of annotation tokens from the first token string, translates the plurality of first language tokens to a plurality of second language tokens in a second token string, and creates a target document. In this embodiment, the plurality of end-of-sentence tokens can then be used to insert the annotations into a recreated source document. In an alternative embodiment, the annotations are inserted into the target document. The storage module includes an annotation database for storing the annotation tokens, in which the annotation tokens are linked to the end-of-sentence tokens, a dictionary source database for storing the first language tokens and the second language tokens, in which the end-of-sentence tokens provide links between the first language tokens and the second language tokens in the database, and an undefined tokens database for storing undefined first language tokens, in which the end-of-sentence tokens provide links to the undefined first language tokens in the undefined tokens database.
In another aspect of the present invention, the system preserves annotations such as HTML markup, SGML markup, RTF markup and NROFF markup in the source text. In one aspect of the invention, the processing module creates HTML tokens representing HTML markup in the source document. The storage module further includes a markup database for linking HTML markup with each first language token in the first token string to which the HTML markup applies. The translation engine can further access the markup database and compare the second token string with the HTML markup linked to the first language tokens to determine locations in the second token string where the HTML markup should be inserted.
In still another aspect of the invention, a method for translating an annotated source document in a first language to a target document in a second language having corresponding annotations comprises: receiving a source document in a first language, comprising a plurality of sentences having text and annotations; creating a first token string using the source document, the first token string comprising a plurality of first language tokens and a plurality of annotation tokens that apply to the first language tokens; removing the annotation tokens from the first token string; creating a plurality of annotation records for the first language tokens, each annotation record linking one of the first language tokens to each of the annotation tokens that apply to the first language token; storing the annotation records in a document state database; translating the plurality of first language tokens and creating a second token string comprising a plurality of second language tokens; determining, using the annotation records, at which locations in the second token string the annotation tokens should be inserted; and producing a target document in the second language using the second token string.
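The idea of annotation records can be sketched in Python under some loud assumptions: tags are simple, well-nested HTML-like tags; the translation and the alignment between target and source words are supplied by hand rather than by a real translation engine; and the names `build_records` and `emit` are hypothetical. The sketch shows only how a per-word record of applicable annotations lets markup follow a word when translation moves it.

```python
import re

def build_records(source):
    """Strip tags and record, per word, the tags that apply to it."""
    words, records, open_tags = [], [], []
    for piece in re.findall(r"<[^>]+>|[^\s<>]+", source):
        if piece.startswith("</"):
            open_tags.pop()                  # assumes well-nested tags
        elif piece.startswith("<"):
            open_tags.append(piece[1:-1])    # keep tag name and attributes
        else:
            words.append(piece)
            records.append(tuple(open_tags))
    return words, records

def emit(target_words, alignment, records):
    """Re-insert tags around target words using the source word's record."""
    out, current = [], ()
    for word, src_index in zip(target_words, alignment):
        wanted = records[src_index]
        common = 0
        while (common < min(len(current), len(wanted))
               and current[common] == wanted[common]):
            common += 1
        for tag in reversed(current[common:]):   # close innermost first
            out.append(f"</{tag.split()[0]}>")
        for tag in wanted[common:]:
            out.append(f"<{tag}>")
        out.append(word)
        current = wanted
    for tag in reversed(current):
        out.append(f"</{tag.split()[0]}>")
    return out

words, records = build_records("the <b>red</b> car")
# Hand-written toy translation: target word i comes from source word alignment[i].
target = emit(["el", "coche", "rojo"], [0, 2, 1], records)
print(target)   # ['el', 'coche', '<b>', 'rojo', '</b>']
```

In this toy example the bold markup ends up on "rojo", the translation of "red", even though the adjective has moved after the noun; placing the tags purely by adjacency would instead have wrapped "coche".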
In yet another aspect of the invention, the method of preserving annotations, particularly HTML markup annotations during translation, comprises determining whether any of the annotation tokens comprise HTML characters, determining whether the HTML characters comprise character entity references, substituting characters for the character entity references, determining whether any of the annotation tokens comprising HTML characters should not be preserved in the second token string, deleting the annotation tokens that should not be preserved, determining whether any of the tokens in the first token string should not be translated, removing the tokens that should not be translated from the first token string, storing the removed tokens, and inserting marker tokens into the first token string in the locations where the tokens were removed. In still another aspect of the invention, the method of preserving annotations during translation comprises determining whether the annotation tokens represent a discontinuity such as a section break or a sentence ending in the source text, inserting ending tokens representing the discontinuity into the first token string, and storing the tokens in the first token string up to the discontinuity in a database indexed by the ending token.
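The preprocessing steps listed above can be sketched in Python as follows. The sketch uses the standard library function `html.unescape` for character entity substitution; everything else (treating HTML comments as the annotations that are not preserved, the `__MARKn__` marker format, and the function names) is an assumption made for illustration and is not drawn from the text.

```python
import html

def preprocess(tokens, do_not_translate):
    """Substitute entities, drop unwanted annotations, insert markers."""
    out, removed = [], {}
    for tok in tokens:
        if tok.startswith("<!--"):
            # Assumed example of an annotation that should not be preserved.
            continue
        tok = html.unescape(tok)          # e.g. "&amp;" becomes "&"
        if tok in do_not_translate:
            marker = f"__MARK{len(removed)}__"   # hypothetical marker token
            removed[marker] = tok         # store the removed token
            out.append(marker)
        else:
            out.append(tok)
    return out, removed

def restore(tokens, removed):
    """Put the stored tokens back where their markers ended up."""
    return [removed.get(tok, tok) for tok in tokens]

prepared, removed = preprocess(
    ["AT&amp;T", "sells", "<!-- note -->", "phones"],
    do_not_translate={"AT&T"},
)
# A hand-written stand-in for the translator's output:
final = restore(["__MARK0__", "vende", "teléfonos"], removed)
```

The marker token travels through translation in place of the protected token, so the original text can be restored at whatever position the marker occupies in the translated token string.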
In still another aspect of the invention, the system for preserving annotations includes a means for receiving a user input such as an edit to a source document, an alternate text producer for producing alternate word tokens, and an alternate translator for processing an input from a user and providing translation options to the user.
These and other features of the invention will be more fully appreciated by reference to the following detailed description which is to be read in conjunction with the attached drawings.