The following relates to the translation arts, natural language translation arts, format conversion arts, and so forth.
In translation or conversion applications, information content in a source language or source format is converted to a target language or target format. For example, content in a source natural language such as French may be translated to a target natural language such as English. In another example, a document in a source format such as XML using one document type definition (DTD) may be converted to a target format such as XML using another DTD.
One general approach for natural language translation is the phrase-based approach, in which a database of bilingual source language-target language pairs is referenced. Portions of source language content to be translated are compared with the source language elements of bilingual pairs contained in the database to locate source language matches, and the translation is generated by combining the target language elements of matching bilingual pairs. Phrase-based translation approaches are useful in natural language translation because natural language content tends to deviate from the standard rules (i.e., “grammar”) relatively frequently, and such deviations are readily handled by a suitably comprehensive bilingual phrase database.
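By way of illustrative example only, the phrase-based matching described above may be sketched as follows. The sketch assumes a toy French-English phrase table and a greedy longest-match strategy; the names `PHRASE_TABLE` and `translate_phrase_based` and the table entries are hypothetical and are not taken from any particular system. Practical phrase-based systems typically score many competing segmentations rather than committing greedily.

```python
from typing import Dict, List, Tuple

# Hypothetical toy bilingual phrase table: source token tuple -> target phrase.
PHRASE_TABLE: Dict[Tuple[str, ...], str] = {
    ("le", "chat"): "the cat",
    ("mange",): "eats",
    ("une",): "a",
    # A collocation that does not translate word by word.
    ("pomme", "de", "terre"): "potato",
}


def translate_phrase_based(source: str,
                           table: Dict[Tuple[str, ...], str] = PHRASE_TABLE) -> str:
    """Greedy longest-match lookup of source phrases in the bilingual table."""
    tokens: List[str] = source.lower().split()
    max_len = max(len(key) for key in table)
    output: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                output.append(table[key])
                i += n
                break
        else:
            # No bilingual pair matches: pass the source token through unchanged.
            output.append(tokens[i])
            i += 1
    return " ".join(output)


print(translate_phrase_based("le chat mange une pomme de terre"))
```

In this sketch the collocation "pomme de terre" is rendered correctly only because the table contains it as a single bilingual pair, which illustrates why the comprehensiveness of the database matters.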
However, phrase-based translation performance depends on the comprehensiveness of the bilingual pair database, and can also depend on the text length of the bilingual pairs in the database. Matching short phrases produces many matches, but the short text length of the matches generally reduces matching reliability. Also, grammatical rules may be violated in combining the short phrases to construct the translation. At the opposite extreme, in a “translation memory” approach the bilingual pairs have long text lengths (possibly extending over multiple sentences or even multiple paragraphs), and an exact (or even close) match is likely to be correct. However, the number of matches is greatly reduced compared with short phrases.
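As a minimal sketch of the translation memory extreme, the following example looks up a whole source segment and accepts only an exact or close match. The memory contents, the function name `lookup_translation_memory`, and the 0.8 similarity threshold are assumptions made for illustration; the similarity measure is the token-level ratio provided by `difflib.SequenceMatcher` in the Python standard library, which is only one of many possible measures.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

# Hypothetical translation memory of long (segment-level) bilingual pairs.
TRANSLATION_MEMORY: List[Tuple[str, str]] = [
    ("le chat dort sur le canapé", "the cat sleeps on the sofa"),
    ("il pleut aujourd'hui", "it is raining today"),
]


def lookup_translation_memory(source: str,
                              threshold: float = 0.8) -> Optional[Tuple[str, float]]:
    """Return (target, similarity) for the closest stored segment, or None."""
    source_tokens = source.lower().split()
    best_target: Optional[str] = None
    best_score = 0.0
    for stored_source, stored_target in TRANSLATION_MEMORY:
        # Compare token sequences rather than characters.
        score = SequenceMatcher(None, source_tokens, stored_source.split()).ratio()
        if score > best_score:
            best_target, best_score = stored_target, score
    if best_target is None or best_score < threshold:
        return None  # Long segments rarely match: no translation is proposed.
    return best_target, best_score
```

A close match such as "le chien dort sur le canapé" retrieves the stored translation as a candidate for correction, whereas most unseen segments fall below the threshold and retrieve nothing, reflecting the reduced number of matches noted above.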
Another translation approach is the hierarchical grammar-based approach, in which a grammar including rewriting rules is used to parse the natural language content. The grammatical structures are hierarchically arranged. For example, a noun and a pronoun (and perhaps an adjective or so forth) are combined to form a noun phrase, which in turn is combined with a verb phrase (similarly built up from parts of speech such as a verb and an adverb) to form a sentence. The grammar used for translation applications is a synchronous grammar, in which grammatical structures (e.g., grammatical phrases such as noun phrases and verb phrases) in the source and target languages are matched up or synchronized. The translation process then amounts to parsing the source language content and using the synchronized target language grammatical structures together with a bilingual dictionary or lexicon to construct the target language translation.
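The synchronous grammar idea may be illustrated by the following minimal sketch, which assumes a toy French-English grammar. Each rule pairs a source-side right-hand side with the order in which the corresponding constituents appear on the target side, and a small bilingual lexicon translates the individual words. The rule set, the lexicon, and the names `RULES`, `LEXICON`, `parse`, and `translate_synchronous` are hypothetical; the parser is a simple top-down recognizer with limited backtracking, not a full chart parser.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical synchronous rules: (lhs, source_rhs, target_order).
# target_order lists indices into source_rhs in target-language order.
RULES: List[Tuple[str, List[str], List[int]]] = [
    ("S", ["NP", "VP"], [0, 1]),
    ("NP", ["Det", "N", "Adj"], [0, 2, 1]),  # French Det N Adj -> English Det Adj N
    ("NP", ["Det", "N"], [0, 1]),
    ("VP", ["V"], [0]),
]

# Hypothetical bilingual lexicon keyed by part of speech.
LEXICON: Dict[str, Dict[str, str]] = {
    "Det": {"le": "the"},
    "N": {"chat": "cat"},
    "Adj": {"noir": "black"},
    "V": {"dort": "sleeps"},
}


def parse(symbol: str, tokens: List[str], pos: int) -> Optional[Tuple[List[str], int]]:
    """Derive `symbol` from tokens[pos:]; return (target_words, next_pos) or None."""
    if symbol in LEXICON:
        if pos < len(tokens) and tokens[pos] in LEXICON[symbol]:
            return [LEXICON[symbol][tokens[pos]]], pos + 1
        return None
    for lhs, source_rhs, target_order in RULES:
        if lhs != symbol:
            continue
        cursor = pos
        parts: List[List[str]] = []
        for child in source_rhs:
            result = parse(child, tokens, cursor)
            if result is None:
                break  # This rule does not apply here; try the next rule.
            words, cursor = result
            parts.append(words)
        else:
            # Emit the children in the synchronized target-language order.
            return [w for idx in target_order for w in parts[idx]], cursor
    return None


def translate_synchronous(source: str) -> Optional[str]:
    """Translate a sentence, or return None if the grammar cannot parse it."""
    tokens = source.lower().split()
    result = parse("S", tokens, 0)
    if result is None or result[1] != len(tokens):
        return None
    return " ".join(result[0])


print(translate_synchronous("le chat noir dort"))
```

In this sketch the adjective reordering follows from the synchronized noun phrase rule rather than from any stored phrase, and an input that the grammar cannot parse yields no translation at all.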
Hierarchical grammar-based approaches are applicable to varying lengths of text and generate translations that comply with grammatical rules of the target language. However, hierarchical grammar-based approaches can fail when the source language content deviates from the standard grammar, for example in the case of certain collocations or terminological expressions. These approaches may also fail to capture target language translations that employ such grammar deviations in the target language.
Although described in terms of natural language translation, these considerations apply more generally to translation or conversion tasks in which source content structured in a source format is converted to a (different) target format, and in which the content may deviate from precise adherence to the formats. For example, electronic documents are typically structured, e.g., in XML in accordance with a document type definition (DTD). However, a document may occasionally deviate from the DTD. Such deviations may be handled in various ways, for example by applying default formatting during rendering.
The following sets forth improved methods and apparatuses.