The exemplary embodiment relates to the field of text processing. It finds particular application in connection with a text authoring system which supervises the authoring of text for applications such as statistical machine translation systems.
There are many applications where it is desirable to employ machine translation systems for translating text from a source language to a target language, such as in the preparation of manuals, textbooks, and the like. Because machine translation systems are prone to errors, which generally become more frequent as the complexity of sentence structure and language increases, authoring systems have been developed to supervise authors as they write text that is subsequently to be translated.
Authoring systems generally employ a computer program which evaluates the text according to specific criteria. For example, the authoring system often only accepts words from a predefined vocabulary. Additionally, the sentences that are accepted by the authoring system typically have a limited number of sub-clauses. This is because, for a machine translation system to work effectively, it needs to be able to recognize the antecedent of each pronoun. The antecedent is the word to which a specific pronoun refers. For example, in the phrase “the friend of my daughter, who is nice,” “who” may refer to “daughter” or “friend.” Depending on which is correct, the adjective “nice” would be translated into French as the masculine form “gentil” or the feminine form “gentille.”
Commonly, the authoring system uses surface elements, such as the frequency of the word “which” or the number of commas, to provide an estimate of the sentence complexity. Alternatively or additionally, the authoring system may place a limit on the average number of words per sentence in a block of text. The authoring system automatically checks the compliance of a user's text with its internal rules and rejects any non-compliant sentences.
For example, the authoring system may permit a maximum of one subordinate clause in a sentence, specify an average sentence length of 13 to 17 words (if the text consists of at least four sentences), and specify that no sentence should contain more than 20 words. Text which passes these stringent rules can be quite hard to read, in part because it lacks interest for the reader.
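The example rules above can be illustrated with a minimal sketch of a surface-level compliance check. It is not taken from any particular authoring system: the function names, the marker list used to approximate subordinate clauses, and the regular-expression sentence splitter are all assumptions made for illustration. Only the numeric limits (20 words, an average of 13 to 17 words, at least four sentences, one subordinate clause) come from the text.

```python
import re

# Limits taken from the example rules in the text.
MAX_WORDS_PER_SENTENCE = 20
AVERAGE_RANGE = (13, 17)
MIN_SENTENCES_FOR_AVERAGE = 4
MAX_SUBORDINATE_CLAUSES = 1

# Hypothetical marker list: a crude surface proxy for subordinate clauses.
SUBORDINATE_MARKERS = {"which", "who", "whom", "whose", "because", "although"}


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return the alphabetic words of a sentence."""
    return re.findall(r"[A-Za-z']+", sentence)


def check_text(text):
    """Return a list of human-readable rule violations (empty if compliant)."""
    problems = []
    counts = []
    for index, sentence in enumerate(split_sentences(text), start=1):
        tokens = tokenize(sentence)
        counts.append(len(tokens))
        if len(tokens) > MAX_WORDS_PER_SENTENCE:
            problems.append(f"sentence {index}: more than {MAX_WORDS_PER_SENTENCE} words")
        markers = sum(1 for t in tokens if t.lower() in SUBORDINATE_MARKERS)
        if markers > MAX_SUBORDINATE_CLAUSES:
            problems.append(f"sentence {index}: too many subordinate clause markers")
    # The average-length rule applies only to texts of at least four sentences.
    if len(counts) >= MIN_SENTENCES_FOR_AVERAGE:
        average = sum(counts) / len(counts)
        if not AVERAGE_RANGE[0] <= average <= AVERAGE_RANGE[1]:
            problems.append(f"average sentence length {average:.1f} is outside the allowed range")
    return problems
```

Because the check relies only on word counts and marker words, it can both reject acceptable sentences and accept ambiguous ones, which is the limitation of surface elements noted above.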
For accurate translations, it would be helpful to provide the translation system with information on the syntax, such as whether a noun is a subject or object of a sentence, and on syntactic dependencies, such as whether a noun in a sentence is the subject or object of a given verb. Syntactic parsers have been developed which are able to provide this type of information. However, when a sentence exceeds a certain level of complexity, the parsing is more prone to errors. For example, most parsers are able to extract the subject of a sentence with an accuracy of, at best, about 90%. Although good, this is still too low for use in a system where the smallest error may have disastrous consequences. For example, if an authoring system is used as an input to an automatic translation system, the smallest error might result in a faulty translation for many sentences. If the quality of the output of a syntactic parser could be assured, the quality of texts could also be improved, as more complex and richer sentences could be written by the author.