An aligned corpus consists of words, phrases and sentences in a first language, mapped onto substantially similar words, phrases or sentences in a second language. The aligned corpus is used in automated translation systems in which, given a word, phrase or sentence in a first language, the equivalent in the second language may be obtained. Similarly, given a word, phrase or sentence in the second language, its equivalent in the first language may be obtained. This principle may be extended, such that a multi-lingual system may be provided, so that, given a word, phrase or sentence in any of the languages available, all the others may be translated simultaneously.
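The two-way lookup described above can be pictured with a minimal sketch. It assumes exact string matching, and the class name `AlignedCorpus` and its method names are hypothetical, introduced only for illustration.

```python
class AlignedCorpus:
    """Toy two-way store of aligned text portions (hypothetical helper)."""

    def __init__(self):
        self._forward = {}   # first language -> second language
        self._backward = {}  # second language -> first language

    def add(self, first, second):
        """Record one alignment so that it can be looked up in either direction."""
        self._forward[first] = second
        self._backward[second] = first

    def to_second(self, first):
        """Return the second-language equivalent, or None if not aligned."""
        return self._forward.get(first)

    def to_first(self, second):
        """Return the first-language equivalent, or None if not aligned."""
        return self._backward.get(second)


corpus = AlignedCorpus()
corpus.add("Switch off the power.", "Strom abschalten.")
```

A multi-lingual variant could hold one such mapping per language pair, or key every portion to a shared entry identifier.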
A system for translating text is shown in FIG. 1 and provides an environment for employing an aligned corpus.
Operating instructions and data from the aligned corpus are supplied to a processing unit 15 from a hard magnetic disk drive 16. A floppy disk drive 17 receives floppy disks containing an input text, in a first language, and also receives data relating to an output text in a second language, which is written to a separate file on the floppy disk. At the end of the process, the floppy disk holds the original file of the input text plus, in a separate file, the translated output text.
In the 1950s and 60s it was commonly believed that an all-purpose translating system would become available in the not too distant future. It was then realised that such a system was much further off, and possibly would never be implemented, given the problem of including sufficient background information to facilitate intelligent translation. However, it was also appreciated that translation within a smaller, specialised field would be possible, given that many words which have many different meanings tend to have a much more limited range of meanings within the confines of a specialist field of activity.
However, a problem of creating a translation system for operation within a specialist field of activity is that of generating aligned corpora, given that a corpus generated for one field of activity would probably not be suitable for application in another field of activity. Thus, it would be necessary for users working in each field to generate their own corpora. Consequently, this problem has tended to negate the use of such automated systems and reliance continues to be made upon human translators.
The system shown in FIG. 1 could be used as an assistant to a translator, rather than as a replacement for one. Thus, each sentence, or part of a sentence, could be displayed on an output device, such as a visual display unit 18, while information could be supplied to the processing unit 15 via an input device, such as a keyboard 19.
The operation of such a system could be in the form as shown in FIG. 2. As previously stated, an aligned corpus 21 is resident on the hard magnetic disk drive 16, or similar device, an input file is resident on the floppy disk drive 17, or similar device and the output file is written, after being generated by the processing unit 15, to the floppy disk drive 17. In an alternative arrangement, two floppy disk drives could be provided and the output file could be written to the second drive. Alternatively, the output file could be written to the hard disk drive unit 16 or to any other suitable storage device.
Documents are processed on a page by page basis. The flow chart shown in FIG. 2 therefore describes operation of the system with reference to a single page. A page may be loaded which does not actually contain any information and it is important that the system does not become locked-out when it has no information to process. At step 24 the question is posed as to whether the end of the page has been reached. If yes, the process stops at step 25. Normally, the page will contain text, therefore the first sentence of the input file is read at step 26. An enquiry is now made to the aligned corpus 21 to ask whether the sentence under consideration is present within the corpus, at step 27. If the input sentence is present in the corpus, the aligned output sentence is returned from the corpus and at step 28 the translated form of the sentence is written to the output file. In one embodiment, the operator may be asked to check the translation, by means of the translation being supplied to the visual display unit 18, before the data is actually written to the output file. However, in the embodiment detailed in FIG. 2, the translation is made automatically, so as to improve processing speed.
If, in response to the enquiry made at step 27, the input sentence is not present in the corpus, the operator is prompted to provide an input, via the keyboard 19, of the correct translation, at step 29. At step 30, the translation provided by the operator is written to the destination file and an enquiry is made to the operator, at step 31, enquiring as to whether the new translation should be added to the corpus. If the operator responds in the affirmative, the new alignment is added to the corpus at step 32. If the operator's response is negative, step 32 is ignored.
Thus, in response to each requirement to translate a sentence, three responses become possible. In the first, the translation is present in the corpus and the translation is automatically written to the output file. Alternatively, the sentence is not present in the corpus, an input is provided by the operator and the translation is then added to the corpus after being written to the output file. Thirdly, the sentence is not present in the corpus, again an input is provided by the operator but this time the new translation is not added to the corpus.
After writing a sentence to the output file, operation returns to step 24, at which time the enquiry is made again as to whether the system has reached the end of the page. If the response to this enquiry is negative, another sentence is read at step 26 and the procedure is repeated. At the end of the page, as previously stated, the procedure stops at step 25.
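The page-processing loop of FIG. 2 can be sketched as follows. This is an illustrative outline only: the corpus is assumed to be a plain dictionary, and the function name `translate_page` and the two callbacks standing in for the operator (keyboard 19 and display 18) are hypothetical.

```python
def translate_page(sentences, corpus, ask_translation, ask_add_to_corpus):
    """Translate one page, consulting the operator for unknown sentences.

    sentences         -- input sentences of the page, in order
    corpus            -- dict mapping input sentences to aligned translations
    ask_translation   -- callback returning the operator's translation (step 29)
    ask_add_to_corpus -- callback returning True to store the new alignment (step 31)
    """
    output = []
    # Steps 24 to 26: the loop ends at the end of the page, so an empty
    # page simply produces an empty output instead of locking the system.
    for sentence in sentences:
        if sentence in corpus:                        # step 27
            output.append(corpus[sentence])           # step 28
            continue
        translation = ask_translation(sentence)       # step 29
        output.append(translation)                    # step 30
        if ask_add_to_corpus(sentence, translation):  # step 31
            corpus[sentence] = translation            # step 32
    return output


# Example run with a scripted "operator" in place of the keyboard.
demo_corpus = {"Hello.": "Bonjour."}
demo_output = translate_page(
    ["Hello.", "Goodbye."],
    demo_corpus,
    ask_translation=lambda sentence: "Au revoir.",
    ask_add_to_corpus=lambda sentence, translation: True,
)
```

After this run the corpus also holds the operator's new alignment, so the same sentence would be translated automatically on a later page.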
Thus, it can be seen that, on the assumption that similar subject matter is being translated repeatedly, the system will learn and entries within the corpus will expand. The knowledge base of the corpus will increase and, eventually, an operator providing manual translations will no longer be required and an operator of minimal skill may be allowed to take over. Possibly, several systems may run in parallel and a manual translator may be required occasionally to assist non-skilled operators.
A problem with the system shown in FIG. 2 is that it may take significant resources to build up the corpus to the point where the non-skilled operator may take over. Initially, it is likely that use of the system will actually take longer than a straightforward manual translation. Furthermore, it is also highly likely that systems, possibly operating within the same office, will develop differently, with a corpus on one being significantly different from a corpus on another, such that operators would appear to be working at different rates, again leading to further unpredictability.
Methods for automatic generation of aligned corpora have been described for example by W A Gale and K W Church in "A Program for Aligning Sentences in Bilingual Corpora", and by P F Brown et al. in "Aligning Sentences in Parallel Corpora", both in the Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley Calif. In these systems, the portions used correspond to sentences, and alignment is performed by comparing the lengths of sentences, either in the number of words (Brown et al.) or the number of characters (Gale and Church).
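The general idea of length-based alignment can be illustrated with a much simplified sketch. It is not the probabilistic model of either cited paper: the cost used here is simply the absolute difference in character counts plus an assumed fixed penalty for any grouping other than one sentence to one sentence, and the function name `align_by_length` is hypothetical.

```python
def align_by_length(src, tgt, penalty=1):
    """Align two sentence lists by minimising character-length mismatch.

    Returns a list of (source indices, target indices) groupings.
    """
    inf = float("inf")
    # Allowed groupings: 1-1, deletion, insertion, 2-1 and 1-2.
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for a, b in beads:
                if i + a > n or j + b > m:
                    continue
                len_s = sum(len(s) for s in src[i:i + a])
                len_t = sum(len(t) for t in tgt[j:j + b])
                step = abs(len_s - len_t) + (0 if (a, b) == (1, 1) else penalty)
                if cost[i][j] + step < cost[i + a][j + b]:
                    cost[i + a][j + b] = cost[i][j] + step
                    back[i + a][j + b] = (a, b)
    # Walk the back-pointers from the end to recover the cheapest path.
    result = []
    i, j = n, m
    while (i, j) != (0, 0):
        a, b = back[i][j]
        result.append((list(range(i - a, i)), list(range(j - b, j))))
        i, j = i - a, j - b
    result.reverse()
    return result


# The second source sentence is as long as the last two target sentences together.
groups = align_by_length(["aaaa", "bbbbbbbbbb"], ["AAAA", "BBBBB", "CCCCC"])
```

In this example the first sentences are paired one to one and the second source sentence is grouped with the remaining two target sentences.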
Both of these references exploit the availability of the Canadian Hansard in two languages, French and English. Brown et al. further exploit the presence of descriptive mark-up codes in the Hansard texts, for example codes indicating the times of speeches, the names of the speakers and so on. These codes are used to define anchor points in the text, and preference is given to sentence alignments which preserve the alignment of the anchor points. Of course, descriptive markers are not available in documents in general, and are not often in a common language, even when they are present.
It is an object of the present invention to provide an improved system for generating useable aligned corpora. It is also an object of the present invention to provide a plurality of copies of corpora which may be used efficiently within a translating environment.
The inventors have recognised that, in many cases, the similar documents which are to be used as the source texts are available in a form which contains presentational formatting data, for example specifying the size or font to be used for output, indentations, tabulations and other layout attributes. Provided that the two source documents have similar presentational attributes, formatting data included in the source files can be used to assist in the alignment.
Accordingly, a first aspect of the invention provides methods and systems for aligning source texts of different natural languages to produce or add to an aligned corpus, wherein source text files representing similar information in different natural languages are read, and information aligning similar text portions from respective files is recorded, characterised in that said source text files have similar presentational attributes, and in that the alignment is performed with reference to presentational formatting data present within said text files.
The formatting data may be non-textual data, for example word processing commands. Where different word processors have been used and generate different, possibly non-textual formatting commands, these may be converted to generic forms prior to performing the alignment.
If the formatting data are converted to textual forms prior to performing the alignment, standard text file comparison means can be used to identify alignments.
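A minimal sketch of such a conversion is given below. The two command tables are invented for illustration: neither the control codes nor the generic token names such as `[BOLD]` come from any real word processor, and the function name `to_generic` is hypothetical.

```python
# Hypothetical command tables for two different word processors.
# Both map their own control codes onto the same generic textual tokens.
WP_A_COMMANDS = {"\x02": "[BOLD]", "\x09": "[TAB]", "\x0c": "[PAGE]"}
WP_B_COMMANDS = {"\x1bB": "[BOLD]", "\x1bT": "[TAB]", "\x1bP": "[PAGE]"}


def to_generic(text, commands):
    """Return the file content as a list of lines, one command or text run per line."""
    for raw, token in commands.items():
        # Put every generic token on a line of its own.
        text = text.replace(raw, "\n" + token + "\n")
    return [line for line in text.split("\n") if line]


english = to_generic("\x02Safety\x02\x09Switch off the power.", WP_A_COMMANDS)
german = to_generic("\x1bBSicherheit\x1bB\x1bTStrom abschalten.", WP_B_COMMANDS)
```

Once both files are in this form, the formatting commands appear as identical text lines in each, so an ordinary line-oriented comparison can treat them as common reference points.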
As an alternative to aligning sentences, it may be advantageous for certain classes of documents to use the formatting data actually to delimit the aligned text portions.
Thus the problem of generating an aligned corpus is effectively resolved by making use of texts in machine readable form. In particular, reliance is made upon correlated texts in different natural languages. Two texts are considered to be correlated, as defined herein, when they convey the same information but in different natural languages. In addition, each page of the correlated texts may contain substantially the same information, but in different languages, laid out in a similar format. Thus, titles, tables and character modifications may all be present at substantially similar positions.
The invention can be of particular use in the production of multi-lingual product documentation. Many products are sold with sophisticated documentation, explaining exactly how the product operates. Sometimes, such documentation may run to many hundred pages and must be generated in many different natural languages. Consequently, the cost of producing such documentation becomes a significant part of the total cost for the product itself. Furthermore, the time incurred in generating such documentation may result in a significant delay being introduced between the date on which the product is available for market and the date on which the technical manual is available to accompany the product. This often results in badly written and badly translated documentation, in an attempt to get the product to market early. Alternatively, further delay may result in potential sales being lost to competitors.
Many organisations have produced a large number of manuals, in which each translation is correlated to the original text. Thus, for each translation, the same word processing (WP) system has been used as for the original and the same formatting has been used. Thus, each page of the manual in a first language looks, at first sight, similar to the equivalent page in the equivalent manual of a different language, in that headings, paragraphs and drawings etc. all appear in more or less the same place. However, the actual words within the text are different, in accordance with a particular natural language being used. It is therefore apparent that a great deal of source material is often available which, employing the present invention, may be used to produce aligned corpora which are immediately useable by unskilled operators. Furthermore, such a procedure will produce corpora that are consistent, thereby ensuring that all machines using copies of the same corpus are equivalent.
In certain embodiments, each word processor (WP) file is converted into an intermediate file, in which data relating to specific WP commands, unique to a particular WP system, are converted into a general identifiable form. Thereafter, reference is made to the identifiable WP commands, as a means of aligning the text held between the layout commands, which have been placed into identifiable form.
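By way of illustration, the following sketch pairs the text held between identifiable commands in two intermediate files. It assumes that each intermediate file is held as a single string, that commands take the hypothetical form of an upper-case name in square brackets, and that both files contain the same commands in the same order. The function names are likewise hypothetical.

```python
import re

# Hypothetical syntax for an identifiable command, e.g. [BOLD] or [PARA].
COMMAND = re.compile(r"(\[[A-Z]+\])")


def split_on_commands(intermediate):
    """Return (commands, text portions) for one intermediate file."""
    parts = COMMAND.split(intermediate)
    commands = parts[1::2]                    # odd positions hold the commands
    texts = [p.strip() for p in parts[0::2]]  # even positions hold the text
    return commands, texts


def align_between_commands(first, second):
    """Pair the text portions lying between the same layout commands."""
    commands_a, texts_a = split_on_commands(first)
    commands_b, texts_b = split_on_commands(second)
    if commands_a != commands_b:
        raise ValueError("layout commands differ between the two files")
    # Skip positions where either file has no text between two commands.
    return [(a, b) for a, b in zip(texts_a, texts_b) if a and b]


pairs = align_between_commands(
    "[BOLD]Safety[BOLD][PARA]Switch off the power.",
    "[BOLD]Sicherheit[BOLD][PARA]Strom abschalten.",
)
```

Here the commands act purely as delimiters, so the aligned portions may be headings, paragraphs or table entries rather than sentences.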
In a preferred embodiment, different WP commands for different WP systems are converted to similar identifiable commands in the respective intermediate file. It is then possible to identify alignable text by comparing files to identify differences between the files, wherein identifiable WP commands are not different between the files. Text portions identified as being different are written to the aligned corpus.
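The comparison step can be sketched with the `difflib` module of the Python standard library, used here as a stand-in for whatever file comparison means an implementation might choose. The sketch assumes intermediate files held as lists of lines, with each generic command (in a hypothetical bracketed form) on a line of its own; the function name `align_by_diff` is hypothetical.

```python
from difflib import SequenceMatcher


def align_by_diff(lines_a, lines_b):
    """Return (first-language text, second-language text) pairs.

    Lines common to both files, expected to be the generic WP commands,
    serve as anchors. Each run of lines that differs between the files
    is taken as a pair of aligned text portions.
    """
    matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":  # differing text lying between common commands
            aligned.append((" ".join(lines_a[i1:i2]), " ".join(lines_b[j1:j2])))
    return aligned


file_a = ["[BOLD]", "Safety", "[BOLD]", "[PARA]", "Switch off the power."]
file_b = ["[BOLD]", "Sicherheit", "[BOLD]", "[PARA]", "Strom abschalten."]
new_entries = align_by_diff(file_a, file_b)
```

Text present in only one file shows up as an insertion or deletion rather than a replacement, and is left out of the result instead of being paired with unrelated text.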
The invention yet further provides methods and apparatus for automatic translation, wherein information of alignments between text portions has been generated and stored by use of the invention as set forth above.