1. Field of the Invention
The invention relates to a method, computer program product and data processing system for the detection of multilingual textual resources carrying the same information content.
2. Background
Textual resources such as news articles and user manuals are often available in several languages. The widespread and increasing use of the internet has increased the availability of such textual resources. Some such resources are made available in different languages by the same provider. Others are made available by different providers, for example the daily news published on the internet, where the same or similar news stories are often found in different languages. The texts of such articles may not be parallel; that is, they may not be exact translations of one another.
The availability of textual resources in different languages may be of enormous use to a user. For example, if a user requires a news article in a second language corresponding to a news article in a first language, the availability of such an article in the second language may give the user an accurate translation into the second language. This is of great benefit since machine translation tools may not give translations of an acceptable quality, and human translation, which does give an acceptable quality, can be very expensive. In another example, a user of a device may only have the user guide or instructions for the device in a language which is not their native language. In such a situation, the availability of the user guide and instructions in the user's native language may be of great benefit to the user.
Thus, the availability of multilingual textual resources can be of great benefit to users. There are, however, problems in detecting textual resources having the same content in different languages.
U.S. Pat. No. 6,993,471 proposes a system that translates HTML documents using machine translation software bundled in a browser. This allows a user to access textual resources in languages other than the language in which the textual resource was written; however, the quality of the output is limited by the quality of the machine translation software.
Another approach is to try to collect parallel sentences. For example, US 2005/0228643 discloses extracting a set of features from the output of a sentence alignment engine and using them to train a maximum entropy classifier to detect parallel sentences. Such a system works to detect parallel sentences, but is not suitable for detecting documents having the same textual content. Textual resources are rarely exact translations of each other, even where they come from the same source. There may also be considerable differences in length and in the level of detail.
U.S. Pat. No. 6,604,101 discloses a system that translates a query input by a user in a source language into a target language, and then searches for and retrieves web documents in the target language. Such an approach allows several documents in a second language that match a query in the first language to be found. It does not, however, allow a document in a second language having the same textual content as a document in a first language to be found.