This invention pertains to multi-lingual document data warehousing. More particularly, the invention pertains to a system and method that can identify duplicates or near duplicates of a document in two different languages.
The Internet comprises a vast resource of information in the form of web pages. These web pages comprise text, graphics, video and other forms of information on a variety of topics, the range of which is coextensive with the vast range of users' interests. The Internet is a global network and thus serves a diverse multi-lingual community.
In the interest of serving the Internet's multi-lingual community, large organizations and companies may have very large web sites, built up over many years by many people. The sites can be so large that no single person has extensive knowledge of the entire site architecture. These sites may often contain multiple versions of documents written in different languages. In some cases different language versions of a web site may be located on different hosts or have separate domain names and be stored in separate directory structures. As the Internet continues to develop rapidly, there often arises the desire to revamp web sites. In the case of multi-lingual web resources (i.e., a single multi-lingual site, or multiple sites in different languages) a plan for revamping may include identifying different language versions of the same document as such. The plan might further include eliminating duplicative documents in favor of using a real time machine translation function to present the web page to the user, or it might alternatively include adding cross references between the different language versions of the web pages.
A third party, such as a search engine company, might also want to identify different language versions of the same document so that it can present information identifying the different language versions to a user.
Because some languages are laid out differently (for example, Japanese is often written vertically rather than horizontally, and Hebrew is written from right to left rather than from left to right), different language versions of the same web page may have a somewhat different Hyper Text Markup Language (HTML) structure in order to accommodate the layout of the particular language. Thus, a strict comparison on the basis of the HTML code structure alone cannot be relied on to identify different language versions of the same document.
The invention to be described makes use of machine translation. In connection therewith, it should be noted that machine translation does not produce an exact inverse function of the human language translation originally used to produce foreign language versions. There will be differences between the text output by a machine translation function and the original document. Therefore, direct string comparisons between the original document and the translation of the foreign language document back into the original language will generally not yield a match.
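By way of illustration only, the following Python sketch contrasts a direct string comparison with a tolerant, graded comparison. The two sample strings are hypothetical and were invented for this example; the word-overlap (Jaccard) score is likewise an assumption chosen for brevity and is not put forward as the measure used by the invention.

```python
def word_overlap(a: str, b: str) -> float:
    """Return the Jaccard overlap of the lowercase word sets of two strings."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical texts: an original sentence and what a machine translation
# of its foreign language version back into English might look like.
original = "Please read the safety instructions before use"
back_translated = "Please read the safety instructions before using"

# A direct string comparison fails on a single changed word...
exact_match = original == back_translated

# ...whereas a graded score still reflects that most words agree.
overlap = word_overlap(original, back_translated)
```

In this sketch the direct comparison reports no match, while the overlap score remains high because all but one word is shared. A graded measure of this kind can be compared against a threshold instead of requiring equality.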
What is needed is a system for identifying duplicate versions of web pages which may be written in two different languages.
What is further needed is a system for identifying different language versions of a document that can recognize that two documents are the same or similar notwithstanding slight differences in the formatting code (e.g., HTML) structure of the documents.
What is further needed is a system for identifying different language versions of the same document that is tolerant of the imperfections of machine translation.
Briefly, according to one aspect of the invention, a method of identifying different versions of the same structured document comprises steps of reading a first portion of text which occupies a first position in a first hierarchical structured document, reading a second portion of text which occupies a second position which is congruent to the first position in a second hierarchical structured document, and obtaining a quantitative measure of similarity of the first and second portions of text.
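The method summarized above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the class `PathTextExtractor` and the functions `text_by_position`, `similarity` and `compare_congruent` are hypothetical names; a position is assumed to be the path of tag names with sibling indices; the second document is assumed to have already been machine translated into the language of the first; and the character-level ratio from the standard `difflib` module stands in for whatever quantitative measure is actually used.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Elements that never have a closing tag; they are skipped when tracking position.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextExtractor(HTMLParser):
    """Collect text keyed by its position, e.g. 'html[1]/body[1]/p[2]'."""

    def __init__(self):
        super().__init__()
        self.stack = []       # path entries for the currently open elements
        self.counters = [{}]  # sibling counts per tag name, one dict per depth
        self.texts = {}       # position path -> text found at that position

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.counters[-1]
        counts[tag] = counts.get(tag, 0) + 1
        self.stack.append(f"{tag}[{counts[tag]}]")
        self.counters.append({})

    def handle_endtag(self, tag):
        # Assumes well-nested input: only close the innermost open element.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counters.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            key = "/".join(self.stack)
            self.texts[key] = (self.texts.get(key, "") + " " + text).strip()


def text_by_position(html: str) -> dict:
    """Read every portion of text together with its position in the document."""
    parser = PathTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.texts


def similarity(a: str, b: str) -> float:
    """Quantitative similarity in [0, 1]; 1.0 means the texts are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def compare_congruent(doc_a: str, doc_b_translated: str) -> dict:
    """Score the text at each position that exists in both documents."""
    a = text_by_position(doc_a)
    b = text_by_position(doc_b_translated)
    return {path: similarity(a[path], b[path]) for path in a if path in b}
```

Calling `compare_congruent` on two HTML strings returns a mapping from each shared position to a score between 0 and 1. Those scores could then be combined or thresholded to decide whether the two documents are versions of one another. Treating only identical paths as congruent is the simplest possible choice; since layouts may differ between languages, a looser notion of congruence may be needed in practice.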