The present invention relates to information retrieval, and more particularly to retrieving multilingual documents.
International business and corporate globalization require that companies communicate in many natural languages and dialects. For many instances of an event or an object, there are substantial amounts of interrelated reports, reviews, or descriptions written from different perspectives and in different natural languages. Multiple versions of a single internal document of a company may be generated, where each version has the same content but is expressed in a different natural language. A user may use a search query expressed in a natural language in which the user is proficient to find multilingual versions of a document.
Known information retrieval systems translate multilingual versions of a document into a single natural language, and then search across the translated versions using that single natural language. The translation of a document is often inaccurate, especially if the document includes terminology specific to the domain of a particular industry. To improve translation accuracy, users must expend effort to understand the meaning of the document and know how to translate the document from one language to another, including how to translate terminology that is specific to a particular industry's domain.
Other known information retrieval systems require multiple search engines to search for multilingual versions of a document, where a single search query is translated into multiple natural languages and each search engine searches for text in a corresponding one of those natural languages. The translation of the search query is subject to the same inaccuracies and difficulties mentioned above.
Still other known information retrieval systems perform multilingual searching based on identifying images in a first document expressed in a first natural language and then searching for documents expressed in other natural languages that include images similar to the images in the first document. However, different language versions of a document that have the same textual content do not always include similar images, so such image-based searching may generate inaccurate search results.
Accordingly, there is a need for a technique that efficiently searches for multilingual documents without requiring translation of the documents or the search query and without requiring multiple, language-specific search engines.