The present invention relates to a method of and an apparatus for retrieving information. The invention also relates to a storage medium containing a program for performing such a method. These techniques may be used in information management systems, such as information retrieval systems or "search engines", information filtering applications also known as information routing systems, and information extraction applications.
D.A. Hull and G. Grefenstette, "Querying across Languages: a Dictionary-Based Approach to Multilingual Information Retrieval", 19th Annual International Conference on Research and Development in Information Retrieval (SIGIR '96), pages 49-57, 1996, discloses a dictionary-based approach to cross-linguistic retrieval. In order to search for documents containing information of relevance to a chosen topic, a query is formulated by the searcher. A typical query comprises a short item of text, such as a sentence, which indicates the subject matter to be located. A document collection in the same language may then be searched by looking for matches between at least some of the words of the query and the full text of each document.
In order to search documents in a different "target" language from the "source" language of the query, the dictionary-based approach looks up the query terms in a bilingual dictionary. All possible translations of each source language query term are used to form a query in the target language and the matching process is then performed in the target language.
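By way of illustration only, the dictionary-based approach described above may be sketched as follows. The dictionary entries, the function names `translate_query_all` and `match_documents`, and the whitespace tokenisation are assumptions made for this sketch and are not taken from the cited paper.

```python
# Toy English-to-French dictionary; the entries are illustrative assumptions.
TOY_DICTIONARY = {
    "bank": ["banque", "rive"],
    "interest": ["intérêt"],
    "rate": ["taux", "tarif"],
}


def translate_query_all(query, dictionary):
    """Return every dictionary translation of every query term, in query order."""
    target_terms = []
    for term in query.lower().split():
        for translation in dictionary.get(term, []):
            if translation not in target_terms:
                target_terms.append(translation)
    return target_terms


def match_documents(target_terms, documents):
    """Return the indices of documents sharing at least one word with the query."""
    wanted = set(target_terms)
    return [i for i, doc in enumerate(documents) if wanted & set(doc.lower().split())]
```

With the toy data, the query "bank interest rate" yields the target terms "banque", "rive", "intérêt", "taux" and "tarif", so that a document about a river bank ("rive") is matched as readily as one about a financial institution.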
This technique therefore allows a searcher to formulate a query in a language which is different from the language of the documents to be searched.
In known cross-linguistic retrieval systems employing query translation techniques, all terms (words and collocations) of the query are translated into the target language and either all possible translations of each query term are used or one deterministically preferred translation of each term is used to form the target language query. However, both of these approaches have disadvantages.
Selecting all possible translations of source language query terms may lead to the retrieval of many documents which are not relevant to the query. This is because source language words have different meanings in different contexts and, based on these, have different preferred translations. Given the large number of documents available in typical information systems, this may mean that it is difficult for a searcher to identify the documents needed among the large number of irrelevant documents which may be identified.
Use of only the preferred translation of each query term avoids the retrieval of large numbers of irrelevant documents. However, known machine translation systems are of limited accuracy and would frequently select an inappropriate translation as the preferred translation. Thus, whenever the translation system selects the wrong translation, the information retrieval system is unlikely to identify documents which are relevant to the subject matter which is sought.
Techniques exist for analysing source language text to identify co-occurring words or collocations in an attempt to use contextual information in order to improve translation accuracy. Such a process aids the selection of sensible translations because there are fewer possible translations of a collocation than of its separate constituent words. For example, the collocation "make use of" has only a few translations into target languages whereas the frequently used terms "make", "use" and "of" give rise to a large number of translation terms.
Although using collocations assists in limiting the number of target language query terms which are generated, most known systems are only capable of recognising continuous collocations, i.e. words co-occurring next to each other. In practice, a substantial number of collocations in real languages are non-continuous. For example, the collocation "make use of" may occur in natural language documents as "make good use of", spanning the word "good" so as to be a non-continuous collocation.
EP 0 813 160 and GB 2 314 183 disclose a glosser for identifying and translating continuous and non-continuous collocations. A "glosser" enables an (ordered) plurality of source language words (or collocations) to be labelled with target language translations.
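By way of illustration only, the difference between continuous and non-continuous collocations may be sketched as follows. The function `find_collocation` and its `max_gap` parameter are hypothetical and do not represent the technique of any of the documents cited herein; the sketch merely permits a bounded number of intervening words between consecutive words of a collocation.

```python
def find_collocation(tokens, collocation, max_gap=1):
    """Return the token indices matching `collocation` in order, or None.

    Up to `max_gap` intervening tokens are allowed between consecutive
    collocation words; max_gap=0 recognises continuous collocations only.
    The search is greedy (first candidate in each window), without backtracking.
    """
    for start, token in enumerate(tokens):
        if token != collocation[0]:
            continue
        indices = [start]
        pos = start
        for word in collocation[1:]:
            # Window of tokens in which the next collocation word may appear.
            window = tokens[pos + 1 : pos + 2 + max_gap]
            if word not in window:
                break
            pos = pos + 1 + window.index(word)
            indices.append(pos)
        else:
            return indices
    return None
```

For the token sequence "we make good use of it", a gap of one word allows "make use of" to be recognised across "good", whereas a gap of zero does not.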
Another disadvantage of known arrangements of the type described hereinbefore is that identified documents are presented to the searcher in the target language. Thus, although a searcher who is unfamiliar with the target language can retrieve documents of relevance or interest in the target language, such a searcher cannot then check the relevance and content of retrieved documents unless he or she is familiar with the target language. Thus, although known techniques for cross-linguistic information retrieval may be used, the efficacy of such information retrieval can only be checked by searchers who have sufficient familiarity with the target language not to need to use such techniques.
GB 2 320 773 relates to an automatic translation technique which is principally intended for use on the Internet. It is based on searching by character strings for useful documents or files and selecting the most appropriate translation environment (such as a glosser or machine translation system) for located documents on the basis of the character string. Any translation which occurs is performed exclusively on located documents by the most appropriate translation environment for the subject matter as identified by the character string.
WO 97/18516 is specifically concerned with translating Web pages while preserving the original appearance. An HTML document is pre-processed by placing notional barriers around the HTML codes so as to preserve them. The remaining text and data outside these barriers are then translated to the desired language. Finally, the barriers are removed so that the pages retain their original format or appearance but all relevant text is translated into the desired language. Queries are formulated conventionally in the usual address codes and undergo no processing but are merely used to access desired documents.
WO 97/08604 discloses an information retrieval system which is based on translating queries and documents. However, this technique makes use of a language-independent conceptual representation of each query and of each document which is available for searching. Thus, in order for the system to work, all documents must first be subjected to a "translation" procedure in which the conceptual representation of the document subject matter is formed. Queries are similarly processed and searching is performed by matching the conceptual representations.
According to a first aspect of the invention, there is provided a method of retrieving information from a plurality of documents in a target language using a query in a source language, comprising converting the query into the target language using a multilingual resource, forming a query in the target language from the converted query, applying the query in the target language to an information management system, and converting at least part of the or each document in the target language identified by the information management system into the source language using the multilingual resource.
A multilingual resource is any system which is capable of converting a term (word or collocation) in the source language into one or more equivalent terms in the target language. An information management system is any system which is capable of identifying documents containing terms which are applied to the system as a query.
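By way of illustration only, the steps of the method may be sketched as follows. The word tables `EN_TO_FR` and `FR_TO_EN` stand in for a single multilingual resource used in both directions, and `search` stands in for an information management system; all names and data are assumptions made for this sketch.

```python
# Hypothetical toy resource: one word list used in both directions.
EN_TO_FR = {"cat": ["chat"], "garden": ["jardin"]}
FR_TO_EN = {"chat": ["cat"], "jardin": ["garden"], "le": ["the"], "dort": ["sleeps"]}

DOCUMENTS = [
    {"title": "le chat dort"},
    {"title": "le jardin public"},
    {"title": "la voiture rouge"},
]


def convert(terms, table):
    """Stand-in multilingual resource: all translations of each term."""
    return [table.get(term, []) for term in terms]


def form_target_query(converted):
    """Flatten the converted query into a set of target-language terms."""
    return {t for alternatives in converted for t in alternatives}


def search(target_query, documents):
    """Stand-in information management system: keep documents sharing a term."""
    return [d for d in documents if target_query & set(d["title"].split())]


def gloss_back(text, table):
    """Convert target-language text word by word; unknown words are left as-is."""
    return " ".join("/".join(table.get(word, [word])) for word in text.split())


def retrieve(source_query, documents):
    """Convert the query, search, then convert each hit's title back."""
    converted = convert(source_query.lower().split(), EN_TO_FR)
    hits = search(form_target_query(converted), documents)
    return [(d["title"], gloss_back(d["title"], FR_TO_EN)) for d in hits]
```

Because the same resource is used in both directions in this sketch, the source-language rendering of each retrieved title tends to contain the searcher's own query terms.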
The source and target languages are preferably natural languages.
The multilingual resource may be a bilingual glosser. The glosser may identify and translate each term of the source language query. The glosser may identify and translate terms which are collocations but may not translate the individual words of the collocations. For each term having more than one translation, the glosser may supply more than one of the translations.
The target language query may include at least some of any terms in the source language query which cannot be converted into the target language by the multilingual resource.
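By way of illustration only, the inclusion of unconverted terms may be sketched as follows. The function `form_query_with_passthrough` and the lookup table are hypothetical; the sketch carries over every unconverted term, whereas only some of them need be included.

```python
def form_query_with_passthrough(source_terms, lookup):
    """Form a target-language query, carrying over terms with no translation."""
    target_terms = []
    for term in source_terms:
        translations = lookup.get(term)
        if translations:
            target_terms.extend(translations)
        else:
            # No conversion available (for example a proper name): keep the term.
            target_terms.append(term)
    return target_terms
```

Such a pass-through may be useful for proper names and technical terms, which often appear unchanged in target-language documents.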
The at least part of each document may comprise a title of the document. The at least part of each document may comprise an abstract or abridgement of the document. The at least part of each document may comprise a sentence containing terms which match the query in the target language.
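By way of illustration only, the selection of sentences containing terms which match the target language query may be sketched as follows. The function `matching_sentences` and the simple punctuation-based sentence splitting are assumptions made for this sketch.

```python
import re


def matching_sentences(body, target_terms):
    """Return the sentences of `body` containing at least one target query term."""
    wanted = {term.lower() for term in target_terms}
    # Split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", body.strip())
    return [s for s in sentences if wanted & set(re.findall(r"\w+", s.lower()))]
```

Only the selected sentences then need be converted into the source language for presentation to the searcher.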
According to a second aspect of the invention, there is provided an apparatus for retrieving information from a plurality of documents in a target language using a query in a source language, characterised by comprising a multilingual resource for converting the query into the target language, means for forming a query in the target language from the converted query, and means for applying the query in the target language to an information management system, the multilingual resource being arranged to convert at least part of the or each document in the target language identified by the information management system into the source language.
The multilingual resource may be a bilingual glosser. The glosser may be arranged to identify and translate each term of the source language query. The glosser may be arranged to identify and translate terms which are collocations but not to translate the individual words of the collocations. For each term having more than one translation, the glosser may be arranged to supply more than one of the translations.
The query forming means may be arranged to include in the target language query at least some of any terms in the source language query which cannot be converted into the target language by the multilingual resource.
The apparatus may comprise a programmed data processor.
According to a third aspect of the invention, there is provided a storage medium characterised by containing a program for controlling a data processor of such an apparatus.
The glosser is preferably of the type disclosed in EP 0 813 160 and GB 2 314 183, the contents of which are incorporated herein by reference.
It is thus possible to perform cross-linguistic information retrieval in such a way that retrieved documents can be examined for relevance by a searcher who is unfamiliar with the target language of the documents. An advantage of using the same multilingual resource for forming a query and for converting into the source language at least part of the or each identified document is that the terms of the converted document or part thereof in the source language are likely to be the same as or similar to the terms used in the source language query. Thus, a searcher who is unfamiliar with the target language can determine with higher precision whether identified target language documents are indeed relevant to the query. The efficacy of cross-linguistic retrieval may therefore be substantially improved irrespective of whether a searcher is familiar with the target language.
An advantage of using a non-deterministic glosser such as that disclosed in EP 0 813 160 and GB 2 314 183 is that it generates a preferred translation for each term but also generates a variety of alternative translations, for instance using contextual information in a sentence where available. This considerably limits the number of alternative translations generated. Also, alternative translations may be ranked according to a criterion indicating the likelihood of each translation being correct. Thus, the number of translations actually used in target-language query formulation may be adjusted to the requirements of a searcher.
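By way of illustration only, adjusting the number of ranked translations used in query formulation may be sketched as follows. The function `select_translations`, the parameter `max_per_term` and the numerical scores are hypothetical; the actual ranking criterion of the glosser is not reproduced here.

```python
def select_translations(ranked_alternatives, max_per_term):
    """Keep the `max_per_term` highest-scoring translations of each term.

    `ranked_alternatives` maps each source term to (translation, score) pairs,
    where a higher score indicates a translation more likely to be correct.
    """
    query = []
    for term, alternatives in ranked_alternatives.items():
        best = sorted(alternatives, key=lambda pair: pair[1], reverse=True)
        query.extend(translation for translation, _ in best[:max_per_term])
    return query
```

A searcher wanting fewer, more precise results might set `max_per_term` to 1, whereas a searcher wanting broader coverage might choose a larger value.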