A user may provide a query in a source language with the intent of retrieving information from a collection of documents that are expressed in a target language. A Cross-Language Information Retrieval (CLIR) system may be employed to enable this mode of operation. The CLIR system converts the query terms expressed in the source language to their respective counterparts in the target language. For this purpose, the CLIR system may make reference to a dictionary which maps terms in the source language to corresponding terms in the target language. After conversion, the CLIR system can search the collection of documents using the converted query terms.
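By way of illustration only, the dictionary-based conversion and subsequent search described above may be sketched as follows. The dictionary entries, the document collection, and the function names convert_query and search are hypothetical and are chosen for this example; an actual CLIR system would typically use a far larger dictionary and a ranked retrieval model rather than simple token matching.

```python
# Hypothetical toy bilingual dictionary (English -> Spanish); entries are illustrative only.
BILINGUAL_DICT = {
    "president": ["presidente"],
    "house": ["casa"],
    "white": ["blanca", "blanco"],
}

# Hypothetical target-language (Spanish) document collection.
DOCUMENTS = [
    "la casa blanca",
    "el presidente habla",
    "un perro corre",
]


def convert_query(query, dictionary):
    """Map each source-language term to its target-language counterparts."""
    converted = []
    for term in query.lower().split():
        # Terms absent from the dictionary contribute nothing in this sketch.
        converted.extend(dictionary.get(term, []))
    return converted


def search(documents, terms):
    """Return indices of documents containing at least one converted term."""
    hits = []
    for index, document in enumerate(documents):
        tokens = set(document.lower().split())
        if any(term in tokens for term in terms):
            hits.append(index)
    return hits


converted_terms = convert_query("white house", BILINGUAL_DICT)
matching_documents = search(DOCUMENTS, converted_terms)
```

In this sketch, a source term that has several target-language counterparts contributes all of them to the converted query, which is one simple way of handling translation ambiguity.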
The dictionary used by the CLIR system typically cannot account for all of the terms that a user may input as a query. For example, proper nouns and other domain-specific terminology represent a wide class of information that is continually evolving. Hence, the dictionary used by the CLIR system cannot keep abreast of such information. Any query term that is not found in the dictionary is referred to herein as an out-of-vocabulary (OOV) query term.
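As a minimal sketch of the definition above, an OOV query term may be identified as any query term for which the dictionary has no entry. The dictionary contents and the function name split_query_terms are assumptions made for this example.

```python
# Hypothetical toy dictionary (English -> Spanish); entries are illustrative only.
SAMPLE_DICT = {
    "president": ["presidente"],
    "speech": ["discurso"],
}


def split_query_terms(query, dictionary):
    """Separate query terms into in-vocabulary terms and OOV terms."""
    in_vocabulary, oov = [], []
    for term in query.lower().split():
        if term in dictionary:
            in_vocabulary.append(term)
        else:
            # No dictionary entry exists, so the term is treated as OOV.
            oov.append(term)
    return in_vocabulary, oov


known_terms, oov_terms = split_query_terms("President Nixon speech", SAMPLE_DICT)
```

Here the proper noun in the sample query is the only term lacking a dictionary entry and is therefore the only term classified as OOV.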
In certain cases, the presence of OOV terms is not a problem. For example, consider the case in which the query is expressed in English, while the collection of documents is expressed in Spanish. If the user inputs a proper name, such as “Richard Nixon,” the CLIR system can simply fashion a query that leaves this proper name untranslated. In other cases, however, the source language and the target language have different expressive forms. For example, Hindi (expressed in the Devanagari script) and English (expressed in the Latin script) are characterized by different respective orthographies and phonetic alphabets. In this case, the CLIR system cannot simply pass the original OOV query term as a search term because the documents in the target language cannot be expected to include the unconverted OOV query term.
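The distinction drawn above may be illustrated with a rough check of whether an OOV query term is already written in the script of the target-language documents. The sketch below is a heuristic under stated assumptions: it infers a script label from the first word of each letter's Unicode character name, which is an approximation and not a lookup of the Unicode Script property. The function names scripts_of and can_pass_through are hypothetical.

```python
import unicodedata


def scripts_of(text):
    """Approximate the set of scripts in text using Unicode character names."""
    scripts = set()
    for character in text:
        # Only letters are inspected; combining signs, digits and spaces are skipped.
        if character.isalpha():
            name = unicodedata.name(character, "")
            if name:
                # For example, "LATIN SMALL LETTER A" yields "LATIN".
                scripts.add(name.split()[0])
    return scripts


def can_pass_through(term, target_script):
    """Return True if every letter of the term appears to be in the target script."""
    found = scripts_of(term)
    return bool(found) and found <= {target_script}


latin_term_ok = can_pass_through("Nixon", "LATIN")
devanagari_term_ok = can_pass_through("निक्सन", "LATIN")
```

Under this heuristic, a Latin-script proper name could be passed untranslated to a Latin-script collection, whereas a Devanagari-script term could not.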
One known way to address this problem is by providing a machine transliteration (MT) system. An MT system operates by applying phonetic and orthographic transformations to an input string to produce a string in the orthography of the target language. However, this solution is not fully satisfactory. If the MT system provides a transliteration that is merely close to, but not the same as, the counterpart term used in the documents being searched, the transliteration may fail to locate the desired documents.
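The shortcoming noted above may be illustrated with the following sketch, in which a search based on exact token matching is given a transliteration that is close to, but not identical to, the spelling used in the documents. The transliteration "niksan" is an assumed output that is hard-coded for illustration and is not produced by any actual MT system; the document collection and the function name exact_match_search are likewise hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical target-language (English) document collection.
DOCUMENTS = [
    "richard nixon resigned in 1974",
    "the senate passed the bill",
]


def exact_match_search(documents, term):
    """Return indices of documents containing the term as a whole token."""
    return [index for index, document in enumerate(documents)
            if term in document.split()]


# Assumed MT output for a Devanagari spelling of "Nixon" (hard-coded, not generated).
transliteration = "niksan"

# Similarity between the assumed transliteration and the spelling in the documents.
similarity = SequenceMatcher(None, transliteration, "nixon").ratio()

hits_for_transliteration = exact_match_search(DOCUMENTS, transliteration)
hits_for_document_spelling = exact_match_search(DOCUMENTS, "nixon")
```

Although the assumed transliteration shares several characters with the spelling used in the first document, the exact-match search returns no documents for it, whereas the spelling actually used in the collection retrieves the first document.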
A failure to properly convert OOV query terms may significantly impact the performance of the CLIR system. Since an OOV term is often (although not necessarily) some type of specialized term, such a term may represent a highly informative part of the query, sometimes pinpointing the focus of the user's search objectives. Therefore, without this term, the query may fail to adequately describe the information being sought by the user.