Cross-language information retrieval (CLIR) deals with posing a query in one language and searching document collections in one or more different languages. For example, a user may pose a query in Chinese but retrieve relevant documents originally written in English. Cross-language information retrieval is also called multi-lingual or trans-lingual information retrieval.
In this era of information explosion, especially with the advent of the World Wide Web (WWW or Web), in which anyone can create his or her own website (for example, a blog), finding the information a user seeks among the vast amount of available information remains a challenging task. Finding that information when it is written in a different language is even more daunting. In many instances, the most relevant information is written in a foreign language. However, a language barrier may prevent the user from retrieving such documents using conventional IR tools. For example, if the user sends a query “Iraq war” in English, the conventional IR system does not obtain articles about “Iraq war” written in Chinese, such as are available at http://141.155.90.70:88/files/articles/Iraq.htm, and therefore does not present the specific views communicated in Chinese regarding the Iraq war. On the other hand, if the user sends a query about Mao Zedong in Chinese, the system does not obtain articles written in English, such as at http://www.time.com/time/time100/leaders/profile/mao.html, and therefore does not present the specific views communicated in English regarding Mao Zedong.
In conventional techniques of performing mono-lingual search (information retrieval), the user specifies a set of words, phrases, or sentences (individually and/or collectively hereinafter “term”) that convey the semantics of the information sought. This set, entered in a frame, is called a query, and the query is submitted by, for example, pressing a nearby “Search” button. The conventional system searches the target document set (for example, all or a subset of the documents available on the Web) for documents related to the query, in as complete a manner as possible, may then rank them according to their degree of relevance to the query, and displays the results of the search in ranked order. The primary goal of a conventional IR system is to find as many of the documents that are relevant to the user query as possible (recall) while retrieving as few non-relevant documents as possible (precision). A conventional IR system is shown in FIG. 1.
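By way of a non-limiting illustration, the recall and precision goals described above can be expressed as two ratios over the set of retrieved documents and the set of relevant documents. The following is a minimal sketch; the function name `precision_recall` and the document identifiers are hypothetical and are not taken from any particular IR system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers, for illustration only.
retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
relevant = ["d2", "d4", "d5"]          # what the user actually wanted

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found, so retrieving more documents to raise recall tends to lower precision unless the additional documents are also relevant.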
There are many types of documents on the Web, and the documents are written in different languages. There are documents of different formats (for example, HTML, DOC, PDF), and there are images with captions in different languages. A single query preferably triggers a search of all such resources.
Queries are typically processed so that the IR system can utilize them to perform appropriate searches. If an IR system can translate a query to another language, the system can search the document set for relevant documents in the other language. Similarly, if the IR system can translate the query to another form, then the IR system can augment the search. Often, interactions between the user and the machine are needed to ensure a comprehensive search.
A number of technologies have been proposed for representing documents inside the computer. In addition, there are many other IR techniques dealing with query processing, indexing, ranking, etc. For example, in one conventional technology, documents in a collection are represented through a set of index terms or keywords. Such keywords might be extracted directly from the text of the document or might be specified by a human operator, as is frequently done in library science. An example of an indexing approach is shown in FIG. 2.
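As one possible illustration of representing documents through index terms, the sketch below extracts keywords directly from document text and stores them in an inverted index that maps each term to the documents containing it. This is an assumed, simplified structure and is not necessarily the approach of FIG. 2; the names `extract_terms`, `build_index`, and `search`, and the sample documents, are hypothetical.

```python
import re
from collections import defaultdict


def extract_terms(text):
    """Extract lower-cased index terms directly from the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in extract_terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = extract_terms(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection.
docs = {
    "d1": "The Iraq war began in 2003.",
    "d2": "A history of war and peace.",
    "d3": "Iraq travel guide.",
}
index = build_index(docs)
print(search(index, "Iraq war"))
```

Keywords specified by a human operator could be added to the same index by inserting additional term-to-document entries.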
In cross-lingual search, one or more language translations are performed. For example, a query may be translated from the source language into the target language and a mono-lingual search then performed using the translated query, or the documents may be translated from the target language into the source language and a mono-lingual search then performed using the original query. It has also been proposed to translate both the query and the documents into a certain intermediate representation, so that the two can be compared. The table in FIG. 3 shows a brief summary of current approaches for CLIR.
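The first of these approaches, translating the query and then performing a mono-lingual search, can be sketched as follows. This is a minimal illustration under stated assumptions: the English-to-Spanish dictionary `EN_TO_ES`, the sample documents, and the simple term-overlap ranking are all hypothetical and stand in for whatever translation resource and ranking function a real system would use.

```python
import re

# Hypothetical toy bilingual dictionary (source term -> target terms).
EN_TO_ES = {
    "war": ["guerra"],
    "peace": ["paz"],
    "history": ["historia"],
}


def tokenize(text):
    """Split text into lower-cased word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def translate_query(query, dictionary):
    """Translate a query term by term; untranslatable terms are dropped."""
    translated = []
    for word in tokenize(query):
        translated.extend(dictionary.get(word, []))
    return translated


def monolingual_search(terms, docs):
    """Rank documents by the number of query terms they contain."""
    scores = {}
    for doc_id, text in docs.items():
        tokens = set(tokenize(text))
        score = sum(1 for term in terms if term in tokens)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical target-language (Spanish) collection.
docs_es = {
    "es1": "Historia de la guerra",
    "es2": "La guerra y la paz",
    "es3": "Recetas de cocina",
}

translated = translate_query("history of war", EN_TO_ES)
results = monolingual_search(translated, docs_es)
print(translated, results)
```

The document-translation approach would instead translate `docs_es` into the source language once and search the result with the original query.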
U.S. Pat. No. 5,301,109 entitled “Computerized cross-language document retrieval using latent semantic indexing” proposes a corpus-based, intermediate representation approach to CLIR. U.S. Pat. No. 5,867,811 entitled “Method, an apparatus, a system, a storage device, and a computer readable medium using a bilingual database including aligned corpora” also proposes a corpus-based approach.
U.S. Pat. No. 6,321,191 entitled “Related sentence retrieval system having a plurality of cross-lingual retrieving units that pairs similar sentences based on extracted independent words”, proposes a technique for retrieving related sentences out of n cross-lingual retrieval systems. Each of the n systems comprises a pair data storing unit that stores multiple data pairs, each pair consisting of data in two languages having the same meaning.
Both monolingual information retrieval and cross-lingual information retrieval face the difficulty of understanding the user's true intention when the user queries in natural language. Information retrieval is not data retrieval, which consists mainly of determining which documents in a collection contain the keywords in the user query. The user of an IR system is concerned more with retrieving information about a subject than with retrieving data which matches a specified query. The user simply expresses the information sought in natural language. An IR system therefore ideally embodies some understanding of the natural language. For example, if the user queries a given term, documents that contain an equivalent expression but not the queried term itself may not be displayed to the user.
To ensure the completeness of the search result, a query can be expanded into a group of synonyms, for example, (cell phone, mobile phone, cellular phone) or a group of equivalent forms of the name “Bush”, etc.
The technique of using a thesaurus (where groups of synonyms are stored) in CLIR was proposed by G. Salton, “Automatic Processing of Foreign Language Documents,” (1970) Journal of the American Society for Information Science. Salton reported experimenting with a method for automatic retrieval of documents in one language in response to queries in another, using a vector representation and search technique in conjunction with a manually created dual-language thesaurus. The results for test samples of abstracts and queries were promising. However, creating an adequate multi-language thesaurus is difficult and requires considerable intellectual labor.
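The general idea of combining a vector representation with a dual-language thesaurus can be illustrated with the simplified sketch below. It is not a reconstruction of Salton's system: the thesaurus entries, the English/Spanish language pair, the concept identifiers such as `C_WAR`, and the use of raw counts with cosine similarity are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical manually created dual-language thesaurus: terms from
# either language are mapped to a shared, language-neutral concept id.
THESAURUS = {
    "war": "C_WAR", "guerra": "C_WAR",
    "peace": "C_PEACE", "paz": "C_PEACE",
    "history": "C_HISTORY", "historia": "C_HISTORY",
}


def concept_vector(text):
    """Represent a text as counts of thesaurus concepts it mentions."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(THESAURUS[t] for t in tokens if t in THESAURUS)


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = concept_vector("history of war")          # English query
doc_match = concept_vector("Historia de la guerra")  # Spanish document
doc_other = concept_vector("Un tratado de paz")      # Spanish document

print(cosine(query, doc_match), cosine(query, doc_other))
```

Because both the query and the documents are mapped into the same concept space, no explicit translation step is needed at search time; the cost is moved into building and maintaining the thesaurus, which is the labor-intensive part noted above.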
The following example explains why it is desirable to expand one query into a group of synonyms. To search for documents regarding Cross-lingual Information Retrieval, multiple synonyms, such as “Trans-lingual Information Retrieval” and “Multi-lingual Information Retrieval”, may be substituted in the search; “Information Retrieval” may be substituted with “Search” or “communication”; and “-lingual” may be substituted with “Language”. The phrase in this example can be expanded to at least 12 synonyms or related terms. In addition, the search may be guided by specification of the relevant field of technology such as “search engine”, “machine translation”, etc.
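The expansion in this example can be sketched as a combination of interchangeable parts. The split of the phrase into three slots and the way the slots are joined are assumptions made for illustration; combining every alternative in this particular way yields 18 variants, which is consistent with the “at least 12” stated above.

```python
from itertools import product

# Alternatives for each part of "Cross-lingual Information Retrieval",
# taken from the substitutions listed in the text.
prefixes = ["Cross", "Trans", "Multi"]
middles = ["-lingual", " Language"]
heads = ["Information Retrieval", "Search", "communication"]

# Every combination of one prefix, one middle, and one head.
variants = [f"{p}{m} {h}" for p, m, h in product(prefixes, middles, heads)]

print(len(variants))
for variant in variants[:4]:
    print(variant)
```

Each variant could then be submitted as a separate query, with the results merged before ranking.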
For example, a user may query the Spanish term “conjeturar sin fundamento” in a search engine, and may retrieve results containing or related to the exact query terms. However, in order to search for documents regarding “conjeturar sin fundamento” and find as many of the documents that are relevant to the user query as possible, multiple synonyms, such as “adivinar a ciegas” and “hacer suposiciones gratuitas”, may be substituted in the search. Automatically generating a set of synonyms to trigger multiple searches based on one query term nevertheless remains a challenging task.
The Software Department of the Institute of Computing Technology in China developed a Question and Answering Search Engine System about Tourism in China, and used a thesaurus to expand the user query into multiple synonyms or related words. However, the thesaurus was manually developed and maintained by human information specialists.
Dictionary-based approaches generally have a major problem with OOVs (out-of-vocabulary words), such as person names, company or organization names, place names, brand names, etc. Conventional CLIR approaches based on a static dictionary cannot overcome this difficulty. In addition, a thesaurus may expand the user query into multiple synonyms or related words, but it cannot minimize the number of non-relevant documents retrieved, because it cannot specify the relevant context for the search to narrow the scope of the retrieved results.
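The OOV problem can be made concrete with a small sketch. The static dictionary `STATIC_DICT`, the helper `split_oov`, and the brand name “acmecorp” are hypothetical and serve only to show how a term absent from a fixed dictionary is left without a translation.

```python
# Hypothetical static English-to-Spanish dictionary.
STATIC_DICT = {"war": "guerra", "history": "historia"}


def split_oov(query_terms, dictionary):
    """Separate query terms into translatable terms and OOV terms."""
    known, oov = {}, []
    for term in query_terms:
        if term in dictionary:
            known[term] = dictionary[term]
        else:
            oov.append(term)  # no entry: the term cannot be translated
    return known, oov


# "acmecorp" stands in for a company or brand name missing from the dictionary.
known, oov = split_oov(["history", "war", "acmecorp"], STATIC_DICT)
print(known, oov)
```

Because the dictionary is static, the OOV term stays untranslated for every future query unless the dictionary is updated by hand.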
U.S. Pat. No. 6,604,101, entitled “Method and system for translingual translation of query and search and retrieval of multilingual information on a computer network”, proposes a “restricted/controlled query” method. After the user inputs a query in the source language, the query is standardized or regulated in a “dialectal controller”. If no standardized form of the user's query is found, the user is prompted to describe the information sought in another way. The standardized query words are then translated into target-language terms, which are used to search the target-language document set. U.S. Pat. No. 6,604,101 does not disclose or suggest a multilingual, dynamically evolving dictionary which stores synonyms or related words, or similar sayings.
U.S. Patent Application Publication No. 20040139107 A1, entitled “Dynamically updating a search engine's knowledge and process database by tracking and saving user interactions”, proposes supplementing a query with information from tracking user interactions and saving the information. However, U.S. Patent Application Publication No. 20040139107 A1 does not disclose or suggest updating a multi-lingual knowledge base according to voting by multi-lingual Web users.
U.S. Patent Application Publication No. 20040139106 A1, entitled “Search engine with natural language-based robust parsing of user query and relevance feedback learning”, proposes an approach to accommodate the user through interaction and feedback to and from the user. However, U.S. Patent Application Publication No. 20040139106 A1 does not disclose or suggest a multi-lingual knowledge base according to voting by multi-lingual Web users.
U.S. Pat. No. 5,384,701, entitled “Language translation system”, proposes a system for translating phrases from a first language into a second language. The system includes a store holding a collection of phrases in the second language. The phrases in the second language are prepared in advance and held in the store. For example, the second-language equivalent of a given first-language phrase was stored as “How do you do?” However, U.S. Pat. No. 5,384,701 does not disclose or suggest that the knowledge base can be dynamically updated through contribution of Web users.
There remains a need for improvements to cross-language information retrieval techniques.