The present invention is related to the searching of information repositories such as databases, and in particular to a facility for generating cross-lingual queries.
Every day more information becomes available electronically over networks. This growth is far from linear; it is driven by numerous factors such as the increasing accessibility of more information media, the growing power of computers and networks, and ever more data-intensive applications.
This gold mine of data, however, suffers from a lack of structure and consistency: the Web is unstructured and uncontrolled by nature, whereas structured databases use a widening variety of formats, either standardized or proprietary.
When accessing heterogeneous legacy databases on Intranets or while querying multiple information sources on the Internet, the end-user only wants to have a simple and straightforward point of access.
With classical tools, finding the right information to suit each user's needs has become a problem for anything but the easiest of searches. The user must master different protocols, different database access methods, and different document formats, and must then use the information from one search to manually drive another. Thus, there is a need for information retrieval systems and approaches for easily interfacing into multiple information sources.
An exemplary information retrieval architecture is described in the article entitled "System Components For Embedded Information Retrieval From Multiple Disparate Information Sources", Ramana B. Rao, Daniel M. Russell, and Jock D. Mackinlay, Proceedings of 1993 ACM Symposium on User Interface Software and Technology, Atlanta, Ga., Nov. 1993, ACM SIGGRAPH and SIGCHI. The architecture incorporates an intermediary server which mediates access requests between an information access client (i.e. the user) and various information sources. Thus, the user only needs to interface with the information access client in order to retrieve the information from multiple information sources.
Another characteristic of information on the Web is that it can be in any language. Generally, a query only retrieves items that are in the same language as the query. When information in a different language is found, it is typically because the information contains a "word" that matches a search term. For example, a search for information on a famous person or event may return documents in multiple languages.
However, it would be desirable to obtain documents in different languages deliberately. Take, for example, a topic such as "trees". It would be desirable to translate the search term "trees" into the various languages in which relevant documents may occur. A search may then retrieve information in those languages.
A dictionary-based method for cross-lingual information retrieval is described by Lisa Ballesteros and Bruce Croft, "Dictionary Methods for Cross-Lingual Information Retrieval", Lecture Notes in Computer Science 1134, ISSN 0302-9743 (1996). The paper describes experiments which analyze the factors that affect dictionary-based methods for cross-lingual retrieval and presents methods that dramatically reduce the errors such an approach usually makes. The paper defines cross-lingual information retrieval as the ability to query in one language but perform retrieval across languages.
The invention relates to the searching of network accessible distributed databases, such as those found on the Internet. This invention enables a user to generate a query using search terms and expressions in their native language and to specify that the search results may include documents in other languages. With the query, the user indicates the target language in which results will be accepted. The system then processes the query using computational linguistic techniques and verifies the accuracy of the results received with respect to their language and the linguistic structure of the initial search terms. In a multi-word expression all combinations are verified automatically.
The method of the invention comprises the following steps:
1. split each multi-word search expression among the search terms into elementary words and suppress stopwords (and, the, etc.);
For each language in which documents will be retrieved:
2. determine for each resulting elementary word the stemmed translations,
2a. translate the elementary word into the target language; and
2b. stem the translated word;
3. search for documents containing one of the resulting combinations of stemmed translations;
4. verify for each found document that the stemmed translations appear in the correct linguistic structure so that inappropriate results can be eliminated.
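By way of illustration only, the steps above can be sketched in Python. The sketch is not the implementation of the invention: the bilingual dictionary, the stopword list, the suffix-stripping stemmer, the sample documents, and every function name are hypothetical stand-ins. In particular, the verification of step 4 is approximated here by requiring that all stemmed translations of one combination occur within a small window of adjacent words, which is only one possible reading of "correct linguistic structure".

```python
import re
from itertools import product

# Hypothetical toy resources; a real system would use full lexicons and stemmers.
STOPWORDS = {"and", "the", "of", "a"}
DICTIONARIES = {  # target language -> {source word: candidate translations}
    "fr": {"red": ["rouge", "roux"], "trees": ["arbres"]},
}


def stem(word):
    """Crude stand-in for a stemmer: strip a plural 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def elementary_words(expression):
    """Step 1: split the expression into words and suppress stopwords."""
    words = re.findall(r"\w+", expression.lower())
    return [w for w in words if w not in STOPWORDS]


def combinations(expression, lang):
    """Steps 2a and 2b: translate each word, stem it, and combine the options."""
    options = [
        sorted({stem(t) for t in DICTIONARIES[lang].get(w, [])})
        for w in elementary_words(expression)
    ]
    if not options:  # nothing left after stopword removal
        return []
    return list(product(*options))


def doc_stems(text):
    """Stem every word of a document so it can be compared with the query."""
    return [stem(t) for t in re.findall(r"\w+", text.lower())]


def search(expression, lang, documents):
    """Step 3: keep documents that contain every stem of at least one combination."""
    combos = combinations(expression, lang)
    return [d for d in documents
            if any(set(c) <= set(doc_stems(d)) for c in combos)]


def verify(combo, text, window=3):
    """Step 4 (approximation): all stems must fall inside one window of words."""
    stems = doc_stems(text)
    return any(set(combo) <= set(stems[i:i + window]) for i in range(len(stems)))


def cross_lingual_search(expression, lang, documents, window=3):
    """Run steps 1 to 4 for one target language."""
    combos = combinations(expression, lang)
    candidates = search(expression, lang, documents)
    return [d for d in candidates if any(verify(c, d, window) for c in combos)]


DOCS = [
    "Les arbres rouges du parc",
    "Un arbre tombe. Plus loin, une maison au toit rouge brille.",
    "The red trees of the park",
]
results = cross_lingual_search("the red trees", "fr", DOCS)
```

In this sketch, the second sample document passes the search of step 3 because it contains both stems, but it is eliminated in step 4 because the two words are unrelated and far apart. To cover several target languages, the same call would simply be repeated for each language key present in the hypothetical `DICTIONARIES` table.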