The present invention relates to obtaining query data for information retrieval.
Most multilingual speakers can read some languages more easily than they can generate correct utterances and written expressions in those languages. When searching for information, existing information retrieval systems require that the user formulate a query in the language of the documents (the target language, or L2) and, normally, physically type in the query. Thus, besides requiring a query formulation step, such systems do not allow users to indicate their search interests in their native language (L1).
Ballesteros, L., and Croft, W. B., “Dictionary Methods for Cross-Lingual Information Retrieval”, in Proceedings of the 7th International DEXA Conference on Database and Expert Systems, 1996, pp. 791-801, disclose techniques in which a user can query in one language but perform retrieval across languages. Base queries drawn from a list of text retrieval topics were translated using bilingual, machine-readable dictionaries (MRDs). Pre-translation and post-translation feedback techniques were used to improve retrieval effectiveness of the dictionary translations.
EP-A-725,353 discloses a document retrieval and display system which retrieves source documents in different languages from servers linked by a communication network, translates the retrieved source documents as necessary, stores the translated documents, and displays the source documents and translated documents at a client device connected to the communication network.
U.S. Pat. No. 5,748,805 discloses a technique that provides translations for selected words in a source document. An undecoded document image is segmented into image units, and significant image units such as words are identified based on image characteristics or hand markings. For example, a user could mark difficult or unknown words in a document. The significant image units are then decoded by optical character recognition (OCR) techniques, and the decoded words can then be used to access translations in a database. A copy of the document is then printed with translations in the margins opposite the significant words.
The invention addresses a problem that arises with information retrieval where a user has a document in one language (L1) and wishes to access pertinent documents or other information written in a second language (L2) and accessible through a query-based system. Specifically, the invention addresses the problem of generating a query that includes expressions in the second language L2 without translating or retyping the document in the first language L1, referred to herein as the document-based query problem. The document-based query problem arises, for example, where the user cannot translate the document from L1 to L2, where the user is unable to type or prefers not to type, where the user does not have access to a machine with a keyboard on which to type, or where the user does not know how to generate a query that includes expressions in L2.
The invention alleviates the document-based query problem by providing a new technique that scans the document and uses the resulting text image data. The new technique performs automatic recognition to obtain text code data with a series of element codes defining expressions in the first language. The new technique performs automatic translation on a version of the text code data to obtain translation data indicating counterpart expressions in the second language. The new technique uses the counterpart expressions in the second language to automatically obtain query data defining a query for use in information retrieval.
The new technique can be implemented with a document that is manually marked to indicate a segment of the text, and text image data defining the indicated segment can be extracted from image data defining the document.
Automatic recognition can be implemented with optical character recognition (OCR), and automatic language identification can be performed to identify the probable predominant language so that language-specific OCR can be performed. The OCR results can also be presented to the user, who can interactively modify them to obtain the text code data.
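The language identification step can be illustrated with a minimal sketch. It assumes that some preliminary text is already available (for example from a first, language-independent recognition pass), and it guesses the predominant language by counting common function words. The stop-word lists, the function name `guess_language`, and the language codes are illustrative assumptions, not part of the disclosed technique.

```python
# Toy stop-word lists (assumed for illustration); a practical system would
# use much larger language profiles or character n-gram statistics.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that"},
    "fr": {"le", "la", "les", "et", "de", "des", "est", "que"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "von"},
}


def guess_language(text, default="en"):
    """Return the code of the language whose stop words occur most often."""
    words = text.lower().split()
    scores = {
        lang: sum(1 for w in words if w in stopwords)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when no stop word of any language was seen.
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(guess_language("le chat est sur la table"))
```

The returned code could then be used to select a language-specific OCR configuration for a second recognition pass.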
Automatic translation can be implemented with a translation dictionary. The text code data can be tokenized to obtain token data; the token data can be disambiguated to obtain disambiguated data with parts of speech for words; the disambiguated data can be lemmatized to obtain lemmatized data indicating, for each of a set of words, either the word or a lemma for the word; and the lemmatized data can be translated. Translation can be done by looking up the words and lemmas in a bilingual translation dictionary.
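The sequence of tokenization, disambiguation, lemmatization, and dictionary lookup described above can be sketched as follows. This is a toy illustration under stated assumptions: the French-to-English tables `POS`, `LEMMAS`, and `BILINGUAL` are hypothetical stand-ins for a real part-of-speech disambiguator, lemmatizer, and bilingual dictionary, and restricting lookup to nouns, adjectives, and verbs is a design choice of this sketch rather than something the text requires.

```python
import re

# Hypothetical toy resources (French -> English), assumed for illustration.
POS = {"les": "DET", "chats": "NOUN", "noirs": "ADJ", "mangent": "VERB"}
LEMMAS = {"chats": "chat", "noirs": "noir", "mangent": "manger"}
BILINGUAL = {"chat": ["cat"], "noir": ["black", "dark"], "manger": ["eat"]}
CONTENT_TAGS = {"NOUN", "ADJ", "VERB"}


def tokenize(text):
    """Split text code data into lower-cased word tokens."""
    return re.findall(r"\w+", text.lower())


def disambiguate(tokens):
    """Attach a part of speech to each token (table lookup stand-in)."""
    return [(tok, POS.get(tok, "NOUN")) for tok in tokens]


def lemmatize(tagged):
    """Keep content words; give the lemma if known, else the word itself."""
    return [(LEMMAS.get(w, w), pos) for w, pos in tagged if pos in CONTENT_TAGS]


def translate(lemmatized):
    """Look up each word or lemma in the bilingual dictionary."""
    return [(lemma, BILINGUAL.get(lemma, [])) for lemma, _ in lemmatized]


def translate_text(text):
    return translate(lemmatize(disambiguate(tokenize(text))))


if __name__ == "__main__":
    print(translate_text("Les chats noirs mangent"))
```

A word with several dictionary entries, such as "noir" here, yields several counterpart expressions; the sketch keeps all of them rather than choosing one.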
The query data can define the query in a format suitable for an information retrieval engine. The query data can then be provided to the information retrieval engine.
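As a rough illustration of formatting query data, the sketch below turns counterpart expressions into a Boolean query string. The AND/OR syntax, the quoting of multi-word expressions, and the function name `build_query` are assumptions made for this example; an actual information retrieval engine defines its own query format.

```python
def build_query(translations):
    """Build a Boolean query from (source word, [L2 candidates]) pairs.

    Alternative translations of one source word are joined with OR;
    the resulting groups for different source words are joined with AND.
    """
    clauses = []
    for _source, candidates in translations:
        if not candidates:
            continue  # no counterpart expression found: skip the word
        # Quote multi-word expressions so they are treated as phrases.
        terms = ['"%s"' % c if " " in c else c for c in candidates]
        if len(terms) == 1:
            clauses.append(terms[0])
        else:
            clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


if __name__ == "__main__":
    sample = [("chat", ["cat"]), ("noir", ["black", "dark"]), ("xyz", [])]
    print(build_query(sample))
```

The resulting string could then be passed to whatever interface the chosen retrieval engine exposes.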
The new technique can also be implemented in a system that includes a scanning device and a processor connected for receiving image data from the scanning device. After receiving an image of a segment of text in the first language from a scanned document, the processor performs automatic recognition to obtain text code data, performs automatic translation on a version of the text code data to obtain translation data indicating expressions in the second language, and uses the expressions to automatically obtain query data defining a query for use in information retrieval.
An advantage of the invention is that it eliminates the need to know how an information interest (or query) should be formulated in the target language, as well as the need to devise and type in the query. In certain embodiments of the invention, the user need only designate a portion of an existing document of interest, e.g. a hardcopy document.
The following description, the drawings, and the claims further set forth these and other aspects, objects, features, and advantages of the invention.