The present invention relates to data processing, and more particularly to techniques for searching for information in a text database or corpus.
Most of the techniques in use to retrieve a piece of information in a text corpus are based on substring search (also known as full-text search). Because this basic string search mechanism is weak when the user wants to catch more than a simple sequence of characters, various techniques have been developed by data providers to enhance the substring matching: wildcards, regular expressions, Boolean operators, proximity factors (e.g. words must be in the same sentence, or no more than N words may occur between two words) and stemming.
Existing techniques often try to achieve the same goal: to allow the user to better express the variability of the natural language in which the string expression is to be searched, so as not to miss any place where this expression appears.
However, known techniques suffer from several drawbacks: the end user has to learn the query language proposed by the search engine; no two search engines have the same query language; if the user doesn't think of all the possible variations of the searched expression, he can miss some relevant documents; and, conversely, if the search expression is too "loose", many irrelevant documents will be retrieved, generating noise.
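The trade-off described above can be illustrated with a minimal sketch, written here in Python purely for illustration. The helper names `substring_hits` and `proximity_pattern`, the sample sentence and the crude suffix wildcard are assumptions made for this example and are not taken from any particular search engine. The sketch shows a plain substring search missing an inflected, interrupted occurrence, and the kind of hand-written pattern a user would otherwise have to compose.

```python
import re

SAMPLE = "The engines quickly searched the index."

def substring_hits(query, text):
    """Plain full-text search: offsets where the exact character sequence occurs."""
    q, t = query.lower(), text.lower()
    return [i for i in range(len(t)) if t.startswith(q, i)]

def proximity_pattern(word_a, word_b, max_words_between):
    """Hand-written enhancement: a crude suffix wildcard on each word plus a
    proximity limit of at most `max_words_between` intervening words."""
    return re.compile(
        r"\b%s\w*(?:\W+\w+){0,%d}?\W+%s\w*"
        % (re.escape(word_a), max_words_between, re.escape(word_b)),
        re.IGNORECASE,
    )

# The exact sequence "engine search" never occurs, so substring search finds nothing.
print(substring_hits("engine search", SAMPLE))

# The hand-written pattern does find the variation, but only because the user
# anticipated both the suffixes and the inserted word.
match = proximity_pattern("engine", "search", 1).search(SAMPLE)
print(match.group(0) if match else None)
```

Note that the suffix wildcard is also "loose" in the sense discussed above: it would accept unrelated words that merely share a prefix with the query terms.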
The linguistic search techniques according to the present invention overcome at least some of the above-mentioned problems. They rely both on linguistic tools (such as a tokeniser, a morphological analyser and a disambiguator) and on the generation of complex regular expressions to match against the text database.
This mechanism has the advantage over a basic full-text search engine that the end user doesn't need to learn an esoteric query language. He just has to type the multiword expression he is looking for in natural language.
A further advantage is that the retrieved documents will be much more relevant to the query from a linguistic point of view (although it doesn't ensure that all relevant documents will be retrieved from the point of view of the meaning).
A further advantage is that many variations will be captured by the linguistic processing. As a consequence, even a user who is not familiar with the language in which the searched documents are written doesn't have to know about the linguistic variation that might occur.
The linguistic search techniques according to the invention provide a new way to search for information in a text database. They enable users to find portions of a text which match multiword expressions given by the user. Matches include possible variations that are consistent with the initial criteria from a linguistic point of view, including simple inflections like plural/singular, masculine/feminine or conjugated verbs, and even more complex variations like the insertion of additional adjectives, adverbs, etc. in between the words specified by the user. This technique can complement conventional full-text search engines by reducing the number of retrieved documents that are inconsistent with the query.
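By way of illustration only, the general idea of turning a multiword expression into a regular expression that tolerates inflections and inserted words might be sketched as follows in Python. This is a simplified sketch under stated assumptions, not the implementation of the invention: the tables `LEMMA_OF` and `FORMS_OF` are a hypothetical toy lexicon standing in for a morphological analyser and disambiguator, the function names are invented for this example, and the gap between words accepts any word rather than only adjectives or adverbs.

```python
import re

# Hypothetical toy lexicon (assumption): a real system would obtain lemmas and
# inflected forms from a morphological analyser and disambiguator.
LEMMA_OF = {"dog": "dog", "dogs": "dog",
            "bark": "bark", "barks": "bark", "barked": "bark", "barking": "bark"}
FORMS_OF = {"dog": ["dog", "dogs"],
            "bark": ["bark", "barks", "barked", "barking"]}

def tokenise(expression):
    """Very simple tokeniser: lower-cased runs of word characters."""
    return re.findall(r"\w+", expression.lower())

def build_pattern(expression, max_gap=2):
    """Build a regular expression matching every known inflection of each query
    word, with at most `max_gap` extra words inserted between consecutive words."""
    parts = []
    for token in tokenise(expression):
        lemma = LEMMA_OF.get(token, token)
        forms = FORMS_OF.get(lemma, [token])
        # Longest forms first so that "dogs" is tried before "dog".
        ordered = sorted(forms, key=len, reverse=True)
        parts.append("(?:" + "|".join(re.escape(f) for f in ordered) + ")")
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    return re.compile(r"\b" + gap.join(parts) + r"\b", re.IGNORECASE)

def search(expression, text, max_gap=2):
    """Return the portions of `text` that match the expression."""
    return [m.group(0) for m in build_pattern(expression, max_gap).finditer(text)]

if __name__ == "__main__":
    print(search("dog barks", "The dogs loudly barked at night. A dog barks."))
```

The user types only the expression "dog barks"; the plural, the past tense and the inserted adverb are handled by the generated pattern rather than by a query language. Because the gap in this sketch is purely word-count based, it can also span a sentence boundary, which a fuller linguistic treatment would be expected to restrict.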