Information retrieval and query expansion systems are becoming more and more important and ubiquitous. The demands on such systems are growing steadily greater. A search corpus may contain millions of words which may be spread over hundreds of thousands of documents. An index suitable for efficient retrieval of information from such a corpus may contain thousands of search terms.
Indices may be produced through free indexing, where terms are automatically extracted from corpora without referring to a controlled list. Alternatively, controlled indexing may be employed using available terminological data and other resources, such as thesauri, ontologies, or key-word lists. The quality of the input index list affects the quality of the results. It is known that many free indexing techniques suffer from overgeneration, even though syntactic and semantic filters are applied. The concomitant disadvantage of using a controlled index list is that it must be manually produced, a time-consuming and expensive task.
Indices of single-word search terms are useful for corpora of a relatively small size, but single-word search terms become inadequate as corpora become larger. Single-word search terms can be quite ambiguous and are unable both to completely cover and to accurately define a large corpus. Moreover, concept-based searching is becoming more and more popular, and many concepts for which a user might like to search are difficult or impossible to define using single-word terms.
As corpora to be searched grow, the use of multi-word search terms becomes more useful and more important. A single-word search term may not sufficiently limit the field of search, with the result that a search may retrieve too many results. Moreover, as the size of corpora increases, single-word terms appear in more and more unrelated portions of a corpus, so that a single-word search is likely to retrieve numerous results having nothing to do with the desired topic.
The use of multi-word search terms leads to greater precision in searching. Through the use of multi-word searches, it is possible to restrict the number of results retrieved by a search and to increase the likelihood that the results retrieved will be relevant to the search topic being sought. However, because of the different permutations in which multi-word search terms can occur, indices consisting of multi-word terms can become quite large. Moreover, because the same meaning or concept can be expressed through numerous different combinations of words, an index may contain numerous variants of multi-word terms.
In order to increase the accuracy of an index of multi-word terms and to decrease the work involved in searching the index, it is advantageous to reduce the number of multi-word term variants, that is, the number of multi-word search terms which have the same meaning, and to join all such variants under a single index. As the size of databases continues to increase, the need to reduce the size of indices increases. In conflating term variants under the same index, a system can be built more efficiently, since the term remaining after reduction is able to retrieve all documents which could be retrieved by the original terms.
In conflation, a reference term is called the `original term.` It is convenient to consider variants as belonging to one of two types. A type 1 variant results from the inflection of individual words and from modification of the syntactic structure of the original term. For example, `diseases of the lower urinary tract` is a type 1 variant of `urinary tract disease.`
A type 2 variant differs from a type 1 variant under the following condition: at least one of the content words of the original term is not found inflected in the variant, but is transformed into another word derived from the same morphological stem. Thus, `translational or transcriptional inhibition` is a type 2 variant of `translation inhibitor` which is not a type 1 variant because both content words of the original term have undergone derivational morphologic changes.
A more precise definition is as follows:
A type 1 or type 2 variant of a multiword term is a textual utterance such that:
each content word of the original term (type 1) or another word deriving from the same morphologic stem (type 2) is found in the variant, and
the variant can be substituted for the original term in a task of information access.
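The first condition of this definition can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the lexicon, the empty-word list, and the function names are hypothetical, and the stems are written by hand rather than produced by any real morphological analyzer. The second condition (substitutability in a task of information access) is not something this sketch attempts to test.

```python
# Hypothetical toy lexicon mapping a word to (lemma, derivational stem).
# The entries are hand-written for illustration only.
LEXICON = {
    "disease": ("disease", "diseas"),
    "diseases": ("disease", "diseas"),
    "urinary": ("urinary", "urin"),
    "tract": ("tract", "tract"),
    "lower": ("lower", "low"),
    "translation": ("translation", "translat"),
    "translational": ("translational", "translat"),
    "transcriptional": ("transcriptional", "transcript"),
    "inhibitor": ("inhibitor", "inhibit"),
    "inhibition": ("inhibition", "inhibit"),
}

# Hypothetical list of empty (non-content) words.
EMPTY_WORDS = {"of", "the", "or", "and", "in"}


def content_words(text):
    """Return the lowercased words of text that are not empty words."""
    return [w for w in text.lower().split() if w not in EMPTY_WORDS]


def analyze(word):
    """Look up (lemma, stem); unknown words stand for themselves."""
    return LEXICON.get(word, (word, word))


def classify_variant(original, variant):
    """Return 'type 1', 'type 2', or None if a content word is missing."""
    analyses = [analyze(w) for w in content_words(variant)]
    variant_lemmas = {lemma for lemma, _ in analyses}
    variant_stems = {stem for _, stem in analyses}
    kind = "type 1"
    for word in content_words(original):
        lemma, stem = analyze(word)
        if lemma in variant_lemmas:
            continue  # found, possibly inflected
        if stem in variant_stems:
            kind = "type 2"  # found only through a derivational relative
        else:
            return None  # content word absent from the candidate
    return kind


print(classify_variant("urinary tract disease",
                       "diseases of the lower urinary tract"))
print(classify_variant("translation inhibitor",
                       "translational or transcriptional inhibition"))
```

With this toy lexicon, the first call classifies the candidate as a type 1 variant and the second as a type 2 variant, matching the two examples given above.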
Since variants can be substituted for the original term, it is useful to be able to conflate variants so as to reduce the number of variants (all of which can be substituted for the original term) which must be dealt with.
Several techniques for reducing terms exist in the prior art. The main trend for the conflation of multi-word terms in information retrieval relies on a combination of three non-linguistic methods: empty word deletion, stemming, and grouping of single words into multi-word phrases based on co-occurrence information. Due to their lack of linguistic knowledge, stemming and lexical lookup conflate occurrences without conceptual relation.
Stemming reduces words to a stem, which is thought to be identical for all words that are linguistically, and often conceptually, related. For example, `magnesia`, `magnesium`, `magnet`, `magnetic`, etc., can be conflated by a stemming algorithm and reduced to the common stem `magnes`, thus grouping together words of different meanings.
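The over-conflation described here can be reproduced with a small Python sketch. The suffix list and the final `et` to `es` rewrite are hypothetical toy rules chosen only so that the four example words collapse to `magnes`; they are not the rules of any particular published stemmer.

```python
from collections import defaultdict

# Hypothetical toy suffix list, chosen to fit the example words only.
SUFFIXES = ("ium", "ia", "ic")


def toy_stem(word):
    """Strip one listed suffix, then rewrite a final 'et' as 'es'."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    if word.endswith("et"):
        word = word[:-2] + "es"
    return word


def conflation_classes(words):
    """Group words by the stem that toy_stem assigns to them."""
    classes = defaultdict(list)
    for word in words:
        classes[toy_stem(word)].append(word)
    return dict(classes)


print(conflation_classes(["magnesia", "magnesium", "magnet", "magnetic"]))
```

All four words fall into a single class keyed by `magnes`, even though the magnesium-related and magnet-related words have no conceptual relation.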
In a medical thesaurus, lexical lookup conflates `liver` and `hepatic` or `renal` and `kidney`.
Prior-art linguistic techniques for reducing multi-word term variants have focused on syntactic transformations. A technique has been developed and implemented for the simplification of syntactic variants in English. Prior-art techniques for morphological analysis have been mainly applied to natural language processing tasks. These techniques focus mainly on inflectional morphology, or derivational morphology for semantic ambiguities. Some studies on automatic analysis of derivational morphology have also been performed. There also exists work on automatic analysis of inflectional morphology and part of speech tagging through the combination of linguistic and statistical knowledge.
In the prior art, morphology has been applied only to single word terms, or has been used in natural language processing applications not involving information retrieval. Conflation of multi-word terms has typically been performed using noisy and inaccurate methods, or has focused on syntactic variants.
In order to conflate multiword terms, two steps must be taken. First, the morphological variants of single words composing terms must be conflated. Second, the whole utterances of multiword term variants must be related to the original terms.
There are several methods for conflating single word terms. The coarsest and easiest one is truncation, a nonlinguistic method. Truncation removes the endings of the words (generally a fixed length of n characters). A more precise method is morphological analysis, which is knowledge-expensive. It parses a word and produces a constituent structure whose leaves are the stem and the affixes. Intermediate in complexity between truncation and morphological analysis is stemming, which removes endings according to a reference list and may change the resulting strings with recoding functions. The recoding functions account for allomorphic alternations between the different derivatives within a derivational family. For example, a recoding function may transform a final `rpt` into `rb` in order to conflate `absorption` and `absorb`. The string resulting from a stemming procedure is called a stem. It is not necessarily equal to the linguistic root, but will serve as a minimal and hopefully unambiguous denotation of the term.
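The difference between truncation and stemming with recoding can be sketched in Python as follows. This is a minimal illustration, not a complete stemmer: the ending list is hypothetical, truncation is assumed to mean dropping the last n characters, and only the `rpt` to `rb` recoding comes from the text above.

```python
def truncate(word, n):
    """Truncation: drop the last n characters (assumed interpretation)."""
    return word[:-n] if len(word) > n else word


# Hypothetical reference list of endings, longest first.
ENDINGS = ("ion", "ing", "s")

# Recoding rules applied to the end of the stripped string.
# The single rule here is the one mentioned in the text.
RECODINGS = (("rpt", "rb"),)


def stem(word):
    """Remove one listed ending, then apply the first matching recoding."""
    for ending in ENDINGS:
        # Keep at least three characters so short words are left alone.
        if word.endswith(ending) and len(word) > len(ending) + 2:
            word = word[: -len(ending)]
            break
    for old, new in RECODINGS:
        if word.endswith(old):
            word = word[: -len(old)] + new
            break
    return word


print(truncate("absorption", 3), truncate("absorb", 3))
print(stem("absorption"), stem("absorb"))
```

Truncating three characters leaves `absorpt` and `abs`, which do not match, whereas ending removal followed by the recoding yields `absorb` for both words.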
The two main errors that can occur while stemming are understemming and overstemming. Overstemming is the reduction to the same stem of words having similar portions but differing meanings: for example, `century` and `center` to `cent`. Understemming is the reduction of words to different stems, when the reduction should be to the same stem: for example, `acquiring` to `acquir` and `acquisition` to `acquis`. Correct linguistic stemming is not necessarily semantically relevant and, furthermore, semantically correct stemming may be useless or even detrimental to information retrieval. Stemming must therefore be evaluated with respect to the task of information access. The constitution of derivational links is connected to the issue of word sense disambiguation.
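One way to make these two error types concrete is to compare a stemmer's output against hand-built groups of words that should share a stem. The Python sketch below does this for the examples above. The suffix rules, the concept groups, and the function names are hypothetical and serve only to reproduce the stems quoted in the text.

```python
from itertools import combinations

# Hypothetical toy suffix rules that reproduce the stems quoted in the text.
SUFFIXES = ("ition", "ing", "ury", "er")


def toy_stem(word):
    """Strip the first listed suffix that matches the end of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def stemming_errors(stem, concept_groups):
    """Return (overstemmed_pairs, understemmed_pairs) for a stem function."""
    concept_of = {word: index
                  for index, group in enumerate(concept_groups)
                  for word in group}
    over, under = [], []
    for a, b in combinations(sorted(concept_of), 2):
        same_concept = concept_of[a] == concept_of[b]
        same_stem = stem(a) == stem(b)
        if same_stem and not same_concept:
            over.append((a, b))   # unrelated words share a stem
        elif same_concept and not same_stem:
            under.append((a, b))  # related words get different stems
    return over, under


# Hand-built reference groups (an assumption made for this illustration).
groups = [{"century"}, {"center"}, {"acquiring", "acquisition"}]
over, under = stemming_errors(toy_stem, groups)
print("overstemmed:", over)
print("understemmed:", under)
```

Here `center` and `century` are reported as overstemmed and `acquiring` and `acquisition` as understemmed. Such pair counts measure only linguistic grouping, so they do not replace an evaluation against the information access task itself.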
Various approaches to stemming have been undertaken and evaluated. For the task of information retrieval, the use of a rich morphological stemmer enhances recall but degrades precision when compared with a minimal `s` removal stemmer.
There exists, therefore, a need in the art for techniques which combine morphological analysis and syntactic parsing to detect and conflate morphosyntactic variants through accurate and efficient methods.