1. Technical Field
The present disclosure generally relates to searching information. More particularly, and without limitation, the present disclosure relates to methods and systems for creating an adaptive thesaurus and for enhancing a search using an adaptive thesaurus.
2. Background Information
With vast amounts of information being stored in electronic form, search tools help users find specific information they are looking for. For example, Internet search engines enable users to search for specific information on the Internet, and database search tools enable users to search for specific information stored in large databases. However, conventional search techniques have several problems, discussed below.
In the search field, the term “recall” refers to the proportion of all relevant documents in a corpus of documents that is retrieved by a search. In a Boolean full-text search engine, a query for “automobile” will fail to retrieve or “recall” any text that refers to the concept of automobiles using only the term “car.” Therefore, a user who searches for “automobile” may fail to find important and desired documents containing text that instead discusses automobiles using the term “car.” Expanding the search query to “automobile OR car” will retrieve or “recall” the text missed by the “automobile” query. Accordingly, one strategy for improving recall is to enhance a query by expanding the original terms of the query with synonyms obtained from a thesaurus.
However, in general, no two terms are perfectly synonymous, and thus expansion of one term with a second term will typically result in a loss of precision. The “precision” of a search result refers to the proportion of all retrieved documents that are relevant to a given concept. Searching for “automobile OR car” rather than just “automobile” will likely return texts with references to a railroad car, which is not encompassed in the automobile concept being searched by the user. The inclusion of such texts that are irrelevant to automobiles would therefore diminish the precision of the search result. If precision falls too low, a simple query expansion may fail to effectively enhance the search.
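The trade-off described in the two preceding paragraphs can be illustrated with a minimal sketch. The five-document corpus, its relevance labels, and the function names below are hypothetical and chosen only for illustration; the search is a simple whole-word Boolean OR match, not any particular search engine.

```python
# Hypothetical toy corpus: document id -> (text, relevant to automobiles?).
CORPUS = {
    1: ("the automobile was parked outside", True),
    2: ("she bought a new car last week", True),
    3: ("the car needs an oil change", True),
    4: ("the railroad car carried coal", False),
    5: ("the ship left the harbor", False),
}


def boolean_or_search(terms):
    """Return ids of documents containing at least one of the query terms."""
    wanted = set(terms)
    return {doc_id for doc_id, (text, _) in CORPUS.items()
            if wanted & set(text.split())}


def recall_and_precision(retrieved):
    """Compute (recall, precision) of a retrieved set against the labels."""
    relevant = {doc_id for doc_id, (_, is_rel) in CORPUS.items() if is_rel}
    hits = retrieved & relevant
    recall = len(hits) / len(relevant)
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


original = boolean_or_search(["automobile"])         # retrieves {1}
expanded = boolean_or_search(["automobile", "car"])  # retrieves {1, 2, 3, 4}

print(recall_and_precision(original))  # recall 1/3, precision 1.0
print(recall_and_precision(expanded))  # recall 1.0, precision 0.75
```

In this toy corpus the expansion raises recall from one third to 1.0, but the railroad-car document lowers precision from 1.0 to 0.75.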
A second method of expanding a query to enhance the recall of texts pertaining to a concept is known as “stemming,” in which a query term is expanded with its morphological variants. For example, the concept of “to consider” can be referenced in a text by any of the following morphological variants: consider, considers, considered, considering, and consideration. These variants can each be used to expand the others. However, as with the example of the railroad car, expanding “consider” with a non-synonymous morphological variant (e.g., considerate) will undesirably diminish precision, again failing to enhance the search.
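As a rough illustration of the stemming problem described above, the following sketch uses a deliberately naive suffix-stripping rule. The suffix list and the function name `naive_stem` are assumptions made for this example and do not correspond to any established stemming algorithm.

```python
# Hypothetical suffix list, ordered so longer suffixes are tried first.
SUFFIXES = ("ation", "ate", "ing", "ed", "s")


def naive_stem(word, min_stem_len=4):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


variants = ["consider", "considers", "considered",
            "considering", "consideration"]

# All five variants listed in the text reduce to the same stem.
stems = {naive_stem(w) for w in variants}
print(stems)

# The non-synonymous variant "considerate" reduces to that stem as well,
# so expanding by stem would also retrieve texts about being considerate.
print(naive_stem("considerate"))
```

Because the rule cannot tell “consideration” from “considerate,” both collapse to the stem “consider,” which is the loss of precision the paragraph describes.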
Thus, there are several potential problems associated with query expansion. As the preceding examples illustrate, although query expansion increases recall by increasing the number of documents retrieved, it also normally reduces precision. This follows from the formula for precision, in which the number of retrieved documents is the denominator: each irrelevant document added by the expansion enlarges the denominator without enlarging the numerator. Queries must therefore be expanded in a way that increases recall without significantly decreasing precision.
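The definitions of recall and precision given above can be written compactly. The symbols are assumptions introduced here for illustration: R denotes the set of relevant documents in the corpus, and A denotes the set of documents retrieved by the query.

```latex
\[
\text{recall} = \frac{|R \cap A|}{|R|},
\qquad
\text{precision} = \frac{|R \cap A|}{|A|}
\]
```

Expanding a query with OR replaces A with a superset A' of A. The numerator |R ∩ A'| cannot be smaller than |R ∩ A|, so recall cannot fall. Precision, however, falls whenever the proportion of relevant documents among those newly added is lower than the precision of the original query.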
Another problem with query expansion is data glut. A data glut occurs when a search returns more texts than can be analyzed by the user. Since query expansion normally results in the recall of more texts, query expansion often entails a risk of creating a data glut. To mitigate this problem, query expansion may be accompanied by a relevance ranking system. A popular ranking algorithm called “term frequency-inverse document frequency” (TF-IDF) can rank texts returned by a search by “relevance” and place the most relevant retrieved texts at the top of a result set, thereby mitigating the data glut problem. Even so, expansion of a query with terms that occur too frequently or that are insufficiently synonymous can create a data glut that the ranking algorithms cannot sufficiently mitigate.
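The relevance ranking mentioned above can be sketched as follows. TF-IDF has many variants; this example assumes one basic form (term count divided by document length, multiplied by the natural logarithm of the corpus size divided by the number of documents containing the term). The toy documents and the function name `tf_idf_rank` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy documents, already lowercased.
DOCS = {
    "a": "car car engine repair",
    "b": "railroad car schedule",
    "c": "ship harbor schedule",
}


def tf_idf_rank(query_terms, docs):
    """Rank documents matching any query term by summed TF-IDF score."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each query term.
    df = {t: sum(1 for toks in tokenized.values() if t in toks)
          for t in query_terms}
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        if not any(counts[t] for t in query_terms):
            continue  # Boolean OR filter: skip documents with no query term
        scores[doc_id] = sum(
            (counts[t] / len(tokens)) * math.log(n_docs / df[t])
            for t in query_terms if counts[t]
        )
    # Highest score first.
    return sorted(scores, key=scores.get, reverse=True)


print(tf_idf_rank(["car", "engine"], DOCS))
```

Here document "a" ranks above document "b" because it mentions “car” more often relative to its length and also contains the rarer term “engine.” Note that ranking only reorders the result set; it does not remove the irrelevant railroad document from it.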
Furthermore, words of natural languages may be polysemous (have multiple meanings). For example, in the English language, the word “bow” may be a gesture, a weapon, the front of a ship, or a decoration. Thus, using a conventional thesaurus to expand a search query for “prow” with “bow” will retrieve many texts unrelated to the prow of a ship and thereby appreciably diminish precision. Conventional, general-purpose thesauri are therefore unsuited to specific domains of knowledge, because they contain weak or false synonyms that unacceptably diminish precision. Conversely, special-purpose thesauri are unsuited to general domains, because they may not contain commonly accepted synonyms, and may fail to adequately expand queries to enhance recall.
In addition, conventional statistical thesauri (also known as association thesauri) use co-occurrence matrices, wherein terms that co-occur in a text are deemed synonyms. However, such synonyms do not comport with the usual linguistic definition of synonyms as terms that individually refer to a single concept. For example, the terms “gun” and “bullet” often co-occur in the same document. Consequently, conventional statistical methods of thesaurus construction will consider these two terms synonyms, even though they refer to different concepts. Therefore, context-free expansion of a term with such false synonyms can lead to a considerable loss of precision.
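The co-occurrence approach described above can be sketched with document-level pair counts. The four toy documents, the stopword list, and the function name `cooccurrence_counts` are hypothetical; real association thesauri typically apply further weighting and thresholds that are omitted here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy documents, already lowercased.
DOCS = [
    "the gun fired a bullet",
    "a bullet was recovered near the gun",
    "the pistol fired once",
    "police found the gun and one bullet",
]
STOPWORDS = {"the", "a", "and", "was", "one", "near", "once"}


def cooccurrence_counts(docs):
    """Count, for each pair of terms, the documents in which both appear."""
    counts = Counter()
    for text in docs:
        # Sort so each pair is always stored in alphabetical order.
        terms = sorted(set(text.split()) - STOPWORDS)
        counts.update(combinations(terms, 2))
    return counts


counts = cooccurrence_counts(DOCS)
print(counts[("bullet", "gun")])   # co-occur in three documents
print(counts[("gun", "pistol")])   # never co-occur in this corpus
```

In this corpus the strongest association is between “gun” and “bullet,” which are related but not synonymous, while the near-synonyms “gun” and “pistol” never appear together. A thesaurus built purely from such counts would therefore suggest the false synonym and miss the true one.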
In view of the foregoing, there is a need for improved methods and systems that provide accurate search results.