The present invention relates generally to computer-implemented techniques for information retrieval. More specifically, it relates to query-based search engines and methods for improved searching through the use of contextual information.
The basic goal of a query-based document retrieval system is to find documents that are relevant to the user's input query. Since a typical query comprises only a few words, prior art techniques are often unable to discriminate between documents that are actually relevant and others that simply happen to use the query terms.
Conventional search engines for unstructured text documents can be divided into two groups: keyword-based, in which documents are ranked on the incidence (i.e., the existence and frequency) of keywords provided by the user, and categorization-based, in which information within the documents to be searched, as well as the documents themselves, are pre-classified into “topics” that are used to augment the retrieval process. The basic keyword search is well-suited for queries in which the topic can be described by a unique set of search terms. This method selects documents based on exact matches to these terms and then refines searches using Boolean operators (and, not, or) that allow users to specify which words and phrases must and must not appear in the returned documents. Constructing Boolean search queries is laborious, and most people find them difficult to use effectively. Moreover, unless the user can find a combination of words appearing only in the desired documents, the results will generally contain too many unrelated documents to be of use.
Several improvements have been made to the basic keyword search. Query expansion is a general technique in which keywords are used in conjunction with a thesaurus to find a larger set of terms with which to perform the search. Query expansion can improve recall (i.e., it results in fewer missed documents) but usually at the expense of precision (i.e., it results in more unrelated documents), due in large part to the increased number of documents returned. Natural language parsing falls into the larger category of keyword pre-processing, in which the search terms are first analyzed to determine how the search should proceed (e.g., Infoseek's Ultraseek Server). For example, the query “West Bank” comprises an adjective modifying a noun. Instead of treating all documents that include either “west” or “bank” with equal weight, keyword pre-processing techniques can instruct the search engine to rank documents that contain the phrase “west bank” more highly. IBM's TextMiner makes extensive use of query expansion and keyword pre-processing methods, recognizing approximately 10^5 commonly used phrases. Even with these improvements, keyword searches may fail in many cases where word matches do not signify overall relevance of the document. For example, a document about experimental theater space is unrelated to the query “experiments in space” but may contain all of the search terms.
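By way of illustration only, query expansion and phrase-aware keyword ranking of the kind described above might be sketched as follows. The thesaurus contents, the function names, and the fixed phrase bonus are assumptions made for this example; they are not taken from any of the named products.

```python
# Hypothetical thesaurus; a real system would draw on a far larger resource.
THESAURUS = {"car": ["automobile", "vehicle"]}


def expand_query(terms, thesaurus):
    """Return the query terms plus their thesaurus entries, without duplicates."""
    expanded = []
    for term in terms:
        for word in [term] + thesaurus.get(term, []):
            if word not in expanded:
                expanded.append(word)
    return expanded


def contains_phrase(words, terms):
    """True if the terms appear in the word list as one contiguous phrase."""
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


def keyword_score(document, terms, phrase_bonus=2.0):
    """Count keyword occurrences and add a bonus when the exact phrase appears.

    The bonus value is an arbitrary assumption chosen for illustration.
    """
    words = document.lower().split()
    score = float(sum(words.count(term) for term in terms))
    if len(terms) > 1 and contains_phrase(words, terms):
        score += phrase_bonus
    return score
```

In this sketch a document containing the phrase “west bank” scores higher than one that merely contains both words, while expansion widens the set of matching terms and therefore the set of documents returned.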
Categorization methods attempt to improve the relevance by inferring “topics” from the search terms and retrieving documents that have been predetermined to contain those topics. The general technique begins by analyzing the document collection for recognizable patterns using standard methods such as statistical analysis (e.g., Excite's Web Server) and neural network classification (e.g., Autonomy's Agentware). As with all such analyses, word frequency and proximity are the parameters being examined and/or compiled. Documents are then “tagged” with these patterns (often called “topics” or “concepts”) and retrieved when a match with the search terms or their associated topics has been determined. In practice, this approach performs well when retrieving documents about prominent (i.e., statistically significant) subjects. Given the sheer number of possible patterns, however, only the strongest correlations can be discerned by a categorization method. Thus, for searches involving subjects that have not been pre-defined, the subsequent search typically relies solely upon the basic keyword matching method and is susceptible to the same shortcomings.
It is appropriate to note here that many categorization techniques use the term “context” to describe their retrieval processes, even though the search itself does not use any contextual information (i.e., how collections of words appear relative to one another in order to define a context). U.S. Pat. No. 5,619,709 to Caid et al. is an example of a categorization method that uses the term “context” to describe various aspects of its search. Caid's “context vectors” are essentially abstractions of categories identified by a neural network; searches are performed by first associating, if possible, keywords with topics (context vectors), or allowing the user to select one or more of these pre-determined topics, and then comparing the multidimensional directions of these vectors with the search vector via the mathematical dot product operation (i.e., a projection). In many respects, this process is identical to the keyword search in which word occurrence vectors are projected on a keyword vector.
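The projection of a word occurrence vector onto a keyword vector can be illustrated with a minimal sketch. The vocabulary, function names, and example documents below are assumptions made for this example, and the code does not reproduce the method of the cited patent.

```python
from collections import Counter


def occurrence_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def keyword_score(document, query, vocabulary):
    """Project the document's occurrence vector onto the keyword vector."""
    return dot(occurrence_vector(document, vocabulary),
               occurrence_vector(query, vocabulary))


# Illustrative vocabulary built from the query "experiments in space".
VOCABULARY = ["experiments", "in", "space"]
```

Under these assumptions, the documents “space experiments in orbit” and “experiments in theater space” receive the same score for the query “experiments in space”, because the projection records only which words occur and not how they are used relative to one another.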
U.S. Pat. No. 5,926,812 to Hilsenrath et al. describes an improvement to the ranking schemes of conventional search engines using a technique similar to categorization. Hilsenrath's application is rather specialized in that the search relies upon first having a set of documents about the topic of interest in order to retrieve more like them, rather than addressing the more difficult task of finding related documents using only a limited set of keywords provided by the user. Hilsenrath's method first analyzes a set of documents and extracts a “word cluster” which is analogous to the “topics” described above. The words defined by this word cluster are then fed to an existing keyword-based search engine, which returns a set of documents. These documents are then re-ranked by comparing the original word cluster with similar ones computed for each document. Although the comparison step does use context-like information (e.g., word pair proximities), the overall method is fundamentally limited by the fact that it requires already having local documents related to the topic of interest. The quality of the search is also limited by the quality and completeness of these local documents. In some sense, it is really an improvement to a ‘more like this document’ search feature rather than a complete query-based document retrieval method.
The disclosed invention provides a computer-implemented method for improving query-based document retrieval using the vast amount of contextual information (i.e., information about the relationships between words) within the document collection to be searched. The goals of a context-based search are 1) to take advantage of this wealth of information in order to automatically facilitate the specification of search topics from a limited set of search terms, and 2) to rank documents using a measure of contextual similarity that takes into account the usage and relationship of all words within each document. Essentially, words related to a particular topic tend to occur frequently in close proximity to one another. When the user specifies a query, the information compiled from the document collection can be enough to determine the desired topic and reconstruct the associated context (e.g., related words, usage proportions, etc.). Documents with the greatest contextual similarity are most likely to be relevant to the search topic.
As implied in the Background Art, categorization techniques also rely upon the fact that even without a “semantic” understanding of a text, highly useful contextual information can be read and compiled from a document collection by analyzing the statistics of word occurrence. But unlike categorization, context-based searching recognizes that in order to use this contextual information effectively, all of the statistics must be kept. Instead of analyzing the compiled data beforehand and extracting a limited set of correlations and patterns, the present invention stores all of the compiled contextual information for use at the time of the search. In other words, this technique is effective for queries about all topics. From just a few query terms supplied by the user, this technique is able to perform a true contextual search, that is, one in which word frequencies, word pair (and higher order) proximities, and proportions of word appearance are essential elements of the retrieval and ranking process.
The feature that fundamentally differentiates the present invention from all other search engines is its utilization of a context database (“C-db”), a repository of unfiltered (i.e., not pre-categorized) word frequency and proximity statistics compiled from the documents to be searched. Using the query terms supplied by the user, the context search method computes a search matrix to retrieve relevant documents. A search matrix is a set of numerical values that specify a topic of interest within a multidimensional word relationship space. This matrix contains information not only about words and phrases used to describe the desired subject matter but also about how these words and phrases appear relative to each other. It is important to note that search matrices are not and cannot be pre-determined; they are computed at the time of the search using the query terms supplied by the user. The added word relationship information from this search matrix draws upon information from the entire document collection to be searched and thus allows the subsequent search to apply a much more rigorous set of criteria for determining the relevance of each document to the topic of interest.
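As a purely illustrative sketch, a repository of unfiltered word frequency and proximity statistics could be compiled along the following lines. The class name, the whitespace tokenization, the three-word window, and the inverse-distance proximity weight are all assumptions made for this example; the text above does not prescribe them.

```python
from collections import Counter, defaultdict


class ContextDB:
    """Toy context database: raw word counts and pairwise proximity weights."""

    def __init__(self, window=3):
        self.window = window                   # assumed maximum word distance
        self.word_count = Counter()            # word -> occurrence count
        self.proximity = defaultdict(Counter)  # word -> neighbor -> weight

    def add_document(self, text):
        """Compile statistics from one document without filtering anything."""
        words = text.lower().split()
        self.word_count.update(words)
        for i, word in enumerate(words):
            for j in range(i + 1, min(i + 1 + self.window, len(words))):
                weight = 1.0 / (j - i)         # closer pairs weigh more
                self.proximity[word][words[j]] += weight
                self.proximity[words[j]][word] += weight

    def related_terms(self, term, k=3):
        """Return up to k words most strongly associated with the term."""
        neighbors = self.proximity.get(term.lower(), {})
        ordered = sorted(neighbors.items(), key=lambda item: (-item[1], item[0]))
        return [word for word, _ in ordered[:k]]
```

Because nothing is discarded when the statistics are compiled, related terms for any query word can be looked up at search time rather than being limited to topics identified in advance.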
The context-based search comprises five general steps, not necessarily performed in the following order. First, the C-db is generated by analyzing word usage at the same time the search engine indexes a document collection. At a minimum, parameters such as word frequency, occurrence counts, and proximities are compiled during this step. Next, for each query, one or more search matrices are computed from the query terms using the contents of the C-db. The third step involves generating (or retrieving) a word relationship matrix for each document, thus preparing for the fourth step, a matrix comparison calculation that produces the final weight used to rank the relevance of each document. An optional fifth step of refining the search may be performed by replacing the original search matrix (derived from the query terms) with the word relationship matrix from a particular document (i.e., a ‘more like this document’ function).
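The five steps above can be sketched end to end under simplifying assumptions. In this example a word relationship matrix is a sparse mapping from word pairs to inverse-distance proximity weights, the search matrix is taken to be every C-db entry that involves a query term, and the matrix comparison is a cosine similarity. These choices, and all function names, are hypothetical stand-ins chosen for illustration and are not the computations of the disclosed method.

```python
import math
from collections import Counter


def relationship_matrix(text, window=3):
    """Step 3: sparse word relationship matrix, (word_a, word_b) -> weight."""
    words = text.lower().split()
    matrix = Counter()
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + 1 + window, len(words))):
            matrix[tuple(sorted((word, words[j])))] += 1.0 / (j - i)
    return matrix


def build_context_db(documents, window=3):
    """Step 1: accumulate unfiltered pair statistics over the collection."""
    cdb = Counter()
    for text in documents:
        cdb.update(relationship_matrix(text, window))
    return cdb


def search_matrix(query_terms, cdb):
    """Step 2 (assumed rule): keep each C-db entry involving a query term."""
    terms = {term.lower() for term in query_terms}
    return {pair: w for pair, w in cdb.items() if terms & set(pair)}


def similarity(a, b):
    """Step 4 (assumed measure): cosine similarity of two sparse matrices."""
    shared = sum(w * b.get(pair, 0.0) for pair, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return shared / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(documents, matrix, window=3):
    """Steps 3 and 4: score each document and return indices, best first."""
    scored = [(similarity(matrix, relationship_matrix(doc, window)), index)
              for index, doc in enumerate(documents)]
    return [index for _, index in sorted(scored, key=lambda s: (-s[0], s[1]))]


def more_like_this(documents, index, window=3):
    """Step 5: reuse one document's matrix as the search matrix."""
    return rank(documents, relationship_matrix(documents[index], window), window)
```

With a small collection such as “space station experiments in orbit”, “experimental theater space downtown”, and “orbit experiments aboard the station”, a query on “space” and “experiments” ranks the third document above the second in this sketch, even though the third never uses the word “space”, because its word pairs overlap with the context drawn from the collection.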
The present approach improves on the prior art in several important ways. The additional information provided by the context database and, in turn, the search matrix narrows the description of the desired topic: the more specifically the area of interest is defined, the more accurate the results. Since the search uses the contextual (i.e., word relationship) information contained in the search matrix, documents are ranked not on the simple occurrence of individual words but on how these words appear in relation to each other. That is, a document must not only use relevant words in proximity to each other, but these words must be used in the proper proportions as well. This feature is particularly important for searches about topics described by common words (i.e., words associated with numerous subjects) because it is precisely the subtleties within collections of words, rather than the individual words themselves, that distinguish one topic from another.
Defining a search matrix also allows useful generalization because the search is not limited to the keywords provided by the user. Relevant documents are found even when the original search terms are not present. Moreover, this method is able to recognize and discard documents which use the original search terms but do not contain the proper context. In short, the context search method improves both precision and recall specifically in situations that are most troublesome for conventional keyword and categorization search engines: queries about less popular topics that are difficult to define with a limited number of search terms.