Each and every document, including patents and publications, cited herein is incorporated herein by reference in its entirety as though recited in full.
The present invention relates generally to data processing apparatus and corresponding methods for the retrieval of data stored in a database or as computer files. More particularly, the present invention relates to methods and systems to facilitate refinement of queries intended to specify data to be retrieved from a target data collection.
A significant trend today is the rapid growth in the amount of information available in electronic form. For example, unprecedented amounts of textual and symbolic information are becoming available on intranets and on the Internet as a whole. Unfortunately, tools for locating information of interest in these large collections are quite limited.
Typically, when searching for information on an intranet or the Internet, a user creates a query that is intended to specify a particular information need. An information retrieval system then interprets this query and searches a target data collection to identify items in the collection that are relevant to the query. The retrieval system then retrieves these items, or abstracts thereof, and presents them to the user. In the process of presentation, it is desirable for the retrieved items, or abstracts thereof, to be ranked in the order of their applicability to the expressed information need of the user. Unfortunately, both formation of well-focused queries and informative ranking of retrieved items are quite difficult to do well.
One factor contributing to the difficulty of retrieving information of interest from target collections, such as large full-text databases, is the imprecise nature of human languages. The richness of human language is a strength in expressing ideas with full conceptual generality. In addition, all human languages incorporate significant elements of ambiguity. Both richness and ambiguity create problems from a retrieval perspective.
Approaches to text retrieval are confronted with the fact that multiple words may have similar meanings (synonymy) and a given word may have multiple meanings (polysemy). The typical English word, for example, has at least a half dozen close synonyms. In addition, there generally are a much larger number of broader and narrower terms that are related to any given word of interest. It is often infeasible for a user to anticipate all ways in which an author may have expressed a given concept. A user may, for example, consider using the word car in a query. An author of a document, however, may have used a different term such as automobile or horseless carriage. The author also may have used a more general term, such as vehicle, or narrower terms, such as Ford or Mustang. Failure to include all of these variants in a query will lead to incomplete retrievals.
Polysemy creates a complementary problem. Most words in most languages have multiple meanings. In English, for example, the word fire has several common meanings. It can be used as a noun to describe a combustion activity. It also can be used as a verb meaning to terminate employment or to launch an object. A particularly polysemous word, such as strike, has dozens of common meanings. For the 2000 most polysemous words in English, the typical verb has more than eight common senses, and the typical noun has more than five. Using such a word in a query can result in much extraneous material being retrieved. For example, a person using the word strike in a query with the intent of retrieving material on labor actions also will be presented with material on baseball, air strikes, striking of oil, people who strike up a conversation, etc.
In information retrieval, two metrics generally are applied in evaluating incompleteness and imprecision of retrievals. Recall is a measure of the completeness of retrieval operations. For any given query and any given collection of documents, recall is defined as the fraction of the relevant documents from the collection that are retrieved by the query. Precision is defined as the fraction of retrieved documents that are, in fact, relevant to the user's information need. Both metrics typically are expressed as percentages. Historically, text retrieval systems typically have operated at recall and precision levels in the neighborhood of 20 to 30 percent. As the size of full-text databases has grown, however, these numbers have declined.
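The two definitions above can be illustrated with a minimal sketch. The function name precision_recall and the document identifiers are hypothetical and chosen only for this example; they do not come from any cited system.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions in [0, 1].

    retrieved: identifiers of the documents returned for a query
    relevant:  identifiers of all documents in the collection that
               are relevant to the user's information need
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 10 documents are retrieved, 3 of which are relevant,
# while the collection holds 12 relevant documents in total.
retrieved = [f"d{i}" for i in range(1, 11)]
relevant = ["d2", "d5", "d9"] + [f"r{i}" for i in range(1, 10)]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.0%}, recall = {recall:.0%}")
```

With these made-up numbers, precision is 3/10 and recall is 3/12, which falls in the 20 to 30 percent range mentioned above.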
Information retrieval from the Internet offers a good example of the problems presented by the tension between precision and recall and by large target collection volume. The types of queries typically formulated by users of Internet search engines frequently result in identification of tens of thousands of web pages as potentially relevant (see FIG. 1). Upon examination, the vast majority of these pages typically turn out to be irrelevant, i.e., they contribute to low precision. That, in itself, would be a lesser problem if the results were accurately ranked, i.e., if the most relevant web page were returned as the first result, the next most relevant as the second result, etc. Unfortunately, the quality of ranking provided by current tools is typically less than optimal. If the search criteria are narrowed to increase precision, some relevant documents might be excluded, leading to lower recall.
Most text retrieval systems accept queries in one of two forms, as Boolean logical constructions or as natural language inputs. Boolean constructions involve words connected by Boolean logical operators (e.g., AND, OR, NOT). For example, in response to the query: bear AND NOT (teddy OR beanie), a retrieval system using Boolean queries would retrieve documents that discuss bears, but not those that discuss teddy bears or beanie baby bears. Natural language inputs can take the form of sentences that are produced by the user. Alternatively, documents or portions of documents may be used as queries.
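As an illustration only, the Boolean query in the example above can be evaluated with set operations over an inverted index. The four sample documents, the tokenizer, and the helper names build_index and postings are assumptions made for this sketch. It performs no stemming, so "bears" would not match "bear".

```python
import re

# Hypothetical target collection.
docs = {
    "d1": "The grizzly bear is found in North America.",
    "d2": "A teddy bear makes a good gift.",
    "d3": "Collectors trade beanie baby bear toys.",
    "d4": "Stock prices fell again today.",
}


def build_index(docs):
    """Map each lowercase term to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(re.findall(r"[a-z]+", text.lower())):
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(docs)


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return index.get(term, set())


# bear AND NOT (teddy OR beanie)
result = postings("bear") - (postings("teddy") | postings("beanie"))
print(sorted(result))
```

Here AND NOT maps to set difference and OR maps to set union, so only the document about grizzly bears survives.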
In theory, for any given information need and target information collection, a Boolean query could potentially be constructed that could be used to retrieve relevant information from the target collection with 100% recall and 100% precision. For realistic queries and collections, however, the corresponding Boolean construct might be very large. Some people find it difficult or uneconomical to create complex Boolean queries. This is strongly demonstrated by the observation that, on average, queries used with Internet search engines consist of slightly over two terms. Similar averages are also seen for queries employed on large intranets in government and industry.
While some users are capable of forming non-trivial queries containing more than two terms, some users have learned that modest increases in initial query complexity bring relatively limited increases in the quality of results. Better results are typically obtained when the user reviews selected documents that are retrieved and then makes iterative modifications to the query. Even in this case, however, there are restrictions. First, the initial query often constrains the scope of subsequent iterations. Not knowing what relevant information was excluded by the initial query, the user has few, if any, clues available that indicate how to modify the query to include missed relevant information in subsequent retrievals. Second, it is often difficult to think through what query modifications would be desired in order to improve a given result. The amount of time and effort required to produce effective Boolean queries with current tools is greater than many users invest.
One technique shown to be of considerable value when directly applied to text retrieval is Latent Semantic Indexing (LSI). See S. T. Dumais, LSI meets TREC: A status report, THE FIRST TEXT RETRIEVAL CONFERENCE (TREC1), NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY SPECIAL PUBLICATION 500-207, pp. 137-152 (1993) [DUMAIS I]; S. T. Dumais, Latent Semantic Indexing (LSI) and TREC-2, THE SECOND TEXT RETRIEVAL CONFERENCE (TREC2), NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY SPECIAL PUBLICATION 500-215, pp. 105-116 (1994) [DUMAIS II]; S. T. Dumais, Using LSI for information filtering: TREC-3 experiments, THE THIRD TEXT RETRIEVAL CONFERENCE (TREC3), NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY SPECIAL PUBLICATION (1995) [DUMAIS III]. Text retrieval solutions using LSI make use of a vector space representation of document contents to extract presumed underlying semantic meaning from a target collection. LSI has the following desirable features:
Language independence;
The ability to identify relevant documents that do not include words specified in the query; and
Independence of subject matter.
LSI techniques are described in Scott C. Deerwester, et al, Indexing by Latent Semantic Analysis, JOURNAL OF THE SOCIETY FOR INFORMATION SCIENCE, 41(6), pp. 391-407 (October, 1990) and U.S. Pat. No. 4,839,853, entitled COMPUTER INFORMATION RETRIEVAL USING LATENT SEMANTIC STRUCTURE to Deerwester et al. issued Jun. 13, 1989. The optimality of this technique is shown in C. Ding, A Similarity-based Probability Model for Latent Semantic Indexing, PROCEEDINGS OF THE 22ND ANNUAL SIGIR CONFERENCE, Berkeley, Calif. (August 1999) [DING].
LSI techniques operate on a collection of text passages, typically referred to in the literature of the art as “documents.” The term “document”, however, in this case can include paragraphs, pages, or other subdivisions of text and not necessarily or only documents in the usual sense (i.e., externally defined logical subdivisions of text). Along these lines, this discussion refers to the text passages of interest as “documents.”
The use of LSI is illustrated with reference to FIG. 2. As a preliminary step in using the LSI technique, a large sparse matrix (the T×D matrix) is formed. Each row in the T×D matrix corresponds to a term that appears in the documents of interest, and each column corresponds to a document. Each element (m,n) in the matrix corresponds to the number of times that the word m occurs in document n. Referring to FIG. 3, the known technique of singular value decomposition (SVD) can be used to reduce the T×D matrix to a product of three matrices, including a matrix that has non-zero values only on the diagonal. Small values on this diagonal, and their corresponding rows and columns in the other two matrices, are then deleted. This truncation process is used to generate a vector space of reduced dimensionality, as illustrated in FIG. 4, by recombining the three truncated matrices into a (T×D)′ matrix. Both terms and documents are located at specific positions in this new vector space.
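The matrix construction, decomposition, and truncation described above can be sketched with NumPy as follows. This is a toy illustration under stated assumptions, not the implementation of any cited reference: the five sample documents, the whitespace tokenizer, and the choice k = 2 are hypothetical, and a practical system would use a sparse matrix and typically some form of term weighting.

```python
import numpy as np

# Hypothetical collection of five short "documents".
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "labor strike union",
    "union strike vote",
]

# Rows are terms, columns are documents; element (m, n) is the number of
# times term m occurs in document n.
terms = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# SVD: A = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values and delete the rest, together with the
# corresponding columns of U and rows of Vt.
k = 2
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Recombine the truncated factors into the reduced-rank (T x D)' matrix.
A_k = U_k @ np.diag(s_k) @ Vt_k

# One common choice of coordinates in the reduced space.
term_vectors = U_k * s_k    # one row per term
doc_vectors = Vt_k.T * s_k  # one row per document

print(A.shape, A_k.shape, term_vectors.shape, doc_vectors.shape)
```

The value of k controls how much of the original matrix is discarded. The literature reports values in the low hundreds for realistic collections, whereas k = 2 is used here only to keep the example small.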
The primary application of latent semantic indexing has been in the area of information retrieval. In that application, queries are treated as pseudo-documents. Documents are ranked in terms of similarity to the query based on a cosine measure between the vector corresponding to the query and the vector corresponding to that document in the LSI vector space. Experiments have shown that closeness of documents in this sense is a good proxy for closeness in terms of information content. See [DUMAIS I], [DUMAIS II], [DUMAIS III]. In known applications, the LSI approach is used to conduct the overall search operations and to provide a ranking of the results.
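A minimal sketch of this ranking step follows. It assumes the fold-in convention from the LSI literature, in which a query's term-count vector q is mapped to q^T U_k S_k^{-1} and compared with the rows of V_k. Other scalings are in use, and the sample collection and the helper names fold_in, cosine, and rank are hypothetical.

```python
import numpy as np

# Hypothetical collection; note that document 1 never uses the word "car".
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "labor strike union",
    "union strike vote",
]
terms = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
doc_vectors = Vt[:k, :].T  # one k-dimensional row per document


def fold_in(query):
    """Treat the query as a pseudo-document and place it in the LSI space."""
    counts = np.array([query.split().count(t) for t in terms], dtype=float)
    return (counts @ U_k) / s_k


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def rank(query):
    """Return (score, document index) pairs, most similar first."""
    q = fold_in(query)
    scores = [(cosine(q, doc_vectors[j]), j) for j in range(len(docs))]
    return sorted(scores, reverse=True)


ranking = rank("car")
for score, j in ranking:
    print(f"{score:+.3f}  {docs[j]}")
```

In this toy collection the document "automobile engine repair" scores close to 1 for the query "car" even though it does not contain that word, while the two labor-related documents score close to 0.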
The present invention provides significant benefits over existing systems and methods for retrieval of data.
According to one aspect of the invention, a vector space representation of a subset of a retrieved data set is used along with user feedback for generating hypotheses from which candidate modified queries can be derived.
According to another aspect of the invention, a vector space representation of a subset of a retrieved data set is used, along with user feedback, to test whether inclusion of one or more hypotheses in a modified query results in an improvement of the relevancy of the search to the user's needs.
According to other aspects of the invention, a vector space representation of a subset of a retrieved data set is used, along with user feedback, to generate and test one or more hypotheses regarding modifications to strategies for ranking a subset of results of a data retrieval operation.