The present invention relates generally to data processing systems and, more particularly, to an improved information retrieval system.
Information retrieval (IR) systems have been developed that allow users to identify particular documents of interest from among a larger number of documents. IR systems are useful for finding an article in a digital library, a news story in a broadcast repository, or a particular web site on the World Wide Web. To use such systems, the user specifies a query containing several words or phrases specifying areas of interest, and the system then retrieves documents it determines may satisfy the query.
Conventional IR systems use an ad hoc approach for performing information retrieval. Ad hoc approaches match queries to documents by identifying documents that contain the same words as those in the query. In one conventional IR system, an ad hoc weight is assigned to each matching word, the weight being computed from an ad hoc function of the number of times the word occurs in the document divided by the logarithm of the number of different documents in which the word appears. This ad hoc function was derived through an empirical process of attempting retrievals using the system and then modifying the weight computation to improve performance. Because conventional information retrieval systems use an ad hoc approach, accuracy suffers.
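By way of illustration only, the conventional word-matching approach described above might be sketched as follows. The function name `adhoc_score`, the three sample documents, and the query are hypothetical and do not come from any particular conventional system. The sketch also assumes that the logarithm is taken of one plus the document count, so that a word appearing in only one document does not cause a division by zero; the description above does not specify how that case is handled.

```python
import math
from collections import Counter

def adhoc_score(query_words, doc_words, doc_freq):
    """Sum a per-word weight over the query words that also occur in the document."""
    counts = Counter(doc_words)
    score = 0.0
    for word in set(query_words):
        tf = counts.get(word, 0)  # times the word occurs in this document
        if tf == 0:
            continue  # only matching words contribute
        # Assumption: log(1 + df) instead of log(df), to avoid dividing by log(1) = 0.
        score += tf / math.log(1 + doc_freq[word])
    return score

# Hypothetical toy collection.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "stock markets fell sharply".split(),
}
# Number of different documents in which each word appears.
doc_freq = Counter(word for words in docs.values() for word in set(words))

query = ["cat", "mat"]
ranking = sorted(docs, key=lambda d: adhoc_score(query, docs[d], doc_freq), reverse=True)
```

In this sketch the weight has no probabilistic interpretation; it is simply a quantity that can be adjusted by trial and error, which is the characteristic of the ad hoc approach noted above.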
Methods and systems consistent with the present invention provide an improved IR system that performs information retrieval by using probabilities. When performing information retrieval, the improved IR system utilizes both the prior probability that a document is relevant, independent of the query, and the probability that the query was generated by (that is, would be used to retrieve) a particular document given that the particular document is relevant. By using these probabilities, the improved IR system retrieves documents more accurately than conventional systems, which are based on an ad hoc approach.
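By way of illustration only, ranking documents by the product of these two probabilities might be sketched as follows. The text above does not specify how the probability of the query given a relevant document is computed, so this sketch assumes a simple model in which each query word is drawn either from the document itself or from the collection as a whole. The function name `rank_by_posterior`, the mixture weight `mix`, the sample documents, and the prior values are all hypothetical.

```python
import math
from collections import Counter

def rank_by_posterior(query_words, docs, priors, mix=0.3):
    """Rank documents by log P(D relevant) + log P(query | D relevant)."""
    all_words = [w for words in docs.values() for w in words]
    general = Counter(all_words)  # word counts over the whole collection
    total = len(all_words)
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        log_score = math.log(priors[doc_id])  # prior, independent of the query
        for q in query_words:
            p_general = general[q] / total
            if p_general == 0.0:
                continue  # assumption: words unseen in the collection are ignored
            p_doc = counts[q] / len(words)
            # Assumed mixture: the word comes from the document or from general usage.
            log_score += math.log(mix * p_doc + (1 - mix) * p_general)
        scores[doc_id] = log_score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical toy collection and priors.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "stock markets fell sharply".split(),
}
uniform = {"d1": 1 / 3, "d2": 1 / 3, "d3": 1 / 3}
skewed = {"d1": 0.05, "d2": 0.05, "d3": 0.9}

by_uniform_prior = rank_by_posterior(["cat", "mat"], docs, uniform)
by_skewed_prior = rank_by_posterior(["cat", "mat"], docs, skewed)
```

With equal priors the ordering is driven entirely by the query probability, whereas a sufficiently strong prior can move a document ahead of others that match the query words better, which shows how the two probabilities combine.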
In accordance with methods consistent with the present invention, a method in a data processing system having information items is provided. This method receives a query containing a query word from a user, determines a likelihood that at least one of the information items is relevant given the query word, and provides an indication that the at least one information item is likely relevant to the query word.
In accordance with systems consistent with the present invention, a data processing system is provided containing a secondary storage device with documents, a memory with a query engine, and a processor configured to run the query engine. The query engine is configured to receive a query with query words indicating a relevant one of the documents and configured to utilize a formula to determine which among the documents is the relevant document. The formula is based on a model for how the query words were generated to express a need for the relevant document.