Information retrieval (IR) systems allow users to identify particular documents of interest from among a larger collection of documents. IR systems are useful for finding an article in a digital library, a news story in a broadcast repository, or a particular web site on the World Wide Web. To use such a system, the user submits a query containing several words or phrases that describe areas of interest, and the system then retrieves the documents it determines may satisfy the query.
Conventional IR systems use an ad hoc approach to information retrieval. Such approaches match queries to documents by identifying documents that contain the same words as the query. In one conventional IR system, a weight is assigned to each matching word, the weight being computed as the number of times the word occurs in the document divided by the logarithm of the number of different documents in which the word appears. This function was derived empirically, by attempting retrievals with the system and then modifying the weight computation to improve performance. Because conventional IR systems rely on such an ad hoc approach, accuracy suffers.
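
The conventional word-matching weight described above can be illustrated with a short sketch. It is not taken from any particular system. The function names `tokenize` and `score_documents` are hypothetical, and the sketch assumes lowercased whitespace tokenization, a natural logarithm, and an offset of 1 inside the logarithm so that a word appearing in only one document does not cause a division by zero. None of these details are specified in the text.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace.
    return text.lower().split()


def score_documents(query, docs):
    """Score each document against the query by summing, over matching words,
    (occurrences in the document) / log(1 + number of documents containing the word).
    The "1 +" offset is an assumption added to keep the divisor positive."""
    tokenized = [tokenize(d) for d in docs]

    # Document frequency: in how many different documents each word appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_words = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        total = 0.0
        for word in query_words:
            if term_freq[word] > 0:  # only words shared by query and document
                total += term_freq[word] / math.log(1 + doc_freq[word])
        scores.append(total)
    return scores


docs = ["cat cat dog", "dog bird", "fish"]
scores = score_documents("cat dog", docs)
```

In this small example the first document scores highest because it contains both query words and repeats one of them, while the third document shares no words with the query and scores zero.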