1. Field of the Invention
The present invention relates generally to an information retrieval system, and more specifically to an information retrieval system adapted to improve ranking of documents retrieved in response to short queries.
2. Description of the Background Art
An information retrieval (IR) system is a computer-based system for locating, from an on-line source database or other collection, documents that are relevant to a user's input query. Until recently, most commercial IR systems, such as DIALOG® or LEXIS®, used Boolean search technology. In a Boolean search system, users must express their queries using the Boolean operators AND, OR, and NOT, and the system retrieves just those documents that exactly match the query criteria. Typically, there is no score or other indication of how well each document satisfies the user's information need.
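The exact-match behavior of a Boolean search can be sketched in a few lines. This is only an illustration, not the implementation of any of the commercial systems named above: the three documents, the `build_index` helper, and the inverted-index representation are all assumptions made for the example.

```python
def build_index(docs):
    """Map each word to the set of document ids that contain it (hypothetical helper)."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

# Hypothetical three-document collection, invented for this sketch.
docs = {1: "the dog bit the cat", 2: "the dog barked", 3: "the cat slept"}
index = build_index(docs)

# Boolean operators map onto set operations; a document either matches or it does not.
dog_and_cat = index["dog"] & index["cat"]   # AND: intersection
dog_or_cat = index["dog"] | index["cat"]    # OR: union
dog_not_cat = index["dog"] - index["cat"]   # NOT: difference

print(dog_and_cat, dog_or_cat, dog_not_cat)
```

Each result is an unordered set of matching documents, which reflects the absence of any score indicating how well a document satisfies the query.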
However, after years of research demonstrating the superiority of relevance-ranking, commercial systems began to offer this capability. Today millions of people use IR systems that employ relevance-ranking, also known as ranked searching, which is based on the "vector space model." In a relevance-ranked search system, users can simply type an unrestricted list of words, even a "natural-language" sentence, as their query. The system then does a partial matching computation and assigns a score to every document indicating how well it matches the user's interest. Documents are then presented to the user in order, from the best matching to the least matching. Relevance-ranking is described in Salton, et al., Introduction To Modern Information Retrieval, McGraw-Hill Book Co., New York (1983). Relevance-ranking IR systems are commonly used to access information on the Internet, through systems based on the WAIS (Wide Area Information Servers) protocol or through a variety of commercial World Wide Web indexing services such as Lycos, InfoSeek, Excite, or AltaVista. Relevance-ranking is also used in commercial information management tools such as AppleSearch, Lotus Notes and XSoft Visual Recall for searching databases or collections from individual or shared personal computers.
Relevance-ranking systems work as follows. In relevance-ranking, each word in every document of a collection is first assigned a weight indicating the importance of the word in distinguishing the document from other documents in the collection. The weight of the word may be a function of several components: (1) a local frequency statistic (e.g., how many times the word occurs in the document); (2) a global frequency statistic (e.g., how many times the word occurs in the entire collection of documents); (3) the DF measure (how many documents in the collection contain the word); and (4) a length normalization statistic (e.g., how many total words are in the document).
The following example demonstrates one possible term-weighting scheme for a relevance-ranking system. First, assume that a collection contains one-hundred (100) documents with one particular document containing only the text "the dog bit the cat." Assume further that the word "the" occurs in all 100 documents while the word "dog" occurs in five (5) documents and the word "cat" occurs in two (2) documents. Here, we use Term Frequency (TF), the number of times the word occurs in a particular document, as our local frequency statistic:
term=dog, TF=1,
term=the, TF=2,
term=cat, TF=1.
Here, we use DF as our global statistic:
term=dog, DF=5/100,
term=the, DF=100/100,
term=cat, DF=2/100,
where DF = (number of documents containing the term) / (total number of documents).
The inverses of DF (IDF) are calculated as follows:
term=dog, IDF=100/5=20,
term=the, IDF=100/100=1,
term=cat, IDF=100/2=50.
For this example, we will not use a length normalization statistic. Thus the final weights of each term using TF×IDF are as follows:
term=dog, TF×IDF=1×20=20,
term=the, TF×IDF=2×1=2,
term=cat, TF×IDF=1×50=50.
This list of weighted terms serves as the vector that represents the document. Note that terms found in more documents (such as "the") have lower weights than terms found in fewer documents (such as "cat"), even if they occur more frequently within the given document.
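The arithmetic of this example can be reproduced with a short sketch. The function name `tf_idf_weights` and its parameters are assumptions chosen for illustration; the document counts are the ones given in the example, and the word "bit" is omitted because the example assigns it no statistics.

```python
from collections import Counter

def tf_idf_weights(document_terms, doc_counts, total_docs):
    """Weight each term as TF x IDF, with IDF = total_docs / (documents containing the term).

    No length normalization is applied, matching the worked example.
    """
    tf = Counter(document_terms)
    return {term: tf[term] * (total_docs / doc_counts[term]) for term in tf}

# Counts taken from the example: "the" in 100 documents, "dog" in 5, "cat" in 2.
doc_counts = {"the": 100, "dog": 5, "cat": 2}
weights = tf_idf_weights(["the", "dog", "the", "cat"], doc_counts, total_docs=100)
print(weights)  # {'the': 2.0, 'dog': 20.0, 'cat': 50.0}
```

The resulting dictionary is one way to hold the document vector: "the" receives the lowest weight even though it has the highest term frequency in the document.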
Every document in the collection is then assigned a vector of weights, based on various weighting methods such as TF×IDF weighting and weighting that takes TF×IDF and a length normalization statistic into account. After a query is entered, the query is converted into a vector. A similarity function is used to compare how well the query vector matches each document vector. This produces a score for each document indicating how well it satisfies the user's request. One such similarity function is obtained by computing the inner product of the query vector and the document vector. Another similarity function computes the cosine of the angle between the two vectors. Once a score has been calculated for each document, the retrieved documents are output in order, from the one with the highest score to the one with the lowest score.
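The two similarity functions and the final ordering step can be sketched as follows. This is a minimal illustration under stated assumptions: vectors are stored as dictionaries from term to weight, the function names are invented, and the second document vector and the query weights are hypothetical.

```python
import math

def inner_product(query_vec, doc_vec):
    """Sum of products of weights for terms shared by the query and the document."""
    return sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())

def cosine(query_vec, doc_vec):
    """Cosine of the angle between the two vectors (0.0 if either vector is empty)."""
    norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_vec.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return inner_product(query_vec, doc_vec) / (norm_q * norm_d)

def rank(query_vec, doc_vecs, similarity=cosine):
    """Return (score, doc_id) pairs ordered from highest score to lowest."""
    scored = [(similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# "d1" uses the weights from the example document; "d2" is a hypothetical second document.
doc_vecs = {
    "d1": {"the": 2.0, "dog": 20.0, "cat": 50.0},
    "d2": {"the": 1.0, "dog": 20.0},
}
query_vec = {"cat": 1.0}  # hypothetical one-word query with unit weight
ranking = rank(query_vec, doc_vecs)
print(ranking)
```

The cosine variant divides by the vector lengths, which acts as a form of length normalization; the inner product does not.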
A study performed by D. E. Rose and D. R. Cutting on an experimental information retrieval system by Apple Computer, Inc. of Cupertino, Calif. shows that casual users of IR systems prefer to issue short queries. During a four-week period from December 1995 to January 1996, over 50% of the 10,044 queries issued by at least 4,686 users in Apple's system contained only a single word, and no query was longer than 12 words. The mean query length was 1.76 words. A subsequent study performed by Rose and Cutting shows that out of 10,000 queries issued in Apple's system, over 53% were single-word queries and 94% were queries of three words or fewer. Similar results were obtained for queries submitted to the Excite system and to the THOMAS system provided by the federal government. Rose, Daniel E. and Cutting, Douglass R., Ranking for Usability: Enhanced Retrieval for Short Queries, (submitted for publication, September 1996). Other studies have confirmed the preference of casual users for issuing short queries. Hearst, Marti A., Improving Full-Text Precision On Short Queries Using Simple Constraints, Fifth Annual Symposium on Document Analysis and Information Retrieval, pp. 217-225 (1996).
The interfaces of the major Internet search services also encourage queries having few terms. The four well-known World Wide Web searching services (Lycos, InfoSeek, Excite, and AltaVista) present users with an entry field that accepts less than one line of text.
The statistical methods that provide relevance-ranking, such as "TF×IDF weighting" with the cosine similarity metric, attempt to "reward" documents that are well-characterized by each query term. In practice, this means that a document that has a very high value for some of the query terms may be ranked higher than a document that has a lower value for more of the query terms. Relevance-ranking algorithms are intended to achieve this outcome. However, users sometimes find that for short queries submitted to relevance-ranking IR systems, the users' goal of obtaining the most useful ordering of search results, from the most relevant document to the least relevant document, is not attained. Existing relevance-ranking algorithms may, in some circumstances when a query is short, assign higher scores to certain documents with low overlaps than to other documents with high overlaps. Overlap is determined by the number of terms common between the query and the document.
This problem is exemplified in FIG. 1, which is a table partially showing the results of a short query entered into the Apple Developer web site. The query entered by a user was "express modem," whereby the user probably intended to retrieve documents about the Apple Computer product by that name. The search results included 103 documents, and the documents with the top ten relevance scores are shown in FIG. 1. Column I shows the ranking of the search results based on relevance scores indicated by the symbol *. Column II shows the titles of the retrieved documents, and column III identifies which terms in the query were responsible for the documents being retrieved. The highest scoring document contained only the term "modem," as shown in row (a). This document discussed modems in general, without mentioning the term "express modem." The second highest scoring document contained only the term "express" (as shown in row (b)) and was not relevant to modems. The third highest scoring document did discuss the "express modem" product, as shown in row (c).
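The ordering described for rows (a) through (c) can be imitated with a small sketch. The term weights below are invented solely to reproduce the qualitative effect and are not taken from FIG. 1; the document labels, the `overlap` and `score` helpers, and the use of a simple sum of matching weights as the score are likewise assumptions.

```python
def overlap(query_terms, doc_vec):
    """Number of distinct query terms that also appear in the document."""
    return sum(1 for term in set(query_terms) if term in doc_vec)

def score(query_terms, doc_vec):
    """Sum of the document's weights for the query terms (unit query weights assumed)."""
    return sum(doc_vec.get(term, 0.0) for term in set(query_terms))

query = ["express", "modem"]

# Hypothetical weights: two documents each match one term strongly,
# and one document matches both terms with lower weights.
docs = {
    "modems in general": {"modem": 9.0},
    "express shipping": {"express": 8.0},
    "express modem product": {"express": 3.0, "modem": 4.0},
}

ranked = sorted(docs, key=lambda name: score(query, docs[name]), reverse=True)
for name in ranked:
    print(name, score(query, docs[name]), overlap(query, docs[name]))
```

With these assumed weights, the only document with an overlap of two is ranked third, behind two documents that each match a single query term.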
FIG. 1B shows the method used by the prior art to produce the results described above with reference to FIG. 1. The method starts with step 150, where a query defining the search criteria is issued to a database or other information retrieval system. Next, step 155 identifies a set of documents that meet the criteria defined in the query. Finally, step 160 assigns a relevancy ranking to each of the documents in the identified set using conventional relevancy ranking algorithms discussed above.
A possible solution to the short query ranking problem discussed above is to use queries based on Boolean search technology. However, the Boolean approach sacrifices the benefits of relevance-ranking, and research has shown that most casual users do not understand Boolean logic and have difficulty in using Boolean IR systems. Attempts have been made to ease user problems with Boolean systems with solutions that blend Boolean and relevance-ranking. Noreault, T., Koll, M., and McGill, M. J., Automatic Ranked Output From Boolean Searches In SIRE, Journal of the American Society For Information Science, Vol. 26, No. 6, pp. 333-39 (1977); and Salton, G., Fox, E. A., and Wu, H., Extended Boolean Information Retrieval, Communications of the ACM, Vol. 26, No. 12, pp. 1022-1036 (1983). However, because the above approaches combine Boolean and relevance-ranking, users are still required to express their queries as Boolean expressions if they wish to take advantage of the Boolean constraints. In addition, the above approaches do not take query length into account when scoring the relevance of documents.
G. Salton and C. Buckley have suggested that the statistical weighting of a short query should differ from the statistical weighting of a long query. Salton, G. and Buckley, C., Term-Weighting Approaches In Automatic Text Retrieval, Information Processing & Management, Vol. 24, No. 5, pp. 513-523 (1988). Salton and Buckley did not, however, suggest that a matching algorithm should be modified as a function of query length, nor did they propose a function that changes the statistical weighting scheme of query terms as the query lengthens or shortens.
One study that notes the short query problem is Hearst, Marti A., Improving Full-Text Precision On Short Queries Using Simple Constraints, Fifth Annual Symposium on Document Analysis and Information Retrieval, pp. 217-225 (1996). However, this approach ranks documents only within the confines of a Boolean search, and only if users input their queries in a prescribed way. It therefore constrains how users may enter queries, and it does not take query length into account. Additionally, although Hearst's system is described as targeting "short" queries, it appears to be optimized for much longer queries (8 words or more) than most users actually enter.
Furthermore, none of the above approaches work on an arbitrary relevance-ranking system.
Thus, there is a need for a system and method that overcome the short query problem of relevance-ranking information retrieval systems.