1. Field of the Invention
The present invention relates in general to the field of document based information retrieval in the intranet and the internet domain, and in particular to a method and system for re-ranking an existing ranked result set of documents.
2. Description of Related Art
Nowadays digital information systems provide easy access to large amounts of information. For example, users can access great quantities of information in databases on a network or even in personal computers. Mere access to large amounts of information has only limited value, however, without knowing what is useful about the information.
Searching for a specific piece of information in a huge amount of largely unstructured information can be a very cumbersome task. Although the probability is relatively high that the desired information exists somewhere in a collection of information and can potentially be found, it is at the same time buried under all the additional unwanted information.
Several methods have been proposed to retrieve a desired piece of information. A first approach is to structure raw data before a search is started. Structuring the raw data in advance can be done by introducing a taxonomy set (categories like “cars”, “health”, “computers” etc.) and assigning each retrieved document to one or more of those categories. This can be performed in advance, e.g. by an administrator.
Before a search is started, the user has to preselect one or more categories and thereby reduce the possible amount of returned information. Only the information stored in the selected categories is returned. A simple way to accomplish this is, for example, to maintain several different indexes for the search engine. The user can then select one or more indexes as a basis for the search.
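The category preselection described above can be illustrated by a minimal sketch. It assumes one inverted index per category and simple whitespace tokenization; the function names `build_category_indexes` and `search`, the sample documents and the category “animals” are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def build_category_indexes(documents):
    """Build one inverted index per category: category -> term -> set of doc ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, (categories, text) in documents.items():
        for category in categories:
            for term in text.lower().split():
                indexes[category][term].add(doc_id)
    return indexes


def search(indexes, selected_categories, query):
    """Return ids of documents in the selected categories containing every query term."""
    terms = query.lower().split()
    hits = set()
    for category in selected_categories:
        index = indexes.get(category, {})
        matches = [index.get(term, set()) for term in terms]
        if matches:
            hits |= set.intersection(*matches)
    return hits


# Hypothetical sample data: doc id -> (assigned categories, text).
documents = {
    "d1": (["cars"], "jaguar unveils a new sports car"),
    "d2": (["animals"], "the jaguar is a large cat"),
    "d3": (["cars", "computers"], "onboard computers in a modern car"),
}
indexes = build_category_indexes(documents)

cars_only = search(indexes, ["cars"], "jaguar")
both = search(indexes, ["cars", "animals"], "jaguar")
```

In this sketch, `cars_only` contains only `d1`, so a user who did not also select “animals” never sees `d2`. This is the kind of loss through a missing selection that the approach is exposed to.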
The drawback of the above described approach is that extra effort is necessary for preparing the raw data by defining the taxonomy set and by assigning each document to one or more categories (or indexes). Since the information is most often of a dynamic nature, the information or the categories (or both) have to be updated on a regular basis. Further, there is a certain chance that some information is lost because of a wrong assignment to categories or a missing selection by the user.
A second approach is to structure a search result after the search has finished. Structuring the result of a search is either based on the input the user made when he started the search with some query terms, or an attempt is made to dynamically find similarities inside the documents and group (cluster) them together.
The second approach can be implemented by way of clustering the results, which means finding “categories” dynamically by looking for similarities inside the returned documents. This can be achieved according to several different criteria, for example by scanning for lexical affinities (expressions that are related in some way) and bundling those documents that have a certain similarity. Thereby the total set of returned documents is split into several non-overlapping clusters that contain documents which are assumed to deal with the same or a similar context.
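The splitting of a result set into non-overlapping clusters can be sketched as follows. This is only an illustration under stated assumptions: instead of lexical affinities it uses the Jaccard similarity of plain term sets as a stand-in criterion, it assigns each document greedily to the first sufficiently similar cluster, and the name `cluster_results`, the threshold of 0.3 and the sample texts are hypothetical.

```python
def term_set(text):
    """Lower-cased set of whitespace-separated terms."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_results(results, threshold=0.3):
    """Greedy single-pass clustering: each document joins the first cluster
    whose representative (its first member) is similar enough, else it
    starts a new cluster. Every document ends up in exactly one cluster."""
    clusters = []  # list of (representative_terms, [doc_ids])
    for doc_id, text in results:
        terms = term_set(text)
        for rep_terms, members in clusters:
            if jaccard(terms, rep_terms) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((terms, [doc_id]))
    return [members for _, members in clusters]


# Hypothetical result set for the ambiguous query "jaguar".
results = [
    ("d1", "jaguar sports car engine"),
    ("d2", "jaguar car engine review"),
    ("d3", "jaguar big cat habitat"),
    ("d4", "jaguar cat habitat rainforest"),
]
clusters = cluster_results(results)
# clusters == [["d1", "d2"], ["d3", "d4"]]
```

The clusters are formed purely from term overlap; nothing in the procedure knows which of the two contexts the user actually wants.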
The drawback of the above second approach is that the search engine has no knowledge of which context the user is really looking for. Moreover, the clustering of the documents is performed on the basis of word tuples (a set of search terms) that occur with a certain relation to each other in the documents. Ambiguous terms can cause some documents to be scattered all over the clusters, although from the user's point of view they deal with the same context. To find those documents the user has to open and read a lot of uninteresting information.
An alternative way to implement the second approach, i.e. structuring the search result after the search has finished, is to sort the returned documents in descending order according to some comparable criterion. This method is commonly known as “Ranking”. The appearance of the search terms in each of the documents is a measure of the importance of the individual document. All values are normalized, so the higher the rank value is, the more importance is assumed.
Various algorithms are used to determine the individual rank value. Most often the document sizes, the index sizes, the total amount of returned information or other criteria are considered.
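A minimal sketch of such a ranking is given below. It assumes one particular criterion, namely the number of search term occurrences divided by the document size, and it normalizes by the best score so that rank values lie between 0 and 1. Actual search engines use various other formulas; the name `rank` and the sample documents are hypothetical.

```python
def rank(documents, query):
    """Return (doc_id, rank_value) pairs in descending order of rank value.

    Raw score: occurrences of query terms divided by document length.
    Rank value: raw score divided by the best raw score, so the top
    document always receives 1.0. Documents without any hit are omitted.
    """
    terms = set(query.lower().split())
    raw = {}
    for doc_id, text in documents.items():
        words = text.lower().split()
        hits = sum(1 for word in words if word in terms)
        if hits:
            raw[doc_id] = hits / len(words)
    if not raw:
        return []
    top = max(raw.values())
    ranked = [(doc_id, score / top) for doc_id, score in raw.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample documents.
documents = {
    "a": "jaguar jaguar car",
    "b": "jaguar is a cat",
    "c": "no match here",
}
ranking = rank(documents, "jaguar")
```

Here document `a` receives the rank value 1.0 and document `b` a lower one, while `c` is not returned at all.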
Still another way to implement the second approach is to refine the search by adding more precise search terms, known as “Narrow Query”. The user starts the search process with an initial query and examines some of the returned documents. To each document he assigns a relevance value that reflects whether the respective document is of high or low value. The search engine scans the marked documents for terms that occur with a high frequency and uses those terms to synthesize a new query. This new query favors the high-frequency terms of the documents with a high relevance and excludes the terms of the documents with a low relevance.
The drawback of this refinement is that, for good results, the user has to examine and mark a lot of documents; otherwise the refined query is more or less random.
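The synthesis of a refined query from marked documents can be sketched as follows. The sketch assumes a binary relevance value (high or low), whitespace tokenization and no stop word handling; the name `narrow_query`, the parameter `top_n` and the sample data are hypothetical.

```python
from collections import Counter


def narrow_query(initial_query, marked_docs, top_n=3):
    """Synthesize a refined query from (text, is_relevant) pairs.

    Terms that are frequent in highly relevant documents are added to the
    query; terms that are frequent in documents of low relevance are excluded.
    """
    initial_terms = initial_query.lower().split()
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in marked_docs:
        target = relevant if is_relevant else irrelevant
        target.update(text.lower().split())

    def frequent(counter, other):
        # Keep terms that are new and more frequent on this side than on the other.
        candidates = [
            (term, count)
            for term, count in counter.items()
            if term not in initial_terms and count > other[term]
        ]
        # Highest frequency first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda pair: (-pair[1], pair[0]))
        return [term for term, _ in candidates[:top_n]]

    return {
        "include": initial_terms + frequent(relevant, irrelevant),
        "exclude": frequent(irrelevant, relevant),
    }


# Hypothetical documents marked by the user after an initial query "jaguar".
marked = [
    ("jaguar car engine", True),
    ("jaguar car dealer", True),
    ("jaguar cat habitat", False),
]
refined = narrow_query("jaguar", marked, top_n=2)
# refined == {"include": ["jaguar", "car", "dealer"], "exclude": ["cat", "habitat"]}
```

With only three marked documents, the terms “dealer” and “engine” are tied and the choice between them is arbitrary, which reflects how strongly the refined query depends on the number of documents examined.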
The approaches described above have the common drawback that only the entered search terms can be taken into account. Since search terms are often ambiguous, they can occur in various totally different contexts and cause a lot of unwanted information to be returned. If, in addition, only a few terms are entered, there is a high probability that many documents get the same high rank value.
Further, according to a third approach described in an article by R. Fagin and E. L. Wimmers entitled “Incorporating user preferences in multimedia queries”, published in Proc. 1997 International Conference on Database Theory, pp. 247–261, it is proposed to weight the search terms in information retrieval with the help of user feedback and to allow weights to be applied to any sort of rules. In particular, a formula is described to re-sort and re-rank the result set of a query according to the weighting of the original search terms by the user.
The drawback of the above third approach follows from the common experience in information retrieval that the typical query consists of very few search terms (1 or 2 terms). In the vast majority of all searches, applying any weighting to the search terms would therefore have very little impact. The main reason why the average search is made with only one or two terms is that the user lacks a conception of what additional context information he needs to enter in order to improve the query and filter out the unwanted documents. In this scenario, the user would have to read some of the returned documents first to obtain some context information, which he in turn could then use to create additional query terms. This method would make sense only with a minimum set of query terms that reflect the desired context.
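The effect of user-assigned term weights on a ranking can be illustrated by the following sketch. It does not reproduce the formula of the cited article; it assumes a plain normalized weighted sum of per-term scores as a simplified stand-in, and the name `rerank`, the scores and the weights are hypothetical.

```python
def rerank(per_term_scores, weights):
    """Re-rank documents by a normalized weighted sum of their per-term scores.

    per_term_scores: doc id -> {term: score in [0, 1]}
    weights: term -> non-negative user weight (at least one must be positive)
    """
    total = sum(weights.values())

    def combined(scores):
        return sum(weights[term] * scores.get(term, 0.0) for term in weights) / total

    return sorted(
        per_term_scores,
        key=lambda doc_id: combined(per_term_scores[doc_id]),
        reverse=True,
    )


# Hypothetical per-term scores for a two-term query.
scores = {
    "d1": {"jaguar": 0.9, "engine": 0.1},
    "d2": {"jaguar": 0.5, "engine": 0.8},
}
equal = rerank(scores, {"jaguar": 0.5, "engine": 0.5})     # ["d2", "d1"]
skewed = rerank(scores, {"jaguar": 0.9, "engine": 0.1})    # ["d1", "d2"]

# With a single search term, the weight cancels out and the order is unchanged.
single_a = rerank(scores, {"jaguar": 1.0})
single_b = rerank(scores, {"jaguar": 0.2})
```

With two terms the weights change the order, whereas with a single term `single_a` and `single_b` are identical, which is why weighting has little effect on typical one-term or two-term queries.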
Finally, all of the above cited prior art approaches have the common drawback that, if the preparation of the raw data as described above is not feasible, or in a situation where information can be lost because of a user error, e.g. by selecting a wrong index, only a method that performs a post-search improvement is acceptable.
Moreover, all of the described approaches require a relatively high effort in opening and reading the contents of documents, whereby time is wasted on useless information.