A traditional information retrieval (IR) system allows a user to search a large data repository for specific information by accepting a user-input search token and returning the subset of the repository that matches the search token. For example, the search token can be a word or phrase, and the matches returned by the IR system can be all those documents of the repository that contain this word or phrase. To fulfill this function, the IR system contains some form of look-up table, which lists all possible search tokens, each along with all the documents in which the token appears. Alternatively, the data in the repository can be organized in a way that enables the search of certain descriptive elements of the individual documents, such as bibliographic data, so that the IR system determines matches based on these descriptive elements rather than on the entire contents of the repository. While IR systems as described above are useful for small or highly structured data repositories, they become inefficient as the size of the data collection increases, in particular for loosely structured or unstructured data.
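The look-up table described above can be illustrated with a minimal sketch. The function names `build_lookup_table` and `search`, the whitespace tokenization, and the three sample documents are assumptions made purely for illustration; an actual IR system would likely use more elaborate tokenization and storage.

```python
from collections import defaultdict


def build_lookup_table(documents):
    """Map each token to the set of document ids in which it appears.

    `documents` is a dict of document id -> text. Tokens are obtained by
    lowercasing and splitting on whitespace (an illustrative simplification).
    """
    table = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            table[token].add(doc_id)
    return table


def search(table, token):
    """Return the sorted ids of all documents that contain the token."""
    return sorted(table.get(token.lower(), set()))


# Hypothetical sample repository.
documents = {
    "doc1": "ranking web search results",
    "doc2": "bibliographic data for library search",
    "doc3": "hyperlink structure of the web",
}
table = build_lookup_table(documents)
matches = search(table, "search")
```

In this sketch, every document containing the token is returned with equal standing and in no meaningful order, which is the behavior that becomes impractical for very large repositories.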
A particularly striking example of a data repository for which traditional IR systems fail is the World Wide Web (the “Web”). An IR system basing search results solely on the occurrence of the search token on web sites would typically deliver many millions of search results, thus placing a significant burden on the user to narrow down the search with more sophisticated and/or more comprehensive search tokens. Current Web search engines therefore augment traditional IR methods by ranking the search results that match the search tokens according to one or more additional criteria. One such criterion is the general popularity of each web site relative to others, which can be measured, for example, in terms of the user traffic to the site or the number of links it receives from other sites. The latter approach, which exploits the hyperlink structure of the Web, is based on the rationale that the number of hyperlinks a web site receives from other sites is indicative of its quality or authority. Authority ranking methods typically determine the authority of each site recursively in terms of the authorities of all the sites linking to it and/or linked from it. While these methods have improved on traditional IR systems, the continuing growth of the Web renders their generic, context-independent use of authority increasingly insufficient, as the number of search results with similarly high authority often exceeds the number of results a human user could reasonably review. Moreover, the increasing number of Internet users brings with it a diversification in information needs, which is not adequately reflected in a ranking scheme that gives each link essentially equal weight regardless of context.
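The recursive determination of authority described above can be sketched as a simple iterative computation. This is one possible illustration, not a description of any particular search engine: the function name `authority_scores`, the damping factor of 0.85, the fixed iteration count, the even redistribution for sites without outgoing links, and the sample link graph are all assumptions. Only incoming links are considered here.

```python
def authority_scores(links, damping=0.85, iterations=100):
    """Estimate an authority score for each site from the link graph.

    `links` maps each site to the list of sites it links to. A site's
    score is computed repeatedly from the scores of the sites linking
    to it, with every outgoing link of a site given equal weight.
    """
    sites = set(links)
    for targets in links.values():
        sites.update(targets)
    sites = sorted(sites)
    n = len(sites)

    # Start from a uniform distribution of authority.
    scores = {site: 1.0 / n for site in sites}
    for _ in range(iterations):
        updated = {site: (1.0 - damping) / n for site in sites}
        for source in sites:
            targets = links.get(source, [])
            if targets:
                # Each link passes on an equal share of the source's score.
                share = damping * scores[source] / len(targets)
                for target in targets:
                    updated[target] += share
            else:
                # Assumption: a site without outgoing links spreads its
                # score evenly over all sites.
                share = damping * scores[source] / n
                for site in sites:
                    updated[site] += share
        scores = updated
    return scores


# Hypothetical link graph using reserved example domains.
links = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}
scores = authority_scores(links)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Note that each link is weighted identically regardless of the context in which it appears, so the resulting ranking is the same for every user and every search.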
Accordingly, there is a need for improved search and ranking methods that diversify the search results delivered in response to a given search token based on the context of the search and the needs of the particular user.