Contemporary information systems and networks, including large localised and distributed databases, the Internet generally, and the World Wide Web (“web”) in particular, contain huge quantities of information. However, locating documents containing information of particular interest remains a challenging problem. In response to this need, various strategies for locating and ranking documents of interest, generally in response to specific search queries provided by users, have been developed. An important application of such methods is that of searching for documents on the web, and a number of web search engines, including Google, Yahoo, AltaVista, Lycos and so forth, are well known to Internet users around the world.
The function of web search engines is to identify and rank documents, or more specifically web pages, of interest to a user. The search process typically begins with the user providing a search query, usually in the form of a list of search terms. The search engine then attempts to identify documents of particular interest to the user based upon the search query. The most common method by which this is achieved involves matching the various terms included in the search query with the contents of a corpus of pre-stored or otherwise suitably indexed web pages and/or other documents. Documents that include the terms appearing in the search query are generally known as “hits”. Most search engines make some attempt to rank the hits in order of relevance before returning a corresponding list of documents to the user.
A number of methods for ranking hits are known in the art. The simplest approach is to rank hits according to the number of terms in the search query appearing within each identified document. Accordingly, a document that contains all of a user's search terms is considered to be more relevant than a document including only some of the search terms. The main problem inherent in this simple approach is that it is wholly dependent upon the quality of the search terms selected by the user. Many common words are insufficiently specific to enable relevant documents to be distinguished from largely irrelevant documents, and the onus lies with the user to select specific search terms having a particular likelihood of appearing only in relevant documents. Experience suggests that this is an unreasonable expectation to place upon users, and particularly those users who are relatively inexperienced in the use of keyword-based search engines.
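By way of illustration only, the simple ranking approach described above can be sketched as follows. The function names, the tokenisation rule and the three-document corpus are all assumptions made for this example, and are not taken from any particular search engine.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_hits(query, corpus):
    """Return (doc_id, matched_term_count) pairs, best match first.

    A document counts as a "hit" if it contains at least one query term.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in corpus.items():
        matched = query_terms & set(tokenize(text))
        if matched:
            scored.append((doc_id, len(matched)))
    # Sort by descending match count, then by doc_id for a repeatable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored

# Hypothetical corpus: the user is interested in the animal, not the car.
corpus = {
    "doc_a": "Jaguar cars are built in England.",
    "doc_b": "The jaguar is a large cat found in the Americas.",
    "doc_c": "Leopards and jaguars are both large cats.",
}
ranking = rank_hits("jaguar large cat", corpus)
for doc_id, score in ranking:
    print(doc_id, score)
```

In this sketch the document about cars and the document about leopards and jaguars each match exactly one query term and so receive the same score, even though only the latter concerns the animal. The ranking is therefore only as good as the user's choice of terms.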
Various strategies have been employed to improve the quality of search results returned in response to simple keyword search queries. For example, it is known to be generally beneficial for the search engine to ignore words that are known to be particularly common, and to be of little significance in assessing the relevance of documents to a particular search query. Obvious examples of words having little or no significance in relation to the relevance of a document include (in English) “a”, “the”, “and”, “or” and so forth. This strategy goes some way towards enabling users to enter plain language queries; however, it cannot guarantee a high correlation between the less common words that are not ignored and the relevance of the returned documents to the user's actual search requirement.
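A minimal sketch of this common-word filtering is given below. The list of ignored words is a deliberately small, hypothetical one chosen for the example; practical systems use much longer lists.

```python
import re

# Hypothetical, deliberately tiny list of common words to ignore.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "is", "to", "for", "what"}

def content_terms(query):
    """Return the query's tokens, in order, with common words removed."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [token for token in tokens if token not in STOP_WORDS]

terms = content_terms("What is the best treatment for a sprained ankle?")
print(terms)
```

Note that a word such as “best” survives the filter in this example, although its presence in a document says little about whether the document concerns ankle injuries. This illustrates why removing common words alone does not ensure that the remaining terms correlate with relevance.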
Additional methods have therefore been proposed and/or implemented for improving the relevance of search results based upon information that is extrinsic to the user's search query. For example, hits may be ranked according to factors including the apparent popularity or relevance of the documents to other users who have performed similar searches in the past, the degree of interest of the documents inferred from the extent to which other documents refer or link to them, and/or other feedback received from previous users of the search engine. The inherent disadvantage of all approaches of these types is that the main criterion in assigning a high ranking to a particular document is its general level of “popularity”, and not its particular relevance to a specific user's search query.
While statistically there may be a general correlation between popularity and relevance, this is generally only the fortuitous result of the fact that, over time, a large population of users of a search engine will frequently tend to search for similar information. This should not be mistaken for a genuine and intrinsic ability of the search engine to provide appropriately ranked and highly relevant results in response to any single specific search query.
A further limitation of search engines based upon conventional keyword search queries is that a query including a larger number of search terms is, in general, likely to produce a smaller number of highly relevant documents. This apparently counter-intuitive result arises because, unless the search terms are particularly well-chosen by the user, the chances of all of them occurring, even in a highly relevant document, may be relatively small. This is unfortunate, because it would be highly desirable, particularly for inexperienced users of search engines, to be able to enter a relatively lengthy description of the information of interest to the user, for example in a natural language form, thereby relieving the user of the responsibility for preselecting the most appropriate search terms.
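The effect described above may be illustrated by the following sketch, which assumes a search engine that returns only those documents containing every term of the query. The function names and the three-document corpus are hypothetical.

```python
import re

def tokenize(text):
    """Return the set of lower-case alphanumeric word tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def hits_containing_all(query, corpus):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    return sorted(
        doc_id for doc_id, text in corpus.items() if terms <= tokenize(text)
    )

# Hypothetical corpus in which all three documents are relevant.
corpus = {
    "doc_a": "Treating a sprained ankle with rest, ice and compression.",
    "doc_b": "Ankle injuries: how to treat a sprain at home.",
    "doc_c": "Rest and ice help a sprained ankle recover.",
}
short_hits = hits_containing_all("ankle", corpus)
long_hits = hits_containing_all("sprained ankle rest ice compression", corpus)
print(len(short_hits), len(long_hits))
```

Under this assumption the single-term query matches all three documents, whereas the five-term query matches only one of them. The third document is excluded merely because it does not mention compression, although it is clearly relevant to the topic.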
Additionally, when a user first begins researching a particular topic of interest, they may not even be aware of the terms that are of greatest relevance to that topic. However, as the user's research proceeds and they begin to acquire additional knowledge of the topic, they may begin to identify additional key terms that could be used to improve the relevance of the search results. Unfortunately, inexperienced and/or untrained researchers often do not appreciate this fact, and therefore do not recognise the value of applying their improving knowledge of the topic to update their original search query. Furthermore, conventional search engines do not provide any encouragement or support for this kind of continuing improvement of search queries.
It will be appreciated that a number of the foregoing limitations of known search engines are, to some extent, a consequence of the fact that the basic interface for providing input to such search engines has remained substantially unchanged for many years. In particular, the conventional search engine interface enables the user to enter a limited search query consisting primarily of a small number of user-selected keywords. While the search query may be retained during a single session, or may be saved for subsequent retrieval (either on the search server system or, more typically, on the user's computer), this basic interface does not enable the search query to evolve in any meaningful way as the user's understanding of the topic improves. Equally significantly, while the user may be maintaining notes, comments, printouts and other records of their research as they proceed, the search engine is unable to take advantage of any of this information in order to improve the quality of search results.
One previous attempt to mitigate some of these limitations is disclosed in U.S. Pat. No. 6,353,825 (5 Mar. 2002), issued to Jay Michael Ponte (“Ponte”). Ponte describes a system and method in which a user is enabled to select one or more documents identified in an initial search based on a conventional search query, wherein the selected documents are those considered by the user to be of particular relevance to a topic of interest. In accordance with Ponte, the selected documents are analysed to identify distinguishing terms appearing therein, which are assumed to be relevant to the search topic, and the words thus extracted are used as the basis for an improved search. However, this approach remains unsatisfactory in a number of respects. For example, it is based solely upon the content of documents selected by the user, who is thus still potentially required to review multiple irrelevant documents in order to identify one or more documents of possible relevance. Additionally, the user can only review and select those documents that are identified in the initial search, and the effectiveness of Ponte thus remains strongly dependent upon the quality of the user's own initial search query which, as noted above, may be problematic. Further, the Ponte system and method are unable to distinguish between portions of a document that are relevant to the user's interest and those that are not. Accordingly, Ponte remains unable to utilise the user's own notes and comments on a topic, as these evolve in the course of ongoing research.
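The general idea of extracting distinguishing terms from user-selected documents may be sketched as follows. This is not the algorithm disclosed in Ponte: the scoring rule (the frequency of a term in the selected documents divided by one plus its frequency in the remaining documents), the function names and the sample documents are all assumptions made purely for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def distinguishing_terms(selected_docs, other_docs, top_n=3):
    """Return the terms most characteristic of the selected documents.

    Assumed scoring rule: count in selected docs / (1 + count in other docs).
    """
    selected_counts = Counter(t for doc in selected_docs for t in tokenize(doc))
    other_counts = Counter(t for doc in other_docs for t in tokenize(doc))
    scores = {
        term: count / (1 + other_counts[term])
        for term, count in selected_counts.items()
    }
    # Highest score first; ties are broken alphabetically for repeatability.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_n]

# Hypothetical initial hits for the query "jaguar"; the user selects two.
selected = [
    "Jaguar habitat loss threatens the jaguar population.",
    "Jaguar conservation protects rainforest habitat.",
]
others = [
    "The new Jaguar sedan has a powerful engine.",
    "Jaguar dealers report strong sedan sales.",
]
expanded = distinguishing_terms(selected, others, top_n=2)
print(expanded)
```

In this sketch the whole text of each selected document contributes to the extracted terms, so that any passages of a selected document unrelated to the user's interest would influence the improved search equally.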
Accordingly, it is an object of the present invention to provide improved methods, systems and software products for conducting searches of large numbers of web pages or other documents, which mitigate the various abovementioned limitations of conventional search engines.