Finding information on the World Wide Web has become increasingly difficult with the growth of the Web, and frequently resembles a search for a needle in a haystack. General-purpose search engines typically return large quantities of irrelevant information, which the user must sift and refine. In order to search effectively and obtain high-quality search results, users are required to engage in an interactive process, typically including the following steps:
    Choose a search engine and submit a query.
    Traverse the list of retrieved pages to find the relevant ones.
    Apply shallow browsing based on outgoing hyperlinks from the set of retrieved pages.
    Provide relevance feedback for “more like this” services.
    Refine the query repeatedly and resubmit it (possibly to other search engines).
Since searching the Web for precise information in this manner requires iterative user feedback, users must be connected to the Internet and interacting with the computer throughout an entire search session.
This model of interactive searching does not accord well with pervasive computing devices, which are being used increasingly for Internet access. Such devices include personal digital assistants (PDAs), hand-held computers, smart phones, TV browsers, wearable computers, and other mobile devices. Typically, pervasive devices are used to make only brief network connections while the user is outside the office or home. Furthermore, by their nature, pervasive devices lend themselves far less readily to user interaction than do desktop computers. There is therefore a need for more precise, non-interactive, “one-shot” search services, for users of both pervasive devices and desktop computers.
A number of Web sites offer tools that are intended to make searching more efficient. For example, Internet Search Agent (ISA) is a Java Web search tool that queries several popular search engines, automatically downloads the results, and then displays them in the user's browser. ISA can be configured as an unattended download agent that retrieves Web pages for viewing offline, or as an improved search engine that returns entire Web pages, rather than just a title and several lines of text. ISA is non-interactive, but it does not attempt to autonomously improve the precision of the user's search results.
SearchPad is an intelligent agent for Web search, metasearch and resource classification. It supports basic and advanced Boolean queries. It also allows users to specify a “phrase neighborhood” to search, in terms of words, sentences, and paragraphs. SearchPad offers “accept” and “reject” rules to support screening of results and allows users to give feedback by rating documents that it finds. These user preferences are reused for similar, subsequent searches and for defining search topics. Thus, SearchPad can learn rules and definitions of topics, but it is highly interactive and relies on the user either to build the rules explicitly or to provide relevance feedback by indicating which keywords make a page relevant or irrelevant.
SmartRanker is a ranking search engine that attempts to anticipate the user's information needs. It sends an intelligent agent to get search results from a number of popular Internet search engines. The results are analyzed, filtered, grouped and re-ranked by a ranking agent using a human-created knowledge base. The SmartRanker Web site does not specify how the knowledge base is built or specifically how the re-ranking is performed.
Karnak is a search service that guides the user through the process of building search queries that are structured to provide precise information. Karnak then searches the Web, adding what it considers to be the best information to a personal library that is created for each user. The library can be accessed from any Internet-capable computer. Karnak checks for dead and stale links before providing results and regularly updates users by e-mail on the status of their research.
Automatic query expansion has been recognized as an efficient tool for improving user search results. It is usually performed by adding terms related to the terms specified by the user, using a thesaurus or synonym table. For example, Xu and Croft describe and compare a number of query expansion techniques in “Query Expansion using Local and Global Document Analysis,” published in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1996), which is incorporated herein by reference. U.S. Pat. Nos. 4,823,306 and 5,987,457, whose disclosures are similarly incorporated herein by reference, also describe methods of query refinement in the context of text searching.
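By way of illustration only, expansion with a synonym table might be sketched as follows. The table contents, the function name expand_query, and the Boolean AND/OR output syntax are all assumptions made for this example; they are not taken from Xu and Croft or from the cited patents.

```python
# Hypothetical synonym table; a real system would use a thesaurus.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "cheap": ["inexpensive"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return a Boolean query in which each term is OR-ed with its synonyms."""
    clauses = []
    for term in query.lower().split():
        # Keep the user's own term first, followed by any related terms.
        group = [term] + [s for s in synonyms.get(term, []) if s != term]
        if len(group) == 1:
            clauses.append(term)
        else:
            clauses.append("(" + " OR ".join(group) + ")")
    # Terms the user typed remain jointly required.
    return " AND ".join(clauses)


print(expand_query("cheap car rental"))
# prints: (cheap OR inexpensive) AND (car OR automobile OR vehicle) AND rental
```

Terms without an entry in the table pass through unchanged, so the expanded query is never narrower than the original in the vocabulary it accepts.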
Web crawling can be used as a search technique to find pages having hyperlinks to or from a root site that is known to be relevant to the user's query. These linked pages are often relevant to the query, as well, even when they do not contain the exact search terms used in the query. The CLEVER crawler uses hypertext classification and topic distillation tools to focus its work within a specific topic domain, while ignoring unrelated and irrelevant material. This focused crawler is described by Chakrabarti et al., in “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” published in Proceedings of the Eighth World Wide Web Conference (Toronto, 1999), and incorporated herein by reference.
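The general idea of a crawl that stays within a topic can be sketched as below. This is a simplified illustration and not the CLEVER crawler itself: the in-memory link table WEB, the set TOPIC standing in for a hypertext classifier, and the function focused_crawl are hypothetical, and only outgoing hyperlinks are followed.

```python
from collections import deque

# Hypothetical link structure: page -> pages it links to.
WEB = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}
# Stand-in for a topic classifier: pages judged to be on topic.
TOPIC = {"root", "a", "c", "d"}


def focused_crawl(root, get_links, is_relevant, max_pages=100):
    """Breadth-first crawl that does not expand pages judged irrelevant."""
    visited = set()
    queue = deque([root])
    relevant = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        if page in visited:
            continue
        visited.add(page)
        if not is_relevant(page):
            continue  # prune: links of off-topic pages are not followed
        relevant.append(page)
        for target in get_links(page):
            if target not in visited:
                queue.append(target)
    return relevant


print(focused_crawl("root", lambda p: WEB.get(p, []), lambda p: p in TOPIC))
# prints: ['root', 'a', 'c']
```

In this toy graph, page "d" is on topic but is reachable only through the off-topic page "b", so it is never visited; this is the trade-off of pruning at irrelevant pages.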
Another system that combines Web search and crawling is Fetuccino-Alfredo, described by Ben-Shaul, et al., in “Adding Support for Dynamic and Focused Search with Fetuccino,” also published in Proceedings of the Eighth World Wide Web Conference (Toronto, 1999), and incorporated herein by reference. In this system, users provide a broad domain in which the search should be performed, in addition to their specific query. Fetuccino-Alfredo first identifies sites related to the broad domain, using a general-purpose search engine, and then dynamically searches for the narrow query by traversing the domain sites and their close neighbors.
A number of techniques have been proposed for topic distillation, so that the most authoritative pages in a collection of linked pages can be identified. One such technique is described by Kleinberg in “Authoritative Sources in a Hyperlinked Environment,” published in Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms (1998) and incorporated herein by reference. Aspects of this technique are also described in U.S. Pat. No. 5,884,305, whose disclosure is likewise incorporated herein by reference. Kleinberg proposes and tests an algorithmic formulation of the notion of “authority,” based on the mutually reinforcing relationship between a set of relevant, authoritative pages and a set of “hub pages” that join them together in a link structure. The relationship is used to compute hub and authority scores for the nodes in a graph of linked pages, indicating which of the pages are the most authoritative.
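The mutually reinforcing relationship between hubs and authorities can be illustrated by a simple iterative computation. The following is a minimal sketch under stated assumptions, not Kleinberg's published implementation: the function name hits, the fixed iteration count, the Euclidean normalization, and the toy link table are choices made for this example.

```python
from math import sqrt


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}
        # Normalize so that the scores do not grow without bound.
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth


# Hypothetical link structure: three hub pages pointing at two candidates.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub_scores, auth_scores = hits(LINKS)
print(max(auth_scores, key=auth_scores.get))  # prints: a1
```

Here "a1" receives the highest authority score because all three hubs link to it, and "h1" and "h2" score higher as hubs than "h3" because they link to both authorities.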
Another technique of this sort is described by Lempel and Moran in “The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect,” published in Proceedings of the Ninth World Wide Web Conference (Amsterdam, 2000), and incorporated herein by reference. SALSA examines random walks on graphs derived from the link structure of a collection of Web pages. The authors show that their approach uses the same meta-algorithm as does Kleinberg but is more efficient and, in some cases, more effective in identifying the meaningful authorities.
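A random walk of the kind SALSA examines can be sketched for authority scores as follows. This is a simplified illustration rather than the authors' implementation: it assumes a single connected component, a uniform starting distribution, and a fixed number of iterations, and the names salsa_authority and LINKS are hypothetical. Each step of the walk moves backward from an authority along a randomly chosen in-link to a hub, and then forward along a randomly chosen out-link of that hub.

```python
def salsa_authority(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    # Collect, for every linked-to page, the pages that link to it.
    in_links = {}
    for src, targets in links.items():
        for dst in targets:
            in_links.setdefault(dst, []).append(src)
    authorities = list(in_links)
    score = {a: 1.0 / len(authorities) for a in authorities}
    for _ in range(iterations):
        nxt = {a: 0.0 for a in authorities}
        for a, mass in score.items():
            hubs = in_links[a]
            for h in hubs:
                outs = links[h]
                # Probability of stepping back to hub h, then forward to b.
                share = mass / len(hubs) / len(outs)
                for b in outs:
                    nxt[b] += share
        score = nxt
    return score


# Hypothetical link structure: three hub pages pointing at two candidates.
LINKS = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
scores = salsa_authority(LINKS)
print({page: round(value, 3) for page, value in sorted(scores.items())})
# prints: {'a1': 0.6, 'a2': 0.4}
```

In this small example the scores settle at values proportional to the number of in-links of each page (three for "a1", two for "a2"), which suggests why such a walk can be cheaper to evaluate than repeated hub and authority updates.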