The present invention relates generally to methods and systems for computerized searching in large bodies of data, and specifically to efficient and effective search methods for use on the World Wide Web.
Finding information on the World Wide Web has become increasingly difficult with the growth of the Web, and frequently resembles a search for a needle in a haystack. General-purpose search engines typically return large quantities of irrelevant information, which the user must sift and refine. In order to search effectively and obtain high-quality search results, users are required to engage in an interactive process, typically including the following steps:
Choose a search engine and submit a query.
Traverse the list of retrieved pages to find the relevant ones.
Apply shallow browsing based on outgoing hyperlinks from the set of retrieved pages.
Provide relevance feedback for "more like this" services.
Refine the query repeatedly and resubmit it (possibly to other search engines).
Since searching the Web for precise information in this manner requires iterative user feedback, users must be connected to the Internet and interacting with the computer throughout an entire search session.
This model of interactive searching does not accord well with pervasive computing devices, which are being used increasingly for Internet access. Such devices include personal digital assistants (PDAs), hand-held computers, smart phones, TV browsers, wearable computers, and other mobile devices. Typically, pervasive devices are used to make only brief network connections while the user is outside the office or home. Furthermore, by their nature, pervasive devices are much less facilitative of user interactivity than are desktop computers. There is therefore a need for more precise, non-interactive, "one-shot" search services, for users of both pervasive devices and desktop computers.
A number of Web sites offer tools that are intended to make searching more efficient. For example, Internet Search Agent (ISA) (www.renegade-software.com/ISA) is a Java Web search tool that queries several popular search engines, automatically downloads the results, and then displays them on the user's browser. ISA can be configured as an unattended download agent that retrieves Web pages for viewing offline, or as an improved search engine that returns entire Web pages, rather than just a title and several lines of text. ISA is non-interactive, but it does not attempt to autonomously improve the precision of the user's search results.
SearchPad (www.searchpad.com) is an intelligent agent for Web search, metasearch and resource classification. It supports basic and advanced Boolean queries. It also allows users to specify a "phrase neighborhood" to search, in terms of words, sentences, and paragraphs. SearchPad offers "accept" and "reject" rules to support screening of results and allows users to give feedback by rating documents that it finds. These user preferences are reused for similar, subsequent searches and for defining search topics. Thus, SearchPad can learn rules and definitions of topics, but it is highly interactive and relies on the user either to build the rules explicitly or to provide relevance feedback by indicating which keywords make a page relevant or irrelevant.
SmartRanker (www.tooto.com/smartranker.html) is a ranking search engine that attempts to anticipate the user's information needs. It sends an intelligent agent to get search results from a number of popular Internet search engines. The results are analyzed, filtered, grouped and re-ranked by a ranking agent using a human-created knowledge base. The SmartRanker Web site does not specify how the knowledge base is built or specifically how the re-ranking is performed.
Karnak (www.karnak.com) is a search service that guides the user through the process of building search queries that are structured to provide precise information. Karnak then searches the Web, adding what it considers to be the best information to a personal library that is created for each user. The library can be accessed from any Internet-capable computer. Karnak checks for dead and stale links before providing results and regularly updates users by e-mail on the status of their research.
Automatic query expansion has been recognized as an efficient tool for improving user search results. It is usually performed by adding terms related to the terms specified by the user, using a thesaurus or synonym table. Xu and Croft describe and compare a number of techniques of query expansion, for example, in "Query Expansion using Local and Global Document Analysis," published in Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1996), which is incorporated herein by reference. U.S. Pat. Nos. 4,823,306 and 5,987,457, whose disclosures are similarly incorporated herein by reference, also describe methods of query refinement in the context of text searching.
Web crawling can be used as a search technique to find pages having hyperlinks to or from a root site that is known to be relevant to the user's query. These linked pages are often relevant to the query, as well, even when they do not contain the exact search terms used in the query. The CLEVER crawler (www.almaden.ibm.com/cs/k53/clever.html) uses hypertext classification and topic distillation tools to focus its work within a specific topic domain, while ignoring unrelated and irrelevant material. This focused crawler is described by Chakrabarti et al., in "Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery," published in Proceedings of the Eighth World Wide Web Conference (Toronto, 1999), and incorporated herein by reference.
Another system that combines Web search and crawling is Fetuccino-Alfredo, described by Ben-Shaul, et al., in "Adding Support for Dynamic and Focused Search with Fetuccino," also published in Proceedings of the Eighth World Wide Web Conference (Toronto, 1999), and incorporated herein by reference. In this system, users provide a broad domain in which the search should be performed, in addition to their specific query. Fetuccino-Alfredo first identifies sites related to the broad domain, using a general-purpose search engine, and then dynamically searches for the narrow query by traversing the domain sites and their close neighbors.
A number of techniques have been proposed for topic distillation, so that the most authoritative pages in a collection of linked pages can be identified. One such technique is described by Kleinberg in "Authoritative Sources in a Hyperlinked Environment," published in Proceedings of the Ninth ACM-SIAM Symposium on Discrete Algorithms (1998) and incorporated herein by reference. Aspects of this technique are also described in U.S. Pat. No. 5,884,305, whose disclosure is incorporated herein by reference, as well. Kleinberg proposes and tests an algorithmic formulation of the notion of "authority," based on the mutually reinforcing relationship between a set of relevant, authoritative pages and a set of "hub pages" that join them together in a link structure. The relationship is used to compute hub and authority scores for the nodes in a graph of linked pages, indicating which of the pages are the most authoritative.
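By way of illustration only, the mutually reinforcing hub and authority computation can be sketched as a simple iteration over a link graph. The function name `hits`, the dictionary-based graph representation, and the fixed iteration count are assumptions made for this sketch and are not taken from the cited reference.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    `links` maps each page to the pages it points to (hypothetical format).
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}

        # Hub score of a page: sum of authority scores of the pages it links to.
        new_hub = {n: sum(auth[d] for d in links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}

    return hub, auth
```

In this sketch, a page pointed to by several strong hubs receives a high authority score, and a page pointing to several strong authorities receives a high hub score.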
Another technique of this sort is described by Lempel and Moran in "The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect," published in Proceedings of the Ninth World Wide Web Conference (Amsterdam, 2000), and incorporated herein by reference. SALSA examines random walks on graphs derived from the link structure of a collection of Web pages. The authors show that their approach uses the same meta-algorithm as does Kleinberg but is more efficient and, in some cases, more effective in identifying the meaningful authorities.
In preferred embodiments of the present invention, knowledge agents with domain specialization enable users to apply precise, "one-shot" searching on the Web. There is no need for the user to be connected to the Internet or to interact with the search engine during the search process. This capability is especially important for users of pervasive devices, but is also useful to users of desktop computers and workstations. The knowledge agent receives the user's query and carries out the search by simulating the steps involved in the conventional interactive search process. The user can thus disconnect while the agent is searching and can receive the search results the next time he or she connects to the Internet or by e-mail.
Each knowledge agent specializes in a domain by extracting relevant information every time it performs a search. It uses the information to improve the precision of subsequent search efforts. To this end, the knowledge agent automatically maintains a knowledge base, which stores this information persistently. The knowledge base comprises a set of leading sites in its domain and a repository of terms that appear in these sites, including their lexical affinities. The knowledge base is preferably initialized by providing a set of sites relevant to the domain of interest. Then, after each search, the knowledge agent evaluates the search results and, as appropriate, adds to the knowledge base new pages that were found in the search to be highly relevant, possibly taking the place of old pages with lower utility.
In terms of user interaction, the knowledge agent acts as an intermediary between the user and one or more Web search engines, preferably managing the entire search process for the user. For each search, the user chooses the knowledge agent that has the relevant specialization, typically a knowledge agent that the user has initialized and used in previous searches. Alternatively, the knowledge agent may be imported from another user or from a repository of agents available to the public. Preferably, the knowledge agent is imported simply by copying the agent's knowledge base. Thereafter, the user may keep and refine the knowledge agent for his or her own particular domain of interest.
Although domain-focused search engines and Web crawlers are known in the art, as described in the Background of the Invention, none of them makes use of persistent, acquired knowledge in a domain that is defined and then refined by a user, as do preferred embodiments of the present invention. This unique, focused knowledge base makes "one-shot" searching practical without user interaction. Deployment of the knowledge agent as a "front end" to existing search engines, together with the portability of personalized knowledge agents among different computers and different users, makes these embodiments of the present invention easy to use, particularly in the environment of pervasive devices.
In some preferred embodiments of the present invention, when the user submits a search query to the knowledge agent, the agent first refines the query based on its knowledge of the user's domain of interest. Optionally, the user has the opportunity to edit the refined query. The agent then passes the refined query to a number of search engines, most preferably based on the user's indicated preferences. The knowledge agent analyzes the initial search results and then retrieves additional pages pointing to and from these pages according to their relevance to the query and to the domain of interest. The knowledge agent applies a ranking algorithm to this expanded set of pages. Preferably, the algorithm takes into account textual affinity to the particular query and to the domain of interest, as well as topological information for finding the most "authoritative" pages. The ranked list of pages is returned to the user via e-mail or upon request, typically the next time the user initiates a communication with the agent. In addition, the knowledge agent updates its knowledge of the domain and of the user's interests based on this search, so as to refine the knowledge base for the next search.
Although preferred embodiments are described herein with reference to searching on the World Wide Web, it will be appreciated that the principles of the present invention are also applicable, mutatis mutandis, to searching in other large bodies of linked information.
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for searching a corpus of documents, including:
defining a knowledge domain;
identifying a set of reference documents in the corpus pertinent to the domain;
inputting a first query;
searching the corpus using the set of reference documents to find one or more of the documents in the corpus that contain information in the domain relevant to the first query; and
adding at least one of the found documents to the set of reference documents for use in searching the corpus for information in the domain relevant to a second, subsequent query.
Preferably, inputting the first query includes inputting one or more search terms, wherein searching the corpus includes finding lexical characteristics of terms in the reference documents and refining the search terms using the lexical characteristics. Additionally or alternatively, inputting the first query includes specifying one or more documents representative of the information to be found in the corpus.
Further preferably, searching the corpus includes searching the corpus to find the documents that contain the information relevant to the query and ranking the found documents by comparing them to the set of reference documents. Most preferably, ranking the found documents includes evaluating a textual resemblance between the found documents and the reference documents. Alternatively or additionally, ranking the found documents includes assessing links between the found documents and the reference documents. Further preferably, adding the at least one of the found documents includes adding at least the document having the highest ranking.
Preferably, adding the at least one of the found documents includes removing one of the documents from the set responsive to adding the at least one of the found documents. Most preferably, the method includes tracking a level of relevance of the reference documents to the queries, and removing the one of the documents includes removing one of the reference documents whose tracked level of relevance is low.
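Purely as an illustrative sketch, the maintenance of the set of reference documents described above might be modeled as a bounded collection keyed by tracked relevance. The class name `KnowledgeBase`, the `capacity` parameter, and the rule that a new document replaces the least relevant one only when it scores higher are assumptions of this sketch, not details specified in the text.

```python
class KnowledgeBase:
    """Bounded set of reference documents with a tracked relevance level."""

    def __init__(self, initial, capacity=100):
        # `initial` maps document identifiers (e.g. URLs) to relevance levels.
        self.capacity = capacity
        self.relevance = dict(initial)

    def add(self, doc, score):
        """Add a found document; return the evicted document, if any."""
        if doc in self.relevance:
            # Already a reference document: keep the better relevance level.
            self.relevance[doc] = max(self.relevance[doc], score)
            return None

        evicted = None
        if len(self.relevance) >= self.capacity:
            # Candidate for removal: the document whose tracked relevance is lowest.
            lowest = min(self.relevance, key=self.relevance.get)
            if self.relevance[lowest] >= score:
                return None  # new document is no better; set is left unchanged
            evicted = lowest
            del self.relevance[lowest]

        self.relevance[doc] = score
        return evicted
```

For example, with a capacity of two and reference documents scored 0.9 and 0.2, adding a document scored 0.5 removes the document scored 0.2, while a subsequent document scored 0.1 is not added.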
In a preferred embodiment, the corpus includes at least a part of the World Wide Web, and the documents include Web pages, and searching the corpus includes conveying the query to one or more Web search engines. Typically, inputting the first query includes receiving the query from a user of a pervasive device, and searching the corpus includes searching while the device is disconnected from the Web.
Preferably, identifying the set of reference documents includes opening one or more files of a knowledge base on a computer in which data regarding the reference documents are saved. In a preferred embodiment, identifying the set of reference documents includes identifying the set of documents used by a first user in searching the corpus, and opening the one or more files includes copying the files for use by a second user in searching the corpus for information in the domain.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a method for searching a corpus of documents containing terms, including:
defining a knowledge domain;
identifying a set of reference documents in the corpus pertinent to the domain;
finding lexical characteristics of the terms in the reference documents;
inputting a search query;
refining the search query using the lexical characteristics; and
searching the corpus to find information in the domain responsive to the refined query.
Preferably, finding the lexical characteristics includes finding lexical affinities among the terms, wherein the search query includes search terms, and wherein refining the search query includes adding to the search terms further terms found to have lexical affinity to the search terms.
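As a minimal sketch of this refinement step, lexical affinity is approximated below by counting how often two terms co-occur within a short window of words in the reference documents, and the query is expanded with the terms having the strongest affinity to the search terms. The window size, the tokenization rule, the absence of stop-word filtering, and the function names `lexical_affinities` and `expand_query` are all assumptions made for illustration.

```python
import re
from collections import Counter


def lexical_affinities(documents, window=5):
    """Count co-occurrences of distinct term pairs within `window` words."""
    pairs = Counter()
    for text in documents:
        words = re.findall(r"[a-z0-9]+", text.lower())
        for i, word in enumerate(words):
            for other in words[i + 1 : i + 1 + window]:
                if other != word:
                    pairs[frozenset((word, other))] += 1
    return pairs


def expand_query(terms, affinities, extra=2):
    """Append up to `extra` terms with the highest affinity to the query terms."""
    query = {t.lower() for t in terms}
    scores = Counter()
    for pair, count in affinities.items():
        if len(pair & query) == 1:
            (other,) = pair - query
            scores[other] += count
    # Sort by descending affinity, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(terms) + [term for term, _ in ranked[:extra]]
```

A practical implementation would likely filter common words and weight affinities by distance; these refinements are omitted here for brevity.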
There is also provided, in accordance with a preferred embodiment of the present invention, a method for searching a corpus of linked documents containing terms, including:
defining a knowledge domain;
identifying a set of reference documents in the corpus pertinent to the domain;
inputting a search query;
searching the corpus to find one or more of the documents in the corpus that contain information relevant to the query;
evaluating a textual resemblance between the found documents and the reference documents so as to assign respective textual scores to the found documents;
assessing links between the found documents and the reference documents so as to assign respective topological scores to the found documents; and
ranking the found documents with respect to their relevance to the domain responsive to the textual scores and the topological scores.
Preferably, evaluating the textual resemblance includes assessing, for each of a plurality of the terms in the found documents, a respective frequency of occurrence in the reference documents.
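One possible reading of this frequency-based textual score, offered only as a sketch, averages over the terms of a found document their relative frequency of occurrence in the reference documents. The averaging formula, the tokenization rule, and the function names `tokenize` and `textual_score` are assumptions; the text does not specify how the frequencies are combined.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric terms (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def textual_score(found_text, reference_texts):
    """Average relative frequency, in the reference documents, of the found document's terms."""
    ref_counts = Counter()
    for text in reference_texts:
        ref_counts.update(tokenize(text))
    total = sum(ref_counts.values())
    words = tokenize(found_text)
    if not total or not words:
        return 0.0
    return sum(ref_counts[w] / total for w in words) / len(words)
```

Under this sketch, a found document whose terms appear often in the reference documents scores higher than one whose terms never appear there, which scores zero.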
In a preferred embodiment, the documents include World Wide Web pages, and assessing the links includes generating a graph of the links between the pages and calculating authority weights of the nodes of the graph.
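The final ranking step, responsive to both the textual scores and the topological scores, could for instance be sketched as a weighted combination of the two after normalization. The linear combination, the `weight` parameter, and the function name `rank_documents` are assumptions of this sketch; the text states only that both kinds of score are taken into account.

```python
def rank_documents(textual, topological, weight=0.5):
    """Rank documents by a weighted sum of normalized textual and topological scores.

    `textual` and `topological` map document identifiers to non-negative scores
    (for example, term-frequency scores and authority weights, respectively).
    """

    def normalize(scores):
        top = max(scores.values(), default=0.0)
        return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

    t = normalize(textual)
    g = normalize(topological)
    combined = {
        doc: weight * t.get(doc, 0.0) + (1 - weight) * g.get(doc, 0.0)
        for doc in set(t) | set(g)
    }
    # Highest combined score first; ties are broken by identifier.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))
```

Setting `weight` to 1.0 ranks by textual resemblance alone, and setting it to 0.0 ranks by link structure alone.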
There is further provided, in accordance with a preferred embodiment of the present invention, apparatus for searching a corpus of documents, including:
a memory, adapted to store an identification of a set of reference documents in the corpus pertinent to a predefined knowledge domain; and
a search processor, which responsive to receiving a first query as input, is adapted to search the corpus using the set of reference documents to find one or more of the documents in the corpus that contain information in the domain relevant to the first query, and to add at least one of the found documents to the set of reference documents stored in the memory for use in searching the corpus for information in the domain relevant to a second, subsequent query.
There is moreover provided, in accordance with a preferred embodiment of the present invention, apparatus for searching a corpus of documents containing terms, including:
a memory, adapted to store an identification of a set of reference documents in the corpus pertinent to a predefined knowledge domain; and
a search processor, which is adapted to find lexical characteristics of the terms in the reference documents, and responsive to receiving a query as input, is adapted to refine the search query using the lexical characteristics and to search the corpus to find information in the domain responsive to the refined query.
There is furthermore provided, in accordance with a preferred embodiment of the present invention, apparatus for searching a corpus of linked documents containing terms, including:
a memory, adapted to store an identification of a set of reference documents in the corpus pertinent to a predefined knowledge domain; and
a search processor, which responsive to receiving a query as input, is adapted to search the corpus to find one or more of the documents in the corpus that contain information relevant to the query, to evaluate a textual resemblance between the found documents and the reference documents so as to assign respective textual scores to the found documents, to assess links between the found documents and the reference documents so as to assign respective topological scores to the found documents, and to rank the found documents with respect to their relevance to the domain responsive to the textual scores and the topological scores.
There is additionally provided, in accordance with a preferred embodiment of the present invention, a computer software product for searching a corpus of documents, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a definition of a knowledge domain and an identification of a set of reference documents in the corpus pertinent to the domain, and further cause the computer, responsive to a first query, to search the corpus using the set of reference documents to find one or more of the documents in the corpus that contain information in the domain relevant to the first query, and to add at least one of the found documents to the set of reference documents for use in searching the corpus for information in the domain relevant to a second, subsequent query.
There is also provided, in accordance with a preferred embodiment of the present invention, a computer software product for searching a corpus of documents, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a definition of a knowledge domain and an identification of a set of reference documents in the corpus pertinent to the domain and to find lexical characteristics of the terms in the reference documents, and further cause the computer, responsive to a query, to refine the search query using the lexical characteristics and to search the corpus to find information in the domain responsive to the refined query.
There is further provided, in accordance with a preferred embodiment of the present invention, a computer software product for searching a corpus of documents, the product including a computer-readable medium in which program instructions are stored, which instructions, when read by a computer, cause the computer to receive a definition of a knowledge domain and an identification of a set of reference documents in the corpus pertinent to the domain, and further cause the computer, responsive to a query, to search the corpus to find one or more of the documents in the corpus that contain information relevant to the query, to evaluate a textual resemblance between the found documents and the reference documents to assign respective textual scores to the found documents, to assess links between the found documents and the reference documents to assign respective topological scores to the found documents, and to rank the found documents with respect to their relevance to the domain responsive to the textual scores and the topological scores.
The present invention will be more fully understood from the following detailed description of the preferred embodiments thereof, taken together with the drawings in which: