1. Field of the Invention
The present invention is generally related to locating information in search spaces comprising a plurality of documents, and more particularly to a focused search engine employing a method of focused crawling which incorporates both topic distillation and site distillation techniques.
2. Description of the Related Art
The rapid growth of the World Wide Web (the Web) poses a significant scalability problem for general purpose search engines; using conventional searching techniques, the typical search engine is required to collect a very large number of Web pages in order to perform indexing and classification during the process of locating pages which are relevant to the search. Focused searching, or “crawling,” is a relatively new approach designed to address the scalability issue and to provide higher quality search results, i.e., Web pages that are more relevant with respect to a request for information concerning a specific topic.
An example of crawling involves a request for information related to a broad topic, for example, “jogging,” focused on a more particular category. Relative to the broad topic of jogging, example queries or requests for information which may be issued to focused search engines might include, inter alia: a request for documents related to jogging in a sports category; a request for documents related to jogging in a health/fitness category; or a request for shopping sites which sell or specialize in merchandise related to jogging. Ideally, these three exemplary requests, each having a different fundamental objective, should return different results, given the different focus and purpose of each respective search.
One important shortcoming inherent in the methods of query processing employed by conventional search engines involves keyword frequency. Specifically, pages or documents containing keywords with a high term frequency (i.e. multiple occurrences of the same keyword) are routinely ranked higher in terms of relevance than other pages or documents which might potentially contain more relevant material, but which do not contain as many occurrences of the keyword. For example, in conventional searching techniques, a document containing five separate occurrences of a particular keyword is generally considered by the search engine to be more relevant than a document containing only three occurrences of that same keyword, even though the latter document may contain more insightful or more relevant information relative to the real objective of the search.
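The term-frequency bias described above can be illustrated with a minimal Python sketch. The function names and the two sample documents are hypothetical and are not taken from any particular search engine; actual engines typically apply more elaborate weighting.

```python
import re


def term_frequency(text, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def rank_by_term_frequency(docs, keyword):
    """Return document names ordered by descending keyword count."""
    return sorted(
        docs,
        key=lambda name: term_frequency(docs[name], keyword),
        reverse=True,
    )


# Hypothetical documents: "sale" repeats the keyword five times,
# while "guide" uses it three times but is arguably more informative.
docs = {
    "sale": "Jogging shoes! Jogging shorts! Jogging socks! Jogging hats! Jogging sale!",
    "guide": "A guide to jogging: how to start jogging safely and keep jogging without injury.",
}

ranking = rank_by_term_frequency(docs, "jogging")
```

Under this purely count-based ordering, the repetitive page is placed ahead of the more informative one, which is the shortcoming noted above.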
Another approach adopted by some conventional search techniques involves utilizing various methods of “link analysis” for the purpose of ranking pages according to relevance. Link analysis involves examining the hyperlinks which connect the various hyperlinked pages in the search space, and is based upon the theory that pages which contain similarly relevant material will be within a relatively small link radius. In other words, where a page has been identified as containing relevant material, these search methods seek additional pages which are linked to the known relevant page by exploring outgoing links and examining the pages that are accessible by those links. These methods are systematic and mechanical searches for keywords, however; neither the focus nor the context of the search is considered. As a consequence, many pages which are not related to the real objective of the search are, nevertheless, identified as relevant by such search methods.
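The exploration of outgoing links within a small link radius may be sketched as a bounded breadth-first traversal. This is a minimal illustration only: the adjacency-list representation of the hyperlink graph, the function name, and the `radius` parameter are assumptions made for the example and do not describe any specific link analysis method.

```python
from collections import deque


def pages_within_radius(links, start, radius):
    """Map each page reachable from start in at most `radius` hops
    to its hop count, following outgoing links only."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if distances[page] == radius:
            continue  # do not expand pages already at the radius limit
        for target in links.get(page, ()):
            if target not in distances:
                distances[target] = distances[page] + 1
                queue.append(target)
    return distances


# Hypothetical hyperlink graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}

nearby = pages_within_radius(links, "a", 2)
```

Note that the traversal considers link structure alone; nothing in it accounts for the focus or context of the search.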
There has been a continuing and growing need, therefore, for a focused search engine employing a method of focused searching which takes into consideration the hierarchical structure of Web pages while integrating techniques for both topic distillation and site distillation in crawling or focused searching. A method of focused searching, or a focused search engine employing such a method, should recognize different classifications of information and identify category-specific search terms which will assist in finding the most relevant documents related to an issued query or other request for information.
The method embodied in the focused search engine of the present invention addresses the foregoing and other shortcomings of conventional search engines by coordinating advanced topic distillation and site distillation technologies. In particular, the focused search methodology described herein is based upon hierarchically structured Web document classification categories; that is, a given Web document is typically categorized and indexed, to some extent, according to the subject matter addressed in its content, is linked to other documents or pages, for example, via hyperlinks, and is ordinarily located relatively “close” to other related documents containing similar information. The term “close” in this context refers to the relatively few hyperlinks required to navigate from one page to another page containing similar or related subject matter.
The various embodiments of a focused search engine may be related to two different schemes, each of which addresses an important aspect of focused crawling, namely, providing relevant information in usable form.
In accordance with one aspect of the present invention, for example, a focused search engine and method generally organize information responsive to a request or a query and present the results according to categories. By organizing search results in this manner, a focused search engine employing the inventive method described herein may provide categorized information; a search may easily be narrowed by selection of a particular category of interest from those categories recognized by the search engine.
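By way of illustration only, presenting results according to categories may be sketched as a simple grouping step. The (URL, category) pairs below are hypothetical, and the manner in which a category is assigned to each page is assumed to be handled elsewhere.

```python
from collections import defaultdict


def group_by_category(results):
    """Group (url, category) pairs into a dict of category -> list of urls,
    preserving the original order of results within each category."""
    grouped = defaultdict(list)
    for url, category in results:
        grouped[category].append(url)
    return dict(grouped)


# Hypothetical search results for the keyword "jogging".
results = [
    ("http://example.com/marathon-news", "sports"),
    ("http://example.com/aerobic-basics", "health/fitness"),
    ("http://example.com/running-shoes", "shopping"),
    ("http://example.com/heart-health", "health/fitness"),
]

by_category = group_by_category(results)
```

Narrowing the search then amounts to selecting a single key, for example `by_category["health/fitness"]`.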
In accordance with another aspect of the present invention, a focused search engine and method require or request the specification of explicit categories of interest as part of the original search. In other words, the category of interest which is the ultimate focus of the search may be determined or suggested by the system based upon the hierarchical structure of the Web classification system; alternatively, one or more categories of interest may be specified explicitly during formulation of the request for information.
Identifying the topic categories to be searched, or having those topic categories specified at the outset, enables a focused search engine and method to perform implicit query expansion; that is, category-specific keywords may be added, either by the system itself or by a user, such that the search engine may be able to distill (or to identify) documents or pages which are relevant both to the selected category and to the query keyword or the original request for information.
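Implicit query expansion of this kind may be sketched as follows. The keyword table is an illustrative assumption: the health/fitness terms follow the jogging example used in this description, while the remaining entries are hypothetical placeholders, and the function name is likewise invented for the example.

```python
# Illustrative category-specific keyword table (an assumption for this sketch).
CATEGORY_KEYWORDS = {
    "health/fitness": ["exercise", "aerobic"],
    "sports": ["race", "marathon"],      # hypothetical entries
    "shopping": ["shoes", "price"],      # hypothetical entries
}


def expand_query(keyword, category, category_keywords=CATEGORY_KEYWORDS):
    """Return the query keyword followed by the category-specific keywords
    for the chosen category; unknown categories add nothing."""
    expanded = [keyword]
    for term in category_keywords.get(category, []):
        if term not in expanded:
            expanded.append(term)
    return expanded


expanded = expand_query("jogging", "health/fitness")
```

The original keyword is always retained, so that a request naming an unrecognized category degrades to an ordinary keyword search.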
For example, when a search engine employing the inventive focused searching method is issued a request for information including a query keyword “jogging,” several broad topic categories may either be identified by the search engine or specified as part of the original request. These topic categories may include, for example, health/fitness, sports, or shopping sites hosted by footwear manufacturers or retailers. The focused search techniques of the present invention identify category-specific keywords related to each respective topic category; the presence or absence of these category-specific keywords in a particular document may be considered as a factor in determining relevance.
Where the search is for pages containing the keyword “jogging” in the health/fitness category, for example, the pages containing category-specific keywords, such as “exercise” or “aerobic” in this example, may be ranked higher than those pages which contain many occurrences of the keyword “jogging” outside of the health/fitness context. Implicit query expansion involves the identification and addition of these category-specific keywords.
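One possible way to treat the presence of category-specific keywords as a relevance factor is sketched below. The scoring formula, the `category_weight` parameter, and the sample pages are assumptions made for illustration; they are not the ranking method of the invention, which is described in detail later.

```python
import re


def tokenize(text):
    """Split text into a set of lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def category_aware_score(text, query_keyword, category_keywords, category_weight=2.0):
    """Score a page by presence of the query keyword plus a bonus for each
    category-specific keyword present; repeated occurrences add nothing."""
    tokens = tokenize(text)
    if query_keyword.lower() not in tokens:
        return 0.0
    present = sum(1 for kw in category_keywords if kw.lower() in tokens)
    return 1.0 + category_weight * present


# Hypothetical pages: one in the health/fitness context, one keyword-heavy.
fitness_page = "Jogging is an aerobic exercise that improves endurance."
repetitive_page = "jogging jogging jogging jogging jogging shoes on sale"
health_terms = ["exercise", "aerobic"]

fitness_score = category_aware_score(fitness_page, "jogging", health_terms)
repetitive_score = category_aware_score(repetitive_page, "jogging", health_terms)
```

Because the score depends on which keywords are present rather than on how often the query keyword is repeated, the page written in the health/fitness context outranks the page that merely repeats “jogging.”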
As will be described in detail below, the various embodiments of a focused search engine and method may be developed based upon hierarchically structured Web pages. Such an approach to focused Web crawling involves four key techniques: topic distillation on hierarchically structured Web pages; query processing which takes into consideration search or topic category specification; site distillation which takes into consideration focused topics; and integration of topic distillation and site distillation for focused crawling.