The World Wide Web (WWW) comprises an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web documents. Users navigate these web documents by means of computer software programs commonly known as Internet browsers. Due to the vast number of WWW sites, many web documents contain redundant information or share a strong likeness in either function or title. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve information or to locate businesses. These search engines use various means to determine the relevance of a user-defined search to the information retrieved.
A typical search engine has an interface with a search window where the user enters an alphanumeric search expression or keywords. The search engine sifts through available web sites for the user's search terms, and returns the search results in the form of HTML documents. Each search result comprises a list of individual entries that have been identified by the search engine as satisfying the user's search expression. Each entry or “hit” may comprise a hyperlink that points to a Uniform Resource Locator (URL) location or web document. Examples of currently popular search engines are Google and Alta Vista.
The authors of web documents provide information known as metadata within the body of the hypertext markup language (HTML) document that defines the web documents. Centralized search engines use software referred to as “web crawlers” or “crawlers” to continuously access web documents and construct a centralized keyword index. The crawler systematically accesses web documents by sequentially following hypertext links, or out-links, from document to document. The crawler indexes the web documents for use by the search engines using information about a web document as provided by its address or Uniform Resource Locator (URL), metadata, and other criteria found within the web document. The crawler is run periodically to update previously stored data and to append information about newly created web documents. The information compiled by the crawler is stored in a metadata repository or database. The search engines search this repository to identify matches for the user-defined search rather than attempt to find matches in real time.
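By way of illustration only, the crawl-and-index loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular search engine: the in-memory TOY_WEB dictionary stands in for real web documents, and the names TOY_WEB and crawl are hypothetical. The sketch follows out-links from document to document and records, for each keyword, the URLs of the documents that contain it.

```python
from collections import deque

# Hypothetical in-memory stand-in for the Web: URL -> (document text, out-links).
TOY_WEB = {
    "http://a.example/": ("luxury cars and sedans",
                          ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("used cars for sale", ["http://a.example/"]),
    "http://c.example/": ("luxury hotels", []),
}


def crawl(seed, web):
    """Follow out-links breadth-first from the seed and build a keyword index."""
    index = {}          # keyword -> set of URLs whose text contains the keyword
    visited = set()     # URLs already accessed, so no document is indexed twice
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in visited or url not in web:
            continue
        visited.add(url)
        text, out_links = web[url]
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)
        queue.extend(out_links)   # schedule the documents this one links to
    return index


index = crawl("http://a.example/", TOY_WEB)
```

In this sketch the index plays the role of the repository: a later search consults the index rather than accessing the documents again.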
When a person wishes to retrieve information, the person's browser accesses a centralized search engine using a query, for example, “luxury cars”. In response, software at the centralized engine accesses its index to retrieve names of Web sites considered by the search engine to be appropriate sources for the sought-after information. The search engine transmits to the browser hyperlinks to the retrieved sites, along with brief summaries of each site, with the browser presenting the information to the user. The user can then select the site or sites they want by causing the browser to access the site or sites.
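The query step described above can likewise be sketched in a few lines. The index contents, the summaries, and the names INDEX, SUMMARIES, and answer_query are all assumptions made for illustration; a real engine would also rank its results, which this sketch does not attempt.

```python
# Hypothetical prebuilt index: keyword -> set of site URLs.
INDEX = {
    "luxury": {"http://cars.example/", "http://hotels.example/"},
    "cars": {"http://cars.example/", "http://used.example/"},
}

# Hypothetical brief summaries returned alongside each hyperlink.
SUMMARIES = {
    "http://cars.example/": "New luxury cars",
    "http://hotels.example/": "Luxury hotels worldwide",
    "http://used.example/": "Used cars for sale",
}


def answer_query(query):
    """Return (hyperlink, summary) pairs for sites matching every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # A site is a hit only if it is indexed under all of the query terms.
    matches = set.intersection(*(INDEX.get(term, set()) for term in terms))
    return [(url, SUMMARIES.get(url, "")) for url in sorted(matches)]


hits = answer_query("luxury cars")
```

Here the query “luxury cars” yields only the one site indexed under both terms; the browser would present that hyperlink and its summary to the user.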
Owing to the burgeoning of the Web and the ever-growing amount of its information, centralized crawler/searchers require large investments in hardware and software and must never cease crawling the Web to index new web documents and to periodically revisit old web documents that might have changed. One Web search company currently requires the use of 16 of the most powerful computers made by a major computer manufacturer, each computer having 8 gigabytes of memory. Another search company currently uses a cluster of 300 powerful workstations and over one terabyte of memory to crawl over 10 million Web documents per day.
Despite the resources expended, it is estimated that a single search company is able to index only 30%-40% of the Web, owing to the size and rate of expansion of the Web. Further, the Web shows no signs of slowing its rate of expansion, which is currently at about one million new web documents per day. In addition to the cost of equipment, a conventional crawler wastes bandwidth in a search that locates documents of any type. Bandwidth is very expensive. Further, the equipment used by the crawler has limitations in storage capacity and speed. Crawling every web document regardless of usefulness or interest reduces the efficiency of the crawler and increases the cost of operating the crawler.
Additionally, evaluating whether a particular Web document contains relevant information with respect to a user query is sometimes difficult. Moreover, user queries may not be effectively articulated, or they may be overbroad. Consequently, a Web search engine frequently responds to a query by returning a large number of Web documents that are of little or no interest to the requester. The user may then have to sort laboriously through hundreds and perhaps thousands of returned Web documents, which, as discussed above, can be considered to represent only 30%-40% of the total Web content in any case. Moreover, because a centralized crawler seeks the capability to respond to any query, most of the index of any single centralized system contains information that is of little or no value to any single user or indeed to any single interrelated group of users.
One alternative to a centralized crawler is a focused crawler. A focused crawler crawls the Web searching for documents and pages that match the focus topic. Although this technology has proven to be useful, further improvements are desirable. In particular, the conventional focused crawler focuses on only one topic.
For a search engine to crawl the Web for multiple focus topics, multiple instances of the focused crawler must be run. For example, suppose a search engine runs focused crawlers for the topics petroleum, music, and technology. Three focused crawlers then crawl the Web, each searching for documents that match its own focus criteria. This approach requires adequate administration and manpower to manage those three focused crawlers. In addition, even though these topics seem very different, they may still have some pages or documents in common. For example, each of the focused crawlers may crawl a news website seeking web documents that relate to its topic. This implies that the search engine is searching the same news website three times each day (once for each focused crawler), for example, searching for out-links of interest to each focused crawler.
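The redundancy in this example can be made concrete with a small sketch. Everything in it is assumed for illustration: the keyword sets in TOPICS, the four-page NEWS_SITE, and the function names are hypothetical, and topic matching is reduced to a simple keyword test. The second function, which accesses each page once and tests it against every topic, is included only as a point of comparison for counting accesses; it is not presented as the method of any particular system.

```python
# Hypothetical focus topics, each reduced to a small keyword set.
TOPICS = {
    "petroleum": {"oil", "petroleum", "refinery"},
    "music": {"music", "concert", "album"},
    "technology": {"software", "chip", "technology"},
}

# Hypothetical news website: URL -> page text.
NEWS_SITE = {
    "http://news.example/1": "oil prices rise as refinery output falls",
    "http://news.example/2": "new album tops the music charts",
    "http://news.example/3": "chip maker unveils software tools",
    "http://news.example/4": "local weather report",
}


def matches(text, keywords):
    """Crude topic test: does any word of the page belong to the topic?"""
    return any(word in keywords for word in text.split())


def separate_crawlers(site, topics):
    """One focused crawler per topic; each accesses every page on its own."""
    fetches = 0
    hits = {name: [] for name in topics}
    for name, keywords in topics.items():
        for url, text in site.items():
            fetches += 1                      # this crawler's own access
            if matches(text, keywords):
                hits[name].append(url)
    return fetches, hits


def shared_pass(site, topics):
    """Each page is accessed once and tested against every topic."""
    fetches = 0
    hits = {name: [] for name in topics}
    for url, text in site.items():
        fetches += 1                          # single access per page
        for name, keywords in topics.items():
            if matches(text, keywords):
                hits[name].append(url)
    return fetches, hits


separate_fetches, separate_hits = separate_crawlers(NEWS_SITE, TOPICS)
shared_fetches, shared_hits = shared_pass(NEWS_SITE, TOPICS)
```

With three topics and four pages, the separate crawlers make twelve page accesses where a single pass makes four, while both identify the same pages for each topic.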
Searching the same website once for each focused crawler wastes resources for both the search engine and the web site being crawled. This issue is magnified when a search engine operates many hundreds of focused crawlers. What is therefore needed is a system, a service, a computer program product, and an associated method for a focused crawler that can manage multiple focus topics while crawling the Web, minimizing the number of times a web document is crawled and making the most efficient use of the computing and bandwidth resources of the search engine. The need for such a solution has heretofore remained unsatisfied.