1. Field of the Invention
The present invention relates generally to searching wide area computer networks for information, and more particularly to searching the World Wide Web for topical information.
2. Description of the Related Art
The system known as the “World Wide Web”, or simply “Web”, as implemented by the wide area computer network known as the Internet, contains a vast amount of information in the form of Web pages. Each Web page is electronically stored in a respective Web site on a computer, referred to as a Web server, with the Web itself including many Web servers that are interconnected by means of the Internet. A person can connect a computer to the Internet via, e.g., a telephone line, and thereby electronically access the Web pages on the Web servers.
As the Web has grown, many millions of Web pages have been created. In other words, the Web contains a vast amount of information, and the content of the Web grows and changes minute by minute. It will accordingly be appreciated that some means must be provided for a person to sort through the vast quantities of Web pages to find a particular item of interest.
With the above consideration in mind, most users employ software known as Web browsers when accessing the Web. To search the Web for a particular topic of information, the user causes their Web browser to access a Web site of a centralized search engine that is maintained by a search company. Examples of currently popular search engines are Alta Vista™ and Hotbot™.
Centralized search engines use software referred to as “crawlers” to continuously access Web pages and construct a centralized keyword index. When a person wishes to retrieve information, the person's browser accesses a centralized search engine using a query, for example, “luxury cars”. In response, software at the centralized engine accesses its index to retrieve names of Web sites considered by the search engine to be appropriate sources for the sought-after information. The search engine transmits to the browser hyperlinks to the retrieved sites, along with brief summaries of each site, with the browser presenting the information to the user. The user can then select the site or sites they want by causing the browser to access the site or sites.
Owing to the burgeoning of the Web and the ever-growing amount of its information, and the fact that the above-described centralized crawler schemes posture themselves to respond to any possible query (i.e., to be all things to all people), centralized crawler/searchers require large investments in hardware and software and must never cease crawling the Web, to index new pages and to periodically revisit old pages that might have changed. Indeed, one Web search company currently requires the use of 16 of the most powerful computers made by a major computer manufacturer, each computer having 8 gigabytes of memory. Another search company currently uses a cluster of 300 powerful workstations and over one terabyte of memory to crawl over 10 million Web pages per day. Despite these heroic efforts, however, it is estimated that a single search company is able to index only 30%-40% of the Web, owing to the size of the Web which, incidentally, shows no signs of slowing its rate of expansion (currently at about one million new pages per day).
Accordingly, one problem with current technology that is recognized and addressed by the present invention is the need to reduce the vast amount of Web search hardware and software that is inherently required by a centralized search scheme.
Additionally, evaluating whether a particular Web page contains relevant information with respect to a user query is sometimes difficult. Moreover, user queries may not be effectively articulated, or they may be overbroad. Consequently, a Web search engine frequently responds to a query by returning a large number of Web pages that are of little or no interest to the requester. Nonetheless, a user must laboriously sort through hundreds and perhaps thousands of returned Web pages, which, as discussed above, can be considered to represent only 30%-40% of the total Web content in any case. Moreover, because a centralized crawler seeks the capability to respond to any query, most of the index of any single centralized system contains information that is of little or no value to any single user or indeed to any single interrelated group of users.
Thus, two other problems recognized and addressed by the present invention are the lack of focus of search results, and the fact that centralized crawlers are not tailored to any particular user or to any particular interrelated group of users and, thus, contain mostly irrelevant information, from the point of view of a single user or group of users.
In addition to the above considerations, the present invention recognizes that many if not most Web pages refer to other Web pages by means of hyperlinks, which a user can select to move from a referring Web page to a referred-to Web page. The present invention further recognizes that such hyperlinks are more than simply navigation tools; they are important tools for relevant data acquisition as well.
It happens that with the existing Web communication protocol (hypertext transfer protocol, or “http”), when a user clicks on a hyperlink to a referred-to Web page v from a referring Web page u, the user's browser sends the identity of the referring Web page u to the Web server that hosts the referred-to Web page v, and this information can be recorded or logged. Unfortunately, current logs of which Web pages refer to which other Web pages are mostly unused and indeed mostly not enabled by Web site managers, and the logs moreover consume a relatively large amount of electronic data storage space. Also, no standard way exists for a remote user to access and use the information in the logs.
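The referring-page information described above can be illustrated with a minimal sketch. It assumes the server writes its access log in the widely used Combined Log Format, in which the HTTP `Referer` value appears as a quoted field after the status code and byte count; the function name `inlinks_from_log` and the sample log line are hypothetical and are not part of the disclosure.

```python
import re
from collections import defaultdict

# Assumed layout (Combined Log Format):
#   host ident user [time] "METHOD target PROTOCOL" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'"(?:GET|POST|HEAD) (?P<target>\S+) [^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def inlinks_from_log(lines):
    """Map each referred-to page v to the set of referring pages u seen in the log."""
    inlinks = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # line is not in the assumed format
        referer = match.group("referer")
        if referer and referer != "-":  # "-" means no referrer was sent
            inlinks[match.group("target")].add(referer)
    return dict(inlinks)


SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/1999:13:55:36 -0700] "GET /v.html HTTP/1.0" '
    '200 2326 "http://example.org/u.html" "Mozilla/4.08"',
    '192.0.2.2 - - [10/Oct/1999:13:56:01 -0700] "GET /v.html HTTP/1.0" '
    '200 2326 "-" "Mozilla/4.08"',
]
```

Calling `inlinks_from_log(SAMPLE_LOG)` yields one entry for `/v.html` whose only referring page is `http://example.org/u.html`; the second request carries no referrer and is skipped.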
The present invention, however, recognizes the above-noted problem and addresses how to exploit this currently unused but potentially valuable information in the context of resolving the unfocussed, centralized crawling problems noted above.
The invention is a general purpose computer programmed according to the inventive steps herein to generate a database of Web pages that is focussed on a predefined topic or topics, for subsequent efficient searching of the database by users. The invention can also be embodied as an article of manufacture (a machine component) that is used by a digital processing apparatus and which tangibly embodies a program of instructions that are executable by the digital processing apparatus to generate the focussed database. This invention is realized in a critical machine component that causes a digital processing apparatus to undertake the inventive logic herein.
In accordance with the present invention, the computer includes computer readable code means for receiving a seed set of Web pages in a crawl database, with the seed set being representative of at least one topic. The computer also includes computer readable code means for identifying outlink Web pages from one or more Web pages in the crawl database, and computer readable code means for evaluating the outlink Web pages for relevance to the topic. Further, the computer includes computer readable code means for causing outlinks only of Web pages that are evaluated as being relevant to the topic to be stored in the crawl database, for subsequent evaluation.
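The flow just described (receive a seed set, identify outlinks, evaluate relevance, and store outlinks only of relevant pages) can be sketched as follows. This is an illustrative sketch only, not the disclosed implementation: the in-memory `TOY_WEB`, the keyword-based `toy_is_relevant` test, and all function names are assumptions introduced here, and a real system would fetch pages over the network and use a trained relevance evaluator.

```python
from collections import deque


def focused_crawl(seed_urls, fetch_page, is_relevant, max_pages=100):
    """Grow a crawl database from a seed set, following outlinks of relevant pages only."""
    crawl_db = {url: {"visited": False, "relevant": None} for url in seed_urls}
    frontier = deque(seed_urls)
    visited_count = 0
    while frontier and visited_count < max_pages:
        url = frontier.popleft()
        entry = crawl_db[url]
        if entry["visited"]:
            continue
        text, outlinks = fetch_page(url)
        entry["visited"] = True
        entry["relevant"] = is_relevant(text)
        visited_count += 1
        if not entry["relevant"]:
            continue  # outlinks of an irrelevant page are not stored
        for link in outlinks:
            if link not in crawl_db:
                crawl_db[link] = {"visited": False, "relevant": None}
                frontier.append(link)
    return crawl_db


# Hypothetical toy Web: url -> (page text, outlinks).
TOY_WEB = {
    "a": ("luxury cars review", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("luxury cars dealers", ["e"]),
    "d": ("luxury cars forum", []),
    "e": ("guide to luxury cars", []),
}


def toy_fetch(url):
    return TOY_WEB[url]


def toy_is_relevant(text):
    # Stand-in for the relevance evaluation; assumes a simple keyword test.
    return "luxury" in text and "cars" in text
```

With the seed set `["a"]`, the crawl stores pages a, b, c, and e. Page d is never stored, even though its text is on topic, because it is reachable only through the irrelevant page b.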
In a preferred embodiment, computer readable code means assign a revisitation priority to a Web page, based on the means for evaluating. To embody the code means, at least one “watchdog” module is provided, and the watchdog module periodically determines new and old pages to consider. The new pages are selected from the outlink Web pages, preferably those that have not yet been visited by the present logic, and the old pages are selected from pages in the crawl database. One or more worker modules respond to the watchdog module to access the new and old pages to consider. In one embodiment, in “considering” a page the present logic fetches and analyzes the page. As disclosed in detail below, each Web page is associated with a respective field (referred to in the preferred embodiment as the “Num_Tries” field) that represents the number of times the respective page has been accessed, and the Num_Tries field of a Web page is incremented each time the Web page is considered.
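The division of labor between the watchdog module and the worker modules might be sketched as below. The sketch is single-threaded for clarity and rests on several assumptions that the text above does not state: new pages are identified by a `num_tries` value of zero, new pages are scheduled ahead of old pages, and old pages are ordered by their last relevance score. The names `PageRecord`, `watchdog_select`, and `worker_consider` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageRecord:
    url: str
    relevance: float = 0.0  # result of the most recent evaluation
    num_tries: int = 0      # counterpart of the Num_Tries field


def watchdog_select(crawl_db, batch_size):
    """Choose the next pages to consider: unvisited pages, then old pages by relevance."""
    new_pages = [p for p in crawl_db.values() if p.num_tries == 0]
    old_pages = sorted(
        (p for p in crawl_db.values() if p.num_tries > 0),
        key=lambda p: p.relevance,
        reverse=True,  # assumed revisitation priority: most relevant first
    )
    return (new_pages + old_pages)[:batch_size]


def worker_consider(page, fetch_and_score):
    """Consider one page: fetch and analyze it, then increment its try count."""
    page.relevance = fetch_and_score(page.url)
    page.num_tries += 1
```

In a multi-threaded arrangement, the watchdog would run `watchdog_select` periodically and hand each selected record to a worker thread that calls `worker_consider`.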
Additionally, the preferred worker module includes means for determining whether a gathering rate of relevant pages is below a “panic” threshold. In one embodiment, the computer includes computer readable code means for considering all outlinks and inlinks of a Web page in the crawl database when the gathering rate is at or below the panic threshold. When the gathering rate is at or below the threshold, the scope of the topic can be increased to an expanded scope.
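A minimal sketch of the panic test follows. It assumes the gathering rate is measured as the fraction of recently considered pages that were evaluated as relevant; the passage above does not define the rate, so this definition, the page dictionary layout, and the function names are assumptions for illustration.

```python
def gathering_rate(recent_relevance_flags):
    """Fraction of recently considered pages that were relevant (assumed definition)."""
    if not recent_relevance_flags:
        raise ValueError("at least one considered page is required")
    return sum(recent_relevance_flags) / len(recent_relevance_flags)


def links_to_consider(page, rate, panic_threshold):
    """Return the links of a page to consider next, widening the set in panic mode."""
    if rate <= panic_threshold:
        # Panic mode: consider all outlinks and inlinks, regardless of relevance.
        return sorted(set(page["outlinks"]) | set(page["inlinks"]))
    # Normal mode: only outlinks of a relevant page are considered.
    return list(page["outlinks"]) if page["relevant"] else []
```

For example, with one relevant page among the last four considered, the rate is 0.25; against a threshold of 0.25 the crawler is in panic mode and would consider both the inlinks and the outlinks of the page.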
In another aspect, a computer system for focussed searching of the World Wide Web includes a computer that in turn includes a watchdog module for scheduling worker thread work and at least one worker module for undertaking the work. The work to be undertaken includes adding outlinks of a Web page to a crawl database, but only when the Web page is relevant to a predefined topic or topics. The crawl database is accessible by the computer and is focussed only on the topic such that the system includes no topically comprehensive database of the World Wide Web.
In still another aspect, a computer-implemented method is disclosed for building a focussed database of Web pages. The method includes receiving a search query from a user, and in response to the search query, accessing a crawl database containing only information pertaining to Web pages related to a limited number of predefined topics. A computer program product executing the method steps is also disclosed.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts, and in which: