Existing general-purpose search engines provide valuable assistance to users in locating information relevant to their needs on the World Wide Web. However, they are unsatisfactory when users try to find timely information for a narrow query in a specific domain. It is estimated that only 30-40% of Web pages are collected and indexed by even the largest crawls, and a complete refresh takes from several weeks to a month, so much up-to-date information falls outside the search scope. Another drawback of general search engines is that building a content index, while it enables fast searching, discards much of the information contained in the Web pages.
“Focused crawling” is recognized as a promising solution for satisfying the above search requirements. It can collect useful information with very limited resources; for example, users are already running PC-based “focused crawling” implementations. It can also exploit the plentiful information hidden in the original web pages, as well as the web topology, to make more accurate judgments about relevance.
“Focused crawling” is an intelligent way to crawl the World Wide Web that collects only the web pages relevant to a specific information need. In particular, the “crawler” begins from a “seed” web page and visits other web pages by following the links in the “seed” web page. The “crawler” then follows the links in the visited web pages. As this process goes on, the number of candidate links, and of the web pages they point to, grows explosively. The challenge is how to make the “crawler” visit as many relevant web pages as possible, given that the total number of visited web pages is limited by time, network bandwidth and other resource restrictions. In the implementation of a “crawler”, the challenge boils down to deciding which of the unvisited web pages should be visited first.
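The prioritization problem described above can be illustrated with a minimal best-first sketch. It is not the claimed method: the function name `crawl`, the in-memory `links` mapping that stands in for fetching pages, and the caller-supplied `score` function are all assumptions made for illustration. The sketch only shows how a fixed visit budget forces a choice among unvisited candidates.

```python
import heapq

def crawl(seed, links, score, budget):
    """Visit at most `budget` pages, always taking the best-scored candidate next.

    `links` (hypothetical) maps a URL to the URLs it links to, replacing real
    network fetches. `score` (hypothetical) returns a relevance estimate for a
    URL; higher means more promising.
    """
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score(seed), seed)]
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(target), target))
    return visited
```

With a budget smaller than the number of reachable pages, the quality of `score` alone determines how many of the visited pages turn out to be relevant.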
Known ranking methods take advantage only of “local” information in a web page, such as the number of in-links, the keywords and their positions, etc. However, the paths of links leading to a page also contain valuable information for “focused crawling”. For example, research projects on artificial intelligence can usually be found along a path like “University homepage”-“Academies”-“College (school, department) of Computer Science”-“Research Areas”. People share a similar knowledge structure and copy that structure when building web sites, which produces similar patterns. This invention meets the challenge with a novel way to rank candidate web pages (represented by the URLs pointing to them) based on their paths, so that optimal crawling is achieved.
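One possible reading of path-based ranking is sketched below. The text above does not specify a scoring rule, so the ordered keyword matching used here, the names `path_score` and `rank_candidates`, and the example pattern and candidate labels are all assumptions made for illustration.

```python
def path_score(path, pattern):
    """Return the fraction of `pattern` steps matched, in order, along `path`.

    `path` is the list of page labels leading to a candidate; `pattern` is a
    list of lowercase keywords describing a typical path to relevant pages.
    Both representations are hypothetical.
    """
    matched = 0
    for label in path:
        if matched < len(pattern) and pattern[matched] in label.lower():
            matched += 1
    return matched / len(pattern)

def rank_candidates(candidates, pattern):
    """Order candidate URLs so that the best-matching paths come first."""
    return sorted(candidates,
                  key=lambda url: path_score(candidates[url], pattern),
                  reverse=True)

# Hypothetical pattern derived from the university example in the text.
pattern = ["university", "academ", "computer science", "research"]
candidates = {
    "sports-page": ["University homepage", "Campus news", "Sports"],
    "cs-dept-page": ["University homepage", "Academies",
                     "College of Computer Science"],
}
ranking = rank_candidates(candidates, pattern)
```

Here the candidate reached through the department pages ranks ahead of the one reached through campus news, even if the two candidate pages themselves look alike in terms of “local” information.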