Although the Internet, or World Wide Web (WWW), contains a large number of websites, users are often interested only in information on specific web pages of certain websites. For example, students, professionals, and educators may want to easily find educational materials, such as online courses from a particular university. The marketing department of an enterprise may want to know customers' evaluations, how its products compare with those of its competitors, and other relevant product information. Accordingly, various search engines are available for specific websites.
One approach to discovering domain-specific information is to crawl all of the web pages of a website and use a classification tool to identify the desired or “target” web pages. Such an approach is feasible only with a large amount of computing resources, or if the website has only a few web pages. A more efficient way to discover domain-specific information is known as focused crawling, in which pages judged more likely to lead to target pages are crawled first. One challenge of implementing efficient focused crawling is to determine the likelihood that a page may quickly lead to target pages.
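The idea of crawling by priority can be illustrated with a minimal sketch. It assumes a small in-memory link graph in place of a real website, and the names `focused_crawl`, `is_target`, `priority`, and `budget` are hypothetical, introduced only for illustration. The priority function here is a simple stand-in; choosing a good one is exactly the challenge described above.

```python
import heapq

def focused_crawl(graph, seed, is_target, priority, budget):
    """Visit at most `budget` pages, highest priority first.

    Returns the target pages found, in the order they were visited.
    """
    # heapq is a min-heap, so priorities are negated to pop the largest first.
    frontier = [(-priority(seed), seed)]
    seen = {seed}
    found = []
    visited = 0
    while frontier and visited < budget:
        _, page = heapq.heappop(frontier)
        visited += 1
        if is_target(page):
            found.append(page)
        for child in graph.get(page, []):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-priority(child), child))
    return found

# Toy website (an assumption): the targets are the individual course pages.
graph = {
    "home": ["news", "courses"],
    "news": ["news/1", "news/2"],
    "courses": ["courses/cs101", "courses/cs102"],
}
is_target = lambda page: page.startswith("courses/")
priority = lambda page: 1.0 if "courses" in page else 0.0

found = focused_crawl(graph, "home", is_target, priority, budget=4)
```

With a budget of four page visits, the sketch reaches both course pages without visiting any news page, whereas an exhaustive crawl of this toy site would visit all seven pages.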
Two well-known page-ranking algorithms are the HITS algorithm and variations of the PageRank algorithm, such as Personalized PageRank (PPR) and Dynamic Personalized PageRank (DPPR). These algorithms rank pages according to topic relevance or personal interests. Presumably, these algorithms may be used in focused crawling, i.e., by setting the crawling priority of a page according to the score computed by HITS or DPPR. However, each of these algorithms has deficiencies.
In the PageRank algorithm, a web page receives a higher rank if the web pages that link to it have higher ranks. PPR is similar but additionally takes page relevance into account. The rank computed by PPR indicates the relevance of a web page to a certain topic, but it is not a good measure of the “connectedness” of a web page to target pages. For example, a terminal page (a web page with no outgoing links) may have a very high rank, but it does not lead to any other pages. In addition, PageRank and its variations calculate an aggregated score, which is inappropriate for focused crawling. For example, consider two web pages A and B, where web page A links to three target pages and three non-target pages, and web page B links only to three target pages. If the rank is calculated according to the PageRank model, web page A will receive a higher rank than web page B. However, from the perspective of crawling, web page B should be ranked higher because it is “purer” than A: every link from B leads to a target page.
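The contrast between an aggregated score and a “purity” score for pages A and B can be made concrete with a short sketch. This is not the PageRank computation itself; it assumes, purely for illustration, that each linked target page contributes a relevance of 1.0 and each linked non-target page a smaller relevance of 0.25, and the function names are hypothetical.

```python
# Assumed relevance values for the pages that A and B link to.
TARGET, NON_TARGET = 1.0, 0.25

links_from_a = [TARGET, TARGET, TARGET, NON_TARGET, NON_TARGET, NON_TARGET]
links_from_b = [TARGET, TARGET, TARGET]

def aggregated_score(link_scores):
    """Sum the contributions of all linked pages, as an aggregated score does."""
    return sum(link_scores)

def purity_score(link_scores):
    """Average contribution per link: how reliably a link leads to a target."""
    return sum(link_scores) / len(link_scores)

agg_a, agg_b = aggregated_score(links_from_a), aggregated_score(links_from_b)
pure_a, pure_b = purity_score(links_from_a), purity_score(links_from_b)

print(agg_a, agg_b)    # 3.75 3.0   -> aggregation ranks A above B
print(pure_a, pure_b)  # 0.625 1.0  -> purity ranks B above A
```

Under these assumed values the aggregated score favors A only because A has more links, while the per-link average favors B, which matches the crawling intuition in the example above.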
In addition, PPR and DPPR are one-directional (from ancestors to offspring) score propagation algorithms. Hence, it is hard for them to identify hub pages. However, hub pages are often very useful in focused crawling because they are the pages most likely to lead to target pages.
The HITS algorithm, on the other hand, is a two-directional (between ancestors and offspring) score propagation algorithm, and it can be used to identify both hubs and authorities on certain topics. Intuitively, hubs are the web pages that should be identified and explored in focused crawling. However, the HITS algorithm has a problem similar to that of PageRank in that it calculates aggregated scores. In addition, in the HITS algorithm, the target pages are used as the “seed” to form a sub-structure surrounding them, and the scores are computed only for the nodes in that sub-structure. In focused crawling, a score should be computed for every page, including pages that are far away from target pages. Accordingly, the HITS algorithm does not work well in such a case.
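The two-directional propagation can be sketched as follows. This is a minimal version of the standard HITS iteration on a toy link graph, not the sub-structure construction described above; the graph and the page names are assumptions chosen for illustration.

```python
def hits(graph, iterations=20):
    """Return (hub, authority) score dicts for a graph of page -> linked pages."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for page, targets in graph.items():
            for t in targets:
                auth[t] += hub[page]
        norm = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub update: sum the authority scores of the pages linked to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        norm = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Toy graph (an assumption): "index" links to three pages, "misc" to one.
graph = {"index": ["t1", "t2", "t3"], "misc": ["t1"]}
hub, auth = hits(graph)
best_hub = max(hub, key=hub.get)         # "index"
best_authority = max(auth, key=auth.get)  # "t1"
```

Both updates are sums over links, which is the aggregation noted above: a page's hub score grows with the number of pages it links to, regardless of what fraction of those links lead to target pages.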