Aggregate statistical data about the World Wide Web (WWW or Web) are useful in numerous scenarios such as market research, intelligence gathering, and social studies. In many of these applications, one is interested not in generic data about the whole Web but rather in highly focused information pertinent to a specific domain or topic. Topical Web statistics are crucial for generating opinion polls about products, gathering market intelligence, tracking social networks, etc. Furthermore, timely acquisition of this information provides a competitive advantage, so timely reporting of such statistics is a requirement. Focused statistical data can be gathered by a brute force crawl of the whole Web, or by a “focused crawl” that collects mainly pages relevant to the topic of interest. Crawling, however, is an expensive enterprise requiring substantial resources.
One class of techniques for gathering topical statistical data about documents comprises focused crawling. One conventional focused crawling technique uses properties such as in-degree and anchor text keywords to guide a crawl towards relevant pages [Cho, J., et al., “Efficient Crawling Through URL Ordering”, Computer Networks and ISDN Systems, 30:161-172, 1998]. Another conventional focused crawling technique uses a semi-supervised learning process to identify on-topic pages [Chakrabarti, S., et al., “Distributed Hypertext Resource Discovery Through Examples”, In Proceedings of the 25th International Conference on Very Large Databases (VLDB), pages 375-386, 1999; and Chakrabarti, S., et al., “Focused Crawling: A New Approach to Topic-specific Web Resource Discovery”, In Proceedings of the 8th International World Wide Web Conference (WWW8), pages 1623-1640, Toronto, Canada, 1999]. These conventional methods of focused crawling also introduced the notions of a “hard-focus method” and a “soft-focus method”, referring to two possible strategies for guiding the crawl to further on-topic pages.
Yet another conventional focused crawling technique uses a sophisticated focused crawling process in which the “context” of a page is used to determine whether the page is a good gateway for discovering more pages about the topic [Diligenti, M., et al., “Focused Crawling Using Context Graphs”, In Proceedings of 26th International Conference on Very Large Data Bases, Cairo, Egypt, 2000]. This context comprises the link-induced neighborhood of the page and of its content-based model. A further conventional focused crawling technique uses a reinforcement learning approach to crawling the Web [Rennie, J., et al., “Using Reinforcement Learning to Spider the Web Efficiently”, In Proceedings of International Conference on Machine Learning, 1999]. Although these focused crawling techniques have proven to be useful, it would be desirable to present additional improvements. These conventional focused crawling techniques are aimed at fetching as many quality pages as possible that are relevant to the focus topic. However, they are not designed to generate a random sample of on-topic pages as efficiently as possible.
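By way of illustration only, the general idea of guiding a crawl towards relevant pages can be sketched as a best-first traversal. The sketch below is not the method of any of the cited works. It assumes a toy in-memory link graph in place of the Web, a hypothetical keyword-fraction relevance score, and the simplification that each discovered link inherits the relevance of the page on which it was found. All names (`focused_crawl`, `links`, `text`, `topic_words`, `budget`) are illustrative.

```python
import heapq


def focused_crawl(links, text, seeds, topic_words, budget):
    """Best-first crawl: fetch first the links found on the most relevant pages."""

    def relevance(page):
        # Hypothetical score: fraction of the page's words that are topic words.
        words = text[page].lower().split()
        return sum(w in topic_words for w in words) / max(len(words), 1)

    # Min-heap of (negated priority, page); seeds start with top priority 1.0.
    frontier = [(-1.0, s) for s in seeds]
    heapq.heapify(frontier)
    queued = set(seeds)
    fetched = []
    while frontier and len(fetched) < budget:
        _, page = heapq.heappop(frontier)
        fetched.append(page)
        parent_score = relevance(page)
        for target in links.get(page, []):
            if target not in queued:
                queued.add(target)
                # A link inherits the relevance of the page that contains it.
                heapq.heappush(frontier, (-parent_score, target))
    return fetched


# Toy data standing in for Web pages and their hyperlinks (assumed).
links = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": [], "e": []}
text = {
    "a": "jazz music",
    "b": "jazz jazz records",
    "c": "cooking recipes",
    "d": "jazz festival",
    "e": "pasta",
}
order = focused_crawl(links, text, ["a"], {"jazz"}, budget=4)
print(order)
```

In this toy run the crawl reaches the on-topic page "d" before the off-topic page "c" and never spends its budget on "e". The result is an ordered set of likely relevant pages rather than a random sample of on-topic pages, which is the limitation noted above.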
Another class of techniques for gathering statistical data about documents comprises sampling web pages, possibly through random walks. One conventional sampling method uses random queries to estimate the coverage and the overlap between search engines [Bharat, K., et al., “A Technique for Measuring the Relative Size and Overlap of Public Web Search Engines”, In Proceedings of the 7th International World Wide Web Conference (WWW7), pages 379-388, April 1998].
Another conventional technique for sampling through random walks uses a random walk process [Henzinger, M., et al., “Measuring Index Quality Using Random Walks on the Web”, In Proceedings of the 8th International World Wide Web Conference (WWW8), pages 213-235, May 1999], which converges to a distribution such as PageRank [Page, L., et al., “The Pagerank Citation Ranking: Bringing Order to the Web”, Technical report, Computer Science Department, Stanford University, 1998; and Brin, S., et al., “The Anatomy of a Large-scale Hypertextual Web Search Engine”, In Proceedings of the 7th International World Wide Web Conference (WWW1998), pages 107-117, Brisbane, Australia, 1998] over the nodes of the Web. This technique then modifies the random walk samples so as to approximate a nearly uniform distribution over the Web [Henzinger, M., et al., “On Near-Uniform URL Sampling”, In Proceedings of the 9th International World Wide Web Conference (WWW9), pages 295-308, May 2000].
Yet another conventional technique for sampling through random walks uses a random walk on an undirected and regular version of the Web graph as a means of generating near-uniform samples of Web pages [Bar-Yossef, Z., et al., “Approximating Aggregate Queries About Web pages via Random Walks”, In Proceedings of 26th International Conference on Very Large Data Bases, pages 535-544, Morgan Kaufmann, 2000]. A further conventional sampling and random walk technique handles both directed and undirected graphs [Rusmevichientong, P., et al., “Methods for Sampling Pages Uniformly from the World Wide Web”, In Proceedings of AAAI Fall Symposium on Using Uncertainty Within Computation, Cape Cod, Mass., 2001].
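By way of illustration only, the reason a random walk on an undirected, regular graph yields near-uniform samples can be sketched as follows. The sketch is not the cited authors' implementation. It assumes a small connected graph whose full adjacency is known in advance (which is not the case for the real Web), and it makes the graph regular by treating the missing degree of each node as self-loops. All names (`regular_walk_sample`, `adj`, `steps`) are illustrative.

```python
import random
from collections import Counter


def regular_walk_sample(adj, start, steps, rng):
    """Return the end node of a walk on `adj` padded with self-loops."""
    max_degree = max(len(neighbors) for neighbors in adj.values())
    node = start
    for _ in range(steps):
        # Follow a real edge with probability degree / max_degree;
        # otherwise take one of the padding self-loops and stay put.
        if rng.random() < len(adj[node]) / max_degree:
            node = rng.choice(adj[node])
    return node


# Toy undirected star graph (assumed): node 0 is linked to nodes 1, 2 and 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
rng = random.Random(0)
counts = Counter(regular_walk_sample(adj, 0, 30, rng) for _ in range(4000))
print(sorted(counts.items()))
```

Because every node has the same effective degree after padding, the transition matrix is symmetric and its stationary distribution is uniform, so the high-degree hub is sampled about as often as each leaf. The sample is uniform over all nodes, however, and carries no notion of topic.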
Although these sampling and random walk technologies have proven to be useful, it would be desirable to present additional improvements. These conventional sampling and random walk techniques generate an unfocused sample of pages. They cannot be used to efficiently generate a focused sample. Choosing uniformly at random a sample of Web pages about a given topic can be carried out either by a full-fledged crawl or by a focused crawl, which guides the crawl towards on-topic pages. However, crawling is a formidable task even when focused, requiring significant investments in infrastructure, bandwidth, and software engineering. Moreover, crawlers and focused crawlers typically prioritize fetching pages with high quality and PageRank, and thus may not be suitable for generating a uniform, unbiased sample of pages.
One conventional method uses a topical sample of Web pages to discover the fraction of images on the Web that contain textual information [Kanungo, T., et al., “What Fraction of Images on the Web Contain Text?”, In Proceedings of Web Document Analysis, 2001]. However, the sample is generated by querying the search engine Google®. Google® returns pages with a high PageRank; consequently, the returned pages do not have a uniform distribution. Moreover, the sample relies on the freshness of the repository maintained by Google®; this repository may not provide an updated snapshot of the Web. In general, performing a random walk that stays focused is a non-trivial task [Davison, B. D., “Topical Locality in the Web”, In Research and Development in Information Retrieval (SIGIR), pages 272-279, 2000; and Menczer, F., “Links Tell Us About Lexical and Semantic Web Content”, Technical Report cs.IR/0230004, Computer Science Department, Univ. of Iowa, 2001].
A further class of techniques for gathering statistical data about documents comprises data mining of the Web. One such technique uses a process for mining implicitly defined Web communities to search for small bipartite cores as signatures for Web communities [Kumar, R., “Trawling the Web for Emerging Cyber-communities”, In Proceedings of the 8th International World Wide Web Conference (WWW1999), pages 1481-1493, Toronto, Canada, 1999]. Another technique for data mining demonstrates that the same global structural properties of the Web graph also appear in its subgraphs; these subgraphs are specified by themes, topics, or geographical proximity [Dill, S., et al., “Self-similarity in Web”, ACM Transactions on Internet Technology, 2:205-223, 2002].
In general, conventional techniques for gathering or aggregating statistical data about the Web are focused or based on random walks, but not both focused and based on random walks. Conventional techniques require an extended period of time to crawl the Web. Further, conventional techniques require many resources in terms of computational and communication infrastructure, bandwidth, and software engineering. What is therefore needed is a system, a service, a computer program product, and an associated method for efficiently performing a focused random walk through linked documents to generate statistics or identify samples with respect to a focus topic. The need for such a solution has heretofore remained unsatisfied.