The increased use of content-management systems to generate web pages has significantly enriched the browsing experience of end users. The multitude of site navigation links, sidebars, copyright notices, and timestamps provides easy-to-access and often useful information to users. From an objective standpoint, however, these “template” structures pollute the content by digressing from the main topic of discourse of the web page. Modern search engines may require only the content of web pages, without such template structures, for indexing, analysis and ranking of web pages for user search queries. Furthermore, template structures can cripple the performance of many modules of search engines, including the index function, ranking function, summarization function, duplicate detection function, etc. With templated content currently constituting more than half of all HTML on the web and growing steadily (see, for example, Z. Bar-Yossef and S. Rajagopalan, Template Detection via Data Mining and its Applications, In Proc. 11th WWW, pages 580-591, 2002; and D. Gibson, K. Punera, and A. Tomkins, The Volume and Evolution of Web Page Templates, In Proc. 14th WWW (Special Interest Tracks and Posters), pages 830-839), it is imperative that search engines develop scalable tools and techniques to reliably detect templates on a web page.
Existing methods for template detection operate on a per-web-site basis by analyzing several web pages from the site and identifying content and/or structure that repeats across many pages. The problem of template detection and removal was first studied by Bar-Yossef and Rajagopalan (see Z. Bar-Yossef and S. Rajagopalan, Template Detection via Data Mining and its Applications, In Proc. 11th WWW, pages 580-591, 2002), who proposed performing site-level template detection based on segmentation of the DOM tree, followed by the selection of certain segments as candidate templates depending on their content. Yi et al. (see L. Yi, B. Liu, and X. Li, Eliminating Noisy Information in Web Pages for Data Mining, In Proc. 9th KDD, pages 296-305, 2003) and Yi and Liu (see L. Yi and B. Liu, Web Page Cleaning for Web Mining through Feature Weighting, In Proc. 18th IJCAI, pages 43-50, 2003) used a data structure called the style tree to take into account the metadata for each node, instead of its content. Vieira et al. (see K. Vieira, A. Silva, N. Pinto, E. Moura, J. Cavalcanti, and J. Freire, A Fast and Robust Method for Web Page Template Detection and Removal, In Proc. 15th CIKM, pages 256-267, 2006) proposed performing site-level template detection by mapping identical nodes and subtrees in the DOM trees of two different pages. They proposed performing the expensive task of template detection on a small number of pages, and then removing all instances of these templates from the entire site by a much cheaper approach.
While these “site-level” template detection methods offer a lot of promise, such methods are of limited use for two reasons. First, site-level templates constitute only a small fraction of all templates on the web. For instance, page- and session-specific navigation aids such as “Also bought” lists, ads, etc. are not captured by the site-level notion of templates. Second, these methods are error-prone when the number of pages analyzed from a site is statistically insignificant, either because the site is small, or because a large fraction of the site is yet to be crawled. In particular, they are totally inapplicable when pages from a new web site are encountered for the first time.
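For illustration only, the general site-level idea described above (flagging content that repeats across many pages of one site) may be sketched as follows. This is a minimal sketch and not the method of any of the cited works: the names BlockCollector, page_blocks and site_level_templates, the use of a (tag path, text) pair as the unit of comparison, and the min_fraction threshold of 0.8 are all assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    # Void elements have no closing tag, so they are kept off the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.blocks = set()  # hypothetical unit of comparison

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self.blocks.add(("/".join(self.stack), text))


def page_blocks(html):
    """Return the set of (tag path, text) blocks found on one page."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    return collector.blocks


def site_level_templates(pages, min_fraction=0.8):
    """Return blocks that occur on at least min_fraction of the pages.

    The threshold is an assumed parameter, not taken from the cited works.
    """
    counts = Counter()
    for html in pages:
        counts.update(page_blocks(html))
    needed = min_fraction * len(pages)
    return {block for block, n in counts.items() if n >= needed}
```

Given three pages from one site that share a navigation link but differ in their main text, this sketch flags only the shared link. Given a single page, every block trivially appears on all of the pages analyzed and is therefore flagged, which mirrors the second limitation noted above: a repetition-based method cannot separate template from content when too few pages from the site are available.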
Additionally, some page-level algorithms that may operate only on segments of a web page have been proposed recently. For example, Kao et al. (see H.-Y. Kao, J.-M. Ho, and M.-S. Chen, WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model, TKDE, 17(5):614-627, 2005) segment a given web page using a greedy algorithm operating on features derived from the page. To do so, they use both page-level and site-level features such as the number of links between pages on a web site. Debnath et al. (see S. Debnath, P. Mitra, N. Pal, and C. L. Giles, Automatic Identification of Informative Sections of Web Pages, TKDE, 17(9):1233-1246, 2005) also propose a page-level algorithm (“L-Extractor”) that applies a classifier to DOM nodes, but only certain nodes are chosen for classification, based on a predefined set of tags. Kao et al. (see H.-Y. Kao, M.-S. Chen, S.-H. Lin, and J.-M. Ho, Entropy-based link analysis for mining web informative structures, In Proc. 11th CIKM, pages 574-581, 2002) propose a scheme based on information entropy to focus on the links and pages that are most information-rich, reducing the weights of template material as a by-product. Song et al. (see R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, Learning Block Importance Models for Web Pages, In Proc. 13th WWW, pages 203-211, 2004) use visual layout features of the web page to segment it into blocks, which are then judged on their salience and quality. Other local algorithms based on machine learning have been proposed to remove certain types of template material. Davison (see B. Davison, Recognizing Nepotistic Links on the Web, In AAAI-2000 Workshop on Artificial Intelligence for Web Search, pages 23-28, 2000) uses decision tree learning to detect and remove “nepotistic” links, and Kushmerick (see N. Kushmerick, Learning to Remove Internet Advertisement, In Proc. 3rd Agents, pages 175-181, 1999) develops a browsing assistant that learns to automatically remove banner advertisements from pages.
Unfortunately, these algorithms may operate only on segments of a web page, and the segments are chosen prior to any determination of the templateness of those segments. As a result, these algorithms would not be able to detect templates within a segment that is itself composed of several template and non-template nodes.
What is needed is a system and method that does not need multiple pages from the same web site to perform template detection and that may perform template detection for any subset of a web page. Such a system and method should be easily deployed as a drop-in module in an existing web crawler workflow.