Web crawlers traverse web sites in a methodical, automated way to analyze them, for example to determine whether issues related to web vulnerabilities, accessibility, or quality exist, among a myriad of other purposes. Typically, the same web components or web information appear repeatedly across different pages of a site to facilitate site navigation. Crawling these redundant components increases the time and resources needed.
For example, a web crawler visits two web pages that share a common HTML form control. When the web crawler scans the second web page, the crawler detects that the HTML form control was already scanned as part of the first page scan, but skips the second page scan to avoid redundant processing only when the complete content of the two web pages is similar.
A previous solution typically identifies two pages as the same when the pages are analyzed to be structurally similar. A similarity algorithm of the previous solution operates at the page level and assumes that a repetitive consecutive sequence of HTML elements is redundant for analysis purposes. The technique can be applied to each sub-structure of a page; however, the previous solution typically lacks scalability and efficiency. The previous solution generates an MD5 hash value as an identifier (ID) of a DOM or of HTML elements. Accordingly, slightly different HTML can produce a completely different MD5 hash value, and for each computed hash value of a page the crawler would need to search a record repository comprising many records to determine whether a specific sub-tree or control was scanned previously.
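The sensitivity of an MD5-based identifier to small markup changes can be illustrated with a minimal sketch. The two form snippets, the helper names `md5_id` and `differing_bits`, and the in-memory `scanned` set standing in for the record repository are all hypothetical and chosen only for illustration; they are not taken from the previous solution itself.

```python
import hashlib


def md5_id(html: str) -> str:
    """Return the MD5 hex digest of an HTML fragment, used as its ID."""
    return hashlib.md5(html.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# Two hypothetical form controls that differ by a single space character.
form_a = '<form action="/search"><input name="q"><input type="submit"></form>'
form_b = '<form action="/search"><input name="q" ><input type="submit"></form>'

id_a = md5_id(form_a)
id_b = md5_id(form_b)

# Stand-in for the record repository: an exact-match lookup on the ID.
scanned = {id_a}
already_scanned = id_b in scanned

print(id_a)
print(id_b)
print("bits that differ:", differing_bits(id_a, id_b))
print("second form recognized as scanned:", already_scanned)
```

Because the lookup is an exact match on the digest, the second form is not recognized as previously scanned even though it is functionally the same control.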
In a similar solution using similarity estimation, Manku et al. (Gurmeet S. Manku, Arvind Jain, Anish D. Sarma, “Detecting near duplicates for web crawling,” Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 141-150) proposed a method that uses a Locality Sensitive Hash (LSH) (Moses Charikar, “Similarity estimation techniques from rounding algorithms,” Proceedings of the 34th Symposium on Theory of Computing (STOC), 2002, pp. 380-388) to detect near-duplicate web pages. Van Durme and Lall (Benjamin Van Durme and Ashwin Lall, “Online Generation of Locality Sensitive Hash Signatures,” Proceedings of the ACL 2010 Conference Short Papers, pages 231-235, Uppsala, Sweden, 11-16 Jul. 2010. © 2010 Association for Computational Linguistics) revisited the work of Ravichandran et al. (Deepak Ravichandran, Patrick Pantel, and Eduard Hovy, “Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering,” Proceedings of the 43rd Annual Meeting of the ACL, pages 622-629, Ann Arbor, June 2005. © 2005 Association for Computational Linguistics) and Charikar (2002) in asserting that an online version of an LSH signature can be maintained. However, the work presented consisted of detecting similarity of the complete content of a web page (every character in an HTML page). Other proposed similar solutions include those by Batko et al. 2008 (Michal Batko, David Novak, Fabrizio Falchi, Pavel Zezula, “Scalability comparison of Peer-to-Peer similarity search structures,” Future Generation Computer Systems, Volume 24, Issue 8, October 2008) and Asaduzzaman and Bochmann 2009 (S. Asaduzzaman and G. v. Bochmann, “A locality preserving routing overlay using geographic coordinates,” IEEE Intern. Conf. on Internet Multimedia Systems Architecture and Application, Bangalore, India, December 2009).
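The idea behind a Charikar-style locality sensitive hash can be sketched as follows. This is a simplified illustration, not the method of any cited work: the tokenizer, the unweighted word features, the 64-bit fingerprint width, the use of MD5 as the per-token hash, and the names `tokens`, `simhash`, and `hamming` are all assumptions made for the example.

```python
import hashlib
import re

BITS = 64  # fingerprint width assumed for this sketch


def tokens(html: str) -> list:
    """Split markup into lowercase word tokens (a deliberately crude feature set)."""
    return re.findall(r"\w+", html.lower())


def simhash(html: str) -> int:
    """Compute a 64-bit fingerprint in which similar token sets share most bits."""
    counts = [0] * BITS
    for tok in tokens(html):
        # Hash each token to 64 bits; each bit votes +1 or -1 on its position.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical pages: page_b adds one space, page_c adds one extra link.
page_a = '<html><body><form action="/search"><input name="q"></form></body></html>'
page_b = '<html><body><form action="/search"><input name="q" ></form></body></html>'
page_c = page_a.replace("</body>", '<a href="/help">help</a></body>')

print("a vs b:", hamming(simhash(page_a), simhash(page_b)))
print("a vs c:", hamming(simhash(page_a), simhash(page_c)))
```

In this sketch two pages would be treated as near duplicates when the Hamming distance between their fingerprints falls below a chosen threshold. Note that the fingerprint is computed over the whole page, which mirrors the page-level scope of the works discussed above.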