This invention relates generally to processing within a computing environment, and more particularly to Document Object Model (DOM) based page uniqueness detection.
Web crawlers, such as those used by page-indexing search engines, and security scanning applications often need to determine whether a page has already been visited. To do this, those applications attempt to identify each page uniquely using information on the page. This information is used to determine whether the next page being visited is a new page or a duplicate of one visited previously. Web crawlers and security scanning applications must use such techniques to avoid entering an infinite loop (i.e., exploring the same series of pages over and over again) while ensuring that the relevant pages of the website are indexed. These applications may use key elements of the page to determine its uniqueness, for example, the Uniform Resource Locator (URL) of the page, the parameters passed to the page, and cookies (i.e., information stored on a browser by a web server). One problem with this type of implementation is that it often makes it impossible to crawl Web 2.0 applications. Web 2.0 applications make extensive use of JavaScript and XMLHttpRequest, which may modify page content without changing the URL, parameters, or cookies of the page, thereby making it more difficult to identify a page.
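By way of illustration only, the conventional approach described above may be sketched as follows. The sketch assumes a key built from the URL, the sorted query parameters, and the cookies, hashed with SHA-256; the names `page_key` and `VisitedSet` are hypothetical and do not appear in any particular crawler.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl


def page_key(url, cookies=None):
    """Hypothetical helper: derive a key from URL, parameters, and cookies only."""
    parts = urlsplit(url)
    # Sort parameters so that ?x=1&y=2 and ?y=2&x=1 yield the same key.
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    cookie_items = sorted((cookies or {}).items())
    raw = repr((parts.scheme, parts.netloc.lower(), parts.path, params, cookie_items))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class VisitedSet:
    """Hypothetical record of the page keys a crawler has already seen."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url, cookies=None):
        # Return True the first time a key is seen, False on every revisit.
        key = page_key(url, cookies)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

Because the key in this sketch never inspects page content, two states of a Web 2.0 page that differ only in content loaded by JavaScript at the same URL, with the same parameters and cookies, map to the same key, and the second state would be treated as a duplicate.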