A fundamental problem that Web crawlers need to solve when crawling websites built with WEB 1.0 and/or WEB 2.0 technologies may be the unique identification of web pages and of the respective states of those web pages. This identification may be fundamental to a successful crawl because, without it, the crawler may not recognize pages or states it has already visited, and the crawl may not stop. The difficulty of this task is typically amplified by WEB 2.0 technologies, in which rich Internet application (RIA) websites may have dynamic content that changes over time. In these sites, a Uniform Resource Locator (URL) may no longer be synchronized with the content of the page as in WEB 1.0 (for example, the URL may not necessarily change when the content of the page changes).
The problem may be further amplified for web pages with content that changes over time without any user action. In these pages, provided logic may dictate how the website constructs portions of the content. Examples may include embedded advertisements, time displays, counters of page visits over time, and others. The additional, changing data may impede the ability of an automatic crawler to identify the web page (in WEB 1.0) and the document object model (DOM) states (in RIAs), because the page or DOM may continually change.
Regardless of the web technology used, a web page at a given moment in time may consist of a DOM. Crawlers may use various equivalence functions to infer whether two DOMs are considered equal. The main challenge when defining an equivalence function may be to exclude, from the content the function considers, those portions of the page or DOM that may introduce false negatives, that is, cases in which two instances of the same page or state are judged to be different.
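By way of illustration only, an equivalence function of the kind described above may be sketched as follows. The sketch assumes well-formed markup, and the exclusion lists (`IGNORED_TAGS`, `IGNORED_CLASSES`) and the function names are hypothetical choices made for this example rather than part of any particular crawler. It reduces each DOM to a hash computed over the content that remains after the excluded subtrees are skipped, and treats two DOMs as equal when the hashes match.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical exclusion lists: subtrees assumed to hold volatile content.
IGNORED_TAGS = {"script", "iframe"}
IGNORED_CLASSES = {"ad", "clock", "visit-counter"}
# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _SignatureBuilder(HTMLParser):
    """Collects tags and text, skipping every ignored subtree."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self.skip_depth = 0  # > 0 while inside an ignored subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.skip_depth == 0:
                self.tokens.append(f"<{tag}/>")
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or tag in IGNORED_TAGS or classes & IGNORED_CLASSES:
            self.skip_depth += 1
            return
        self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.tokens.append(text)


def dom_signature(html: str) -> str:
    """Hash of the page content that remains after exclusions."""
    builder = _SignatureBuilder()
    builder.feed(html)
    builder.close()
    return hashlib.sha256("\n".join(builder.tokens).encode("utf-8")).hexdigest()


def equivalent(html_a: str, html_b: str) -> bool:
    """Equivalence function: two DOMs are equal if their signatures match."""
    return dom_signature(html_a) == dom_signature(html_b)
```

In this sketch, two snapshots of a page that differ only inside an element with class `clock` may be considered equal, whereas a change to a heading outside any excluded subtree may cause the snapshots to be considered different. The quality of the result may depend entirely on how well the exclusion lists match the volatile portions of a given site.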
A typical current solution configures a crawler manually on a case-by-case basis. Manual configuration may force the crawler to ignore certain types of objects known to change over time, such as session identifiers and cookies. Manual configuration is typically highly inefficient and inaccurate, because the list of such objects is typically incomplete. In another solution, regular expressions identify portions of content in the DOM that can be ignored. The main problems with the latter solution are typically the difficulty of creating the regular expressions and the need to create different regular expressions for different sites.
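The regular-expression approach may be sketched as follows, by way of example only. The patterns in `VOLATILE_PATTERNS` and the function name `strip_volatile` are hypothetical and written for an imagined site; they are not drawn from any existing crawler. Matching content is removed from the serialized DOM before two snapshots are compared.

```python
import re

# Hypothetical, hand-written patterns for one imagined site.
VOLATILE_PATTERNS = [
    # Session identifiers carried in URLs, e.g. "sid=abc123".
    re.compile(r"\b(?:jsessionid|sessionid|sid)=[A-Za-z0-9]+", re.IGNORECASE),
    # Clock displays of the form HH:MM:SS.
    re.compile(r"\b\d{2}:\d{2}:\d{2}\b"),
]


def strip_volatile(dom_text: str) -> str:
    """Remove every match of the configured patterns from the DOM text."""
    for pattern in VOLATILE_PATTERNS:
        dom_text = pattern.sub("", dom_text)
    return dom_text
```

The sketch also exhibits the drawback noted above: a visit counter such as `Visits: 1041` is matched by neither pattern, so two snapshots that differ only in that counter may still compare as different until a further, site-specific expression is written.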