A website crawler is a tool that automatically explores a website. Such exploration benefits many applications, ranging from simple information indexing tasks to more complex compliance testing.
One challenge automated tools face is determining whether two or more uniform resource locator (URL) links on a page perform equivalent actions. This determination is important because websites such as news sites, blogs, on-line stores, and email services contain a massive quantity of URL links that typically provide the same type of navigation action, bringing a user to equivalent pages. In practice, each group of such links collapses into a single representative: a single news link, a single blog entry, a single item in the store, or a single email, respectively. A common term for these links is equivalent links.
Exploring all possible equivalent links of a website is a time-consuming task that is not required in all cases. For example, when performing a security scan, a web crawler is more concerned with identifying the structure of a web page than with its text content. In this example, exploring just one equivalent link would be sufficient, and the results could be generalized to the remaining instances.
In addition to the initial identification problem, most websites change the set of equivalent links displayed to the user on subsequent visits. For example, a newsletter will show the latest news, a blog will show the latest entries, and an on-line store will probably show the items on sale. Crawling such websites is thus further complicated because the container page holding the equivalent links is typically never the same; therefore, a crawler is not able to determine that the web page was previously visited.
Current solutions to the problem typically require a web crawler to examine the page content returned by each link to determine whether the links are equivalent. The web crawler uses heuristics to omit portions of the page that commonly differ between similar pages, for example advertisements, but this practice leads to inaccurate results when either too much or too little information is omitted. Improvements to this technique require a user to create hypertext markup language (HTML) expressions that indicate which portions of pages to omit when comparing the pages to determine similarity.
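By way of illustration only, the content comparison described above might be sketched as follows. The sketch assumes well-formed HTML and assumes that the portions to omit are identified by a class name such as "ad". The names FilteredText, filtered_text, and pages_equivalent are hypothetical and are not taken from any existing crawler.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class FilteredText(HTMLParser):
    """Collect page text, skipping elements whose class is in omit_classes.

    Assumption: the input is well formed (every non-void tag is closed).
    """

    def __init__(self, omit_classes):
        super().__init__()
        self.omit_classes = set(omit_classes)
        self.skip_depth = 0   # > 0 while inside an omitted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or classes & self.omit_classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def filtered_text(html, omit_classes):
    """Return the visible text of a page with the omitted portions removed."""
    parser = FilteredText(omit_classes)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def pages_equivalent(html_a, html_b, omit_classes=("ad",)):
    """Treat two pages as equivalent when their filtered text is identical."""
    return filtered_text(html_a, omit_classes) == filtered_text(html_b, omit_classes)


page_a = '<div><h1>Store</h1><div class="ad">Buy X</div><p>Item list</p></div>'
page_b = '<div><h1>Store</h1><div class="ad">Buy Y</div><p>Item list</p></div>'
print(pages_equivalent(page_a, page_b))                   # advertisements omitted
print(pages_equivalent(page_a, page_b, omit_classes=()))  # nothing omitted
```

The two calls show how sensitive the result is to the omission rule: the same pair of pages is judged equivalent or different depending solely on which portions are left out.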
In addition, existing techniques use the same page structure comparisons to determine whether the structure of the web page stays the same during subsequent visits, and discard the page after a period of time. This technique indirectly solves the problem of equivalent links, because the web crawler works with the structure of the page rather than with the attribute values of the page. Other solutions require expert knowledge to configure the crawler to ignore certain portions of the URL.
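As a minimal sketch of the two approaches just described, and not a description of any particular product, a structural comparison can reduce a page to its sequence of tags while ignoring attribute values and text, and a URL-based configuration can strip query parameters that an expert has marked as irrelevant. The names TagSequence, structure_signature, and normalize_url, and the parameter names "id" and "session", are assumptions made for illustration.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


class TagSequence(HTMLParser):
    """Record only the order of opening and closing tags of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)          # attribute values are deliberately ignored

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def structure_signature(html):
    """Hash the tag sequence so pages can be compared by structure alone."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return hashlib.sha256(" ".join(parser.tags).encode("utf-8")).hexdigest()


def normalize_url(url, ignored_params):
    """Drop the fragment and any query parameter listed in ignored_params."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key not in ignored_params]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


first = '<ul><li><a href="/news/1">Story A</a></li></ul>'
second = '<ul><li><a href="/news/2">Story B</a></li></ul>'
print(structure_signature(first) == structure_signature(second))

print(normalize_url("http://example.com/item?id=7&session=abc", {"id", "session"}))
```

In this sketch a signature changes whenever the number of listed items changes, and the set of ignored parameters must be chosen by hand for each site.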