A web crawler is a program or script that browses the World Wide Web (WWW) in an automated manner. Web crawlers are used for various purposes, the most popular of which is to identify and index as much available information on the World Wide Web as possible and provide the information to the general public through a search engine, such as the search engine provided by Yahoo!™. One way to browse the WWW is to begin at one or more webpages and follow the hyperlinks (referred to herein as “links”) contained within each webpage by loading the webpages corresponding to the links. The links are uniform resource locators (URLs) that allow the webpage, identified by the link, to be accessed. A webpage may be accessed through one or more URLs, but one URL can be used to access at most one webpage.
One way for web crawlers to retrieve as much information as possible is to only “crawl” webpages that provide unique content, i.e., content relating to a website that has not already been indexed and/or updated. However, some web crawlers assume that a unique URL corresponds to a unique webpage. This is not always the case.
A dynamic URL is a URL of a webpage with content that depends on variable parameters that are provided to the server that delivers the content. The parameters may be already present in the URL itself or they may be the result of user input.
A dynamic URL typically results from a search of a database-driven website or is the URL of a website that runs a script. In contrast to static URLs, in which the contents of the webpage do not change unless the changes are coded into the HTML, dynamic URLs are typically generated from specific queries to a website's database. The webpage has some fixed content, and some part of the webpage is a template to display the results of the query, where the content comes from the database that is associated with the website. This results in the page changing based on the data retrieved from the database per the dynamic parameter. Dynamic URLs often contain the following characters or strings: ?, &, %, +, =, $, cgi. An example of a dynamic URL may be something like the following: http://www.amazon.com/store?prod=camera&brand=sony&sessionid=7ek138dje72931d91ds.
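To make the structure of a dynamic URL concrete, the following minimal sketch splits the example URL above into its base and its query parameters using Python's standard urllib.parse module. The helper names split_dynamic_url and looks_dynamic are hypothetical and introduced only for illustration; the marker list simply mirrors the characters and strings named above and is a rough heuristic, not a reliable test.

```python
from urllib.parse import urlsplit, parse_qsl

# Markers taken from the list in the text above; a heuristic only.
DYNAMIC_MARKERS = ("?", "&", "%", "+", "=", "$", "cgi")


def looks_dynamic(url):
    """Return True if the URL contains any marker commonly seen in dynamic URLs."""
    return any(marker in url for marker in DYNAMIC_MARKERS)


def split_dynamic_url(url):
    """Return (base, params), where params is a list of (name, value) pairs."""
    parts = urlsplit(url)
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    params = parse_qsl(parts.query, keep_blank_values=True)
    return base, params


url = ("http://www.amazon.com/store"
       "?prod=camera&brand=sony&sessionid=7ek138dje72931d91ds")
base, params = split_dynamic_url(url)

print(looks_dynamic(url))
print(base)
for name, value in params:
    print(f"{name} = {value}")
```

In this sketch, the parameters prod and brand select the content, while sessionid is carried along in the same query string with nothing in the URL syntax to distinguish it.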
However, sometimes a parameter in a dynamic URL may not result in modifying the page content in any way. One of the parameters of the example dynamic URL above is sessionid followed by a corresponding value that is unique to a user. The “sessionid” parameter is used by the website to track the user during a particular session in order to tailor the user's experience based on knowledge obtained about what actions the user has taken during the session. The “sessionid” may be inserted into the URL as a result of a user registering and logging into the website. Another parameter similar to the sessionid parameter is the source tracker parameter. Like the sessionid parameter, the source tracker parameter has no effect on the content of the webpage; it is only used for logging traffic sources to the webpage.
To current web crawlers, the above dynamic URL is different from http://www.amazon.com/store?prod=camera&brand=sony&sessionid=2k4gd03k9sx1zc8d. However, the only difference between the URLs is the sessionid. The content provided on each corresponding webpage is identical (i.e., information about Sony cameras) except for perhaps a graphic or text that identifies the user. For the purposes of search results, the two webpages are referred to as “duplicates.”
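The point can be illustrated with a short sketch. It assumes the crawler already knows that sessionid does not affect content, which is exactly the knowledge that current crawlers lack; the CONTENT_NEUTRAL set and the canonicalize helper are hypothetical names used only for this illustration and are not a method described in this section.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption for illustration: parameters known not to affect page content.
CONTENT_NEUTRAL = {"sessionid"}


def canonicalize(url, ignored=CONTENT_NEUTRAL):
    """Rebuild the URL without the ignored parameters (any fragment is dropped)."""
    parts = urlsplit(url)
    kept = [(name, value)
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
            if name not in ignored]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))


url_a = ("http://www.amazon.com/store"
         "?prod=camera&brand=sony&sessionid=7ek138dje72931d91ds")
url_b = ("http://www.amazon.com/store"
         "?prod=camera&brand=sony&sessionid=2k4gd03k9sx1zc8d")

print(url_a == url_b)                              # compared as raw strings
print(canonicalize(url_a) == canonicalize(url_b))  # compared without sessionid
```

Compared as raw strings the two URLs differ, whereas with the sessionid parameter set aside they reduce to the same URL.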
Another example of duplicate webpages is a webpage that displays contents of a table and another webpage that sorts the contents of the table differently, according to certain criteria. There may be multiple criteria by which the contents of the table may be sorted. Although the contents of each webpage are displayed differently and the URLs are different, the overall content of each webpage is substantially identical. Thus, there may be hundreds of duplicate webpages that each provide the same particular content. A web crawler may unintentionally index all such duplicates.
One approach for a web crawler is to intelligently analyze a particular webpage and compare the particular webpage against other webpages to determine whether the content of the particular webpage is truly unique. However, such an approach is still prone to error (i.e., not all duplicates are identified as duplicates). Furthermore, a significant amount of resources is consumed by simply accessing the webpages, much less performing the comparisons. Time spent accessing duplicate webpages of a website cannot be spent accessing other valid, non-duplicate webpages.
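A minimal sketch of one possible content comparison follows. It assumes each webpage has already been fetched and reduced to a list of table rows, and it treats two pages as duplicates when they contain the same rows in any order. The fingerprint function and the sample rows are made up for illustration; real comparisons would also have to cope with user-specific text and other noise, which is one reason such approaches remain prone to error.

```python
import hashlib


def fingerprint(rows):
    """Hash the rows so that the result does not depend on display order."""
    digest = hashlib.sha256()
    for row in sorted(rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent rows cannot run together
    return digest.hexdigest()


# Hypothetical rows extracted from three already-fetched webpages.
sorted_by_price = ["Sony A|499", "Sony B|899", "Sony C|1299"]
sorted_by_name_desc = ["Sony C|1299", "Sony B|899", "Sony A|499"]
different_page = ["Canon X|599"]

same = fingerprint(sorted_by_price) == fingerprint(sorted_by_name_desc)
differ = fingerprint(sorted_by_price) != fingerprint(different_page)
print(same, differ)
```

Note that every page must be fetched before its fingerprint can be computed, so this kind of check detects duplicates only after the cost of accessing them has been paid.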
To illustrate, a web crawler is limited to a certain number of fetches (e.g., HTTP GET requests) that the web crawler can make within a given period of time. A web server that hosts a website is also limited in the number of fetches it can handle. Web crawlers cannot accept every URL of every single website because some websites have millions of webpages with their own corresponding URLs. Thus, a web crawler should make intelligent decisions about which webpages to access.
Another approach for a web crawler is to implement strict rules to handle dynamic URLs in order to avoid accessing duplicates. For example, a web crawler may only access a small number of webpages with “similar looking” URLs. As another example, a web crawler may not access URLs that are longer than a certain number of characters in order to avoid URLs with session identifiers. However, such measures prevent web crawlers from accessing a significant amount of unique content.
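The two example rules can be sketched as follows. The thresholds MAX_URL_LENGTH and MAX_PER_PATTERN, the definition of “similar looking” as sharing the same host and path, and the example.com URLs are all assumptions chosen for illustration; actual crawlers may define these rules differently.

```python
from collections import Counter
from urllib.parse import urlsplit

MAX_URL_LENGTH = 60   # hypothetical length limit, in characters
MAX_PER_PATTERN = 2   # hypothetical cap on "similar looking" URLs


def select_urls(urls, max_len=MAX_URL_LENGTH, max_per_pattern=MAX_PER_PATTERN):
    """Apply the length rule and the per-pattern cap; return the URLs to fetch."""
    seen = Counter()
    selected = []
    for url in urls:
        if len(url) > max_len:
            continue  # rule 1: skip long URLs, which may carry session identifiers
        parts = urlsplit(url)
        pattern = (parts.netloc, parts.path)  # assumed meaning of "similar looking"
        if seen[pattern] >= max_per_pattern:
            continue  # rule 2: enough URLs with this host and path already chosen
        seen[pattern] += 1
        selected.append(url)
    return selected


candidates = [
    "http://www.example.com/store?prod=camera",
    "http://www.example.com/store?prod=phone",
    "http://www.example.com/store?prod=laptop",
    "http://www.example.com/store?prod=camera&brand=sony"
    "&sessionid=7ek138dje72931d91ds",
]
selected = select_urls(candidates)
for url in selected:
    print(url)
```

With these assumed thresholds, the long URL containing a session identifier is skipped as intended, but the prod=laptop URL is also skipped even though it leads to unique content, which is the drawback noted above.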
Another approach for handling dynamic URLs is for webmasters to modify their respective websites to avoid dynamic URLs or to rewrite dynamic URLs to make them appear static so that web crawlers will crawl the entirety of their respective websites. Webmasters of websites typically desire a large amount of user traffic to their respective websites in order to generate advertisement revenue. Accordingly, webmasters want web crawlers to crawl all relevant webpages on their respective websites. However, because of web crawler difficulties in handling dynamic URLs, webmasters must spend a considerable amount of time modifying their respective websites.
Therefore, there is a need to more efficiently handle dynamic URLs in order to avoid unnecessarily accessing duplicate webpages.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.