A. Field of the Invention
The present invention relates generally to content retrieval on the World Wide Web, and more particularly, to automated web crawling.
B. Description of the Related Art
The World Wide Web (“web”) contains a vast amount of information. Search engines assist users in locating desired portions of this information by cataloging web pages. Typically, in response to a user's request, the search engine returns references to documents relevant to the request.
Search engines may base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to identify links to high-quality, relevant results based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query to a corpus of pre-stored web documents. Web documents that contain the user's search terms are considered “hits” and are returned to the user.
The corpus of pre-stored web documents may be stored by the search engine as an index of terms found in the web pages. Documents that are to be added to the index may be located by a program, sometimes referred to as a “spider,” that automatically traverses (“crawls”) web documents based on the uniform resource locators (URLs) contained in the web documents. Thus, for example, a spider program may, starting at a given web page, download the page, index the page, and gather all the URLs present in the page. The spider program may then repeat this process for the web pages referred to by the URLs. In this way, the spider program “crawls” the web based on its link structure.
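The download, index, and gather-URLs loop described above can be sketched as follows. This is a minimal illustration only, not an implementation drawn from any particular search engine. The names `PageParser`, `crawl`, and `PAGES`, the `fetch` callback, the `max_pages` limit, and the `example.com` addresses are all hypothetical, and the "web" is an in-memory dictionary so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the link targets and the words found in one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns an index mapping term -> set of URLs."""
    index = {}
    seen = {start_url}          # pages are identified by URL alone
    queue = deque([start_url])
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        html = fetch(url)       # "download the page"
        if html is None:
            continue
        crawled += 1
        parser = PageParser()
        parser.feed(html)
        parser.close()
        for word in parser.words:            # "index the page"
            index.setdefault(word, set()).add(url)
        for href in parser.links:            # "gather all the URLs"
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return index


# Hypothetical three-page site used in place of real downloads.
PAGES = {
    "http://example.com/": '<a href="/a">alpha</a> <a href="/b">beta</a>',
    "http://example.com/a": 'gamma <a href="/">home</a>',
    "http://example.com/b": "delta gamma",
}

index = crawl("http://example.com/", PAGES.get)
```

Note that the `seen` set keys each page by its URL, which is the assumption that embedded session identifiers later undermine.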
Some web sites track users as they download different pages on the web site. User tracking is useful for identifying user behavior, such as identifying purchasing behavior by tracking the user through various page requests on a shopping-oriented web site.
Two methods are commonly used to track user behavior: storing information in cookies, and embedding session identifiers in the URLs of the web pages presented to the user. An embedded session identifier, in particular, may include a string of random characters embedded in the URLs returned to the user. When the user selects one of the URLs, the embedded identifier is returned to the web server in the request for the web page. The identifier can then be used to track the web pages presented to the user.
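As one illustration of the second method, a web server might append a random identifier to each URL it returns and read it back from later requests. The sketch below assumes the identifier is carried in a query parameter named `sid`. That parameter name, the helper functions, and the `shop.example` address are hypothetical, since web sites embed identifiers in many different ways.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def new_session_id():
    """Returns a string of 16 random hexadecimal characters."""
    return secrets.token_hex(8)


def embed_session_id(url, session_id, param="sid"):
    """Returns the URL with the session identifier added to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((param, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def read_session_id(url, param="sid"):
    """Returns the identifier carried by a requested URL, or None if absent."""
    for name, value in parse_qsl(urlsplit(url).query):
        if name == param:
            return value
    return None


tagged = embed_session_id("http://shop.example/cart?item=3", "abc123")
```

Here `tagged` is `http://shop.example/cart?item=3&sid=abc123`, and `read_session_id(tagged)` recovers `abc123` when the user requests that URL.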
Embedding session identifiers in the URLs of a web page, although potentially useful to the web site owner, poses problems for automated web spiders. Because the spider identifies pages based on their URLs, embedding session identifiers in a URL can cause the underlying web page to appear to be different to the spider each time a new session identifier is embedded in the URL. This can, in turn, cause the spider to repeatedly crawl the same web page, thus limiting the spider's ability to crawl all possible sites.
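The effect can be seen in a small simulation. The sketch below assumes a hypothetical one-page site whose every response links back to the same page with a fresh identifier in a `sid` parameter, and a spider that recognizes pages only by exact URL. The names `serve` and `naive_crawl` and the `budget` limit are illustrative, and a counter stands in for random identifiers so that the run is repeatable.

```python
import itertools
from collections import deque
from urllib.parse import urlsplit

_session_counter = itertools.count(1)


def serve(url):
    """Hypothetical server: returns the links found in the requested page.

    Every response links back to the same page, with a new session
    identifier embedded in the URL.
    """
    sid = f"{next(_session_counter):08x}"
    return [f"http://shop.example/item?sid={sid}"]


def naive_crawl(start_url, budget):
    """Crawls until the page budget is spent; returns the URLs fetched."""
    seen = {start_url}
    queue = deque([start_url])
    fetched = []
    while queue and len(fetched) < budget:
        url = queue.popleft()
        fetched.append(url)
        for link in serve(url):
            if link not in seen:    # exact-URL comparison never matches
                seen.add(link)
                queue.append(link)
    return fetched


fetched = naive_crawl("http://shop.example/item", budget=5)
distinct_urls = len(set(fetched))
distinct_paths = {urlsplit(u).path for u in fetched}
```

In this run the spider spends its entire budget of five fetches on five distinct URLs, yet `distinct_paths` contains only `/item`: the same underlying page was downloaded each time.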
Thus, there is a need in the art to more effectively crawl web sites that embed session identifiers in URLs.