A. Field of the Invention
The present invention relates generally to content retrieval on the world wide web, and more particularly, to automated web crawling.
B. Description of the Related Art
The World Wide Web (“web”) contains a vast amount of information. Search engines assist users in locating desired portions of this information by cataloging web pages. Typically, in response to a user's request, the search engine returns references to documents relevant to the request.
Search engines may base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to identify links to high-quality, relevant results based on the search query. Typically, the search engine accomplishes this by matching the terms in the search query against a corpus of pre-stored web documents. Web documents that contain the user's search terms are considered “hits” and are returned to the user.
The corpus of pre-stored web documents may be stored by the search engine as an index of terms found in the web pages. Documents that are to be added to the index may be automatically located by a program, sometimes referred to as a “spider,” that automatically traverses (“crawls”) web documents based on the uniform resource locators (URLs) contained in the web documents. Thus, for example, a spider program may, starting at a given web page, download the page, index the page, and gather all the URLs present in the page. The spider program may then repeat this process for the web pages referred to by the URLs. In this way, the spider program “crawls” the world wide web based on its link structure.
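The download, index, and gather-URLs loop described above can be sketched as follows. This is a minimal illustration only, not any particular search engine's spider: the `crawl` function, the `LinkExtractor` class, the in-memory `PAGES` dictionary standing in for the web, and the `example.com` addresses are all hypothetical, and "indexing" is reduced to recording the order in which pages were fetched.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were fetched."""
    seen = {start_url}          # URL strings already queued
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)       # download the page (None if unavailable)
        if html is None:
            continue
        visited.append(url)     # stand-in for indexing the page
        parser = LinkExtractor()
        parser.feed(html)       # gather the URLs present in the page
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site held in memory instead of fetched over HTTP.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

order = crawl("http://example.com/", PAGES.get)
```

Note that the `seen` set compares URLs as plain strings, so two URLs are treated as the same page only when their strings match exactly.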
Some web sites track users as they download different pages on the web site. User tracking is useful for identifying user behavior, such as identifying purchasing behavior by tracking the user through various page requests on a shopping-oriented web site.
Two methods are commonly used to track user behavior: use of cookies to maintain information and embedding session identifiers in the uniform resource locators (URLs) in the web pages presented to the user. An embedded session identifier, in particular, may include a string of random characters embedded in the URLs returned to the user. Specifically, when the user requests a page of a web site with a URL that does not have a session identifier, a session identifier is created for this user and the user receives a version of the entry web page in which links on the page are annotated by the session identifier. When the user selects a link, the web server parses the session identifier from the URL, attaches the same session identifier to the local links on the next generated web page, and returns that web page to the user. The web server continues to parse and attach the session identifiers as long as the user requests a page whose URL has a session identifier.
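The parse-and-attach behavior described above can be illustrated with a short sketch. It assumes, purely for illustration, that the session identifier travels in a query parameter named `sid` and that a page is represented by its list of local links; the `serve`, `annotate`, and `get_session_id` names and the `shop.example` address are hypothetical. Real web servers may instead embed the identifier in the URL path or elsewhere.

```python
import secrets
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAM = "sid"  # assumed parameter name, not taken from any real site


def get_session_id(url):
    """Parse the session identifier out of a requested URL, if present."""
    query = dict(parse_qsl(urlsplit(url).query))
    return query.get(SESSION_PARAM)


def annotate(link, session_id):
    """Attach the session identifier to one local link."""
    parts = urlsplit(link)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != SESSION_PARAM]
    query.append((SESSION_PARAM, session_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def serve(requested_url, local_links):
    """Return the session identifier in use and the annotated links."""
    # Reuse the identifier from the request, or create one for a new user.
    session_id = get_session_id(requested_url) or secrets.token_hex(8)
    return session_id, [annotate(link, session_id) for link in local_links]


# First request carries no identifier, so one is created.
sid, links = serve("http://shop.example/", ["/cart", "/item?id=7"])

# Following an annotated link keeps the same identifier.
sid_next, links_next = serve(links[0], ["/checkout"])
```

Because the identifier is random, every request that arrives without one yields a differently annotated set of links.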
As an example of the use of session identifiers, consider the situation of a web spider crawling a first web site that contains multiple URLs that do not include session identifiers. The spider may decide to crawl these URLs, which may point to a second web site, one after the other. For each URL it requests, the spider may receive a page whose URLs are annotated with a session identifier. Each requested page, however, may include a different session identifier. The spider would then extract these annotated URLs from the pages. If two URLs are identical except for the session identifier, the spider would not recognize them as referring to the same page, because the URL strings differ. The spider would thus repeatedly crawl the same web pages, wasting its time and bandwidth and filling the search engine's index with duplicate pages, which in turn wastes storage space.
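The duplication problem in this example can be made concrete with a small sketch. The two URLs, the `sid` parameter name, and the `without_session` helper are hypothetical and exist only to show that string comparison treats the two URLs as distinct even though they differ solely in the session identifier. The helper assumes the parameter name is already known, which a spider in general does not know in advance.

```python
from urllib.parse import parse_qsl, urlsplit

SESSION_PARAM = "sid"  # assumed parameter name, for illustration only

# Two links to the same page, extracted from two separately requested pages.
url_a = "http://shop.example/item?id=7&sid=3f9a1c"
url_b = "http://shop.example/item?id=7&sid=b27e40"

# A spider that deduplicates by URL string keeps both entries.
seen = set()
for url in (url_a, url_b):
    seen.add(url)


def without_session(url):
    """Return the URL's components with the assumed session parameter removed."""
    parts = urlsplit(url)
    query = tuple((k, v) for k, v in parse_qsl(parts.query) if k != SESSION_PARAM)
    return (parts.scheme, parts.netloc, parts.path, query)


# Number of redundant fetches the string-based check fails to prevent.
duplicate_fetches = len(seen) - len({without_session(u) for u in seen})
```

Here `seen` ends up holding both strings, while comparing the URLs with the identifier removed shows they refer to a single page.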
Thus, there is a need in the art to effectively identify web sites that contain session identifiers in order to improve web crawling.