1. Field of the Invention
The present invention generally relates to techniques for systematically locating and monitoring information on the Internet, and in particular to a genre of such techniques known as “web crawlers.”
2. Background Description
Web crawlers are programs used to find, explore, and monitor content on the World Wide Web (WWW). They are the primary means by which most data-mining applications, such as search engines, discover and monitor WWW content. Due to the distributed nature of the WWW, crawling currently represents the best method for understanding how content on the WWW changes.
The WWW is a large connected graph of HyperText Markup Language (HTML) pages distributed over many computers connected via a network. The pages are connected and accessed by Uniform Resource Locators (URLs), which serve as the addresses of the HTML pages.
A crawler is seeded with a set of URLs. These URLs are placed in a queue. For each URL in the queue, the program downloads the corresponding page and extracts the external URLs referenced on that page, before proceeding to the page of the next URL in the queue. Each extracted URL is then added to the end of the queue, joining the URLs with which the crawler was seeded. This process repeats indefinitely. The URLs collected and queued in this fashion form a WWW graph, wherein each URL is linked to a seed URL or to another URL on whose page it was found, and to the other URLs referenced on its own page.
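The queue-based procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular crawler: the `PAGES` dictionary and the `fetch_links` function are hypothetical stand-ins for downloading a page and extracting its URLs, and the `seen` set and `max_pages` limit are additions that keep the example finite.

```python
from collections import deque

# Hypothetical in-memory stand-in for the WWW: each URL maps to the
# URLs referenced on its page. A real crawler would download and parse HTML.
PAGES = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://c.example/": ["http://d.example/", "http://a.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Hypothetical helper: 'download' a page and return the URLs it references."""
    return PAGES.get(url, [])


def crawl(seeds, max_pages=100):
    """Visit pages in queue order, recording the visit order and the link graph."""
    queue = deque(seeds)      # the crawler is seeded with a set of URLs
    seen = set(seeds)         # assumption: do not queue the same URL twice
    order = []                # URLs in the order their pages were downloaded
    edges = []                # (source URL, referenced URL) pairs forming the graph
    while queue and len(order) < max_pages:
        url = queue.popleft()             # take the next URL from the front
        order.append(url)
        for link in fetch_links(url):
            edges.append((url, link))
            if link not in seen:
                seen.add(link)
                queue.append(link)        # extracted URLs go to the end of the queue
    return order, edges


if __name__ == "__main__":
    visited, graph = crawl(["http://a.example/"])
    print(visited)
    print(graph)
```

The `edges` list is one possible representation of the WWW graph described above, in which each URL is tied to the page on which it was found.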
The foregoing crawling algorithm describes a breadth-first exploration of the WWW graph. Other methods of exploring the content of the WWW may use depth-first searches or hybrid solutions.
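The difference between breadth-first and depth-first exploration comes down to which pending URL is taken next. The sketch below illustrates this on a small hypothetical link graph; the `LINKS` table, the single-letter page names, and the `strategy` parameter are assumptions made for the example.

```python
from collections import deque

# Hypothetical link graph: page name -> pages it references.
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def explore(seed, strategy="breadth"):
    """Return the order in which pages are visited under the given strategy."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # Breadth-first takes the oldest pending page (queue behavior);
        # depth-first takes the most recently added page (stack behavior).
        node = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(node)
        for nxt in LINKS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


if __name__ == "__main__":
    print(explore("a", "breadth"))
    print(explore("a", "depth"))
```

With this graph, the breadth-first order is a, b, c, d, e, while the stack-based depth-first order is a, c, e, b, d. A hybrid solution could, for example, switch between the two behaviors based on depth, although the form such hybrids take is not specified here.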
The problem with current crawlers is that they have finite resources and can get caught in infinite loops while traversing the changing WWW graph. Following one URL can bring up a page containing other URLs, each of which can bring up further pages, and so on. Because these pages and URLs can be generated dynamically (“dynamic content”) at the time of the request, a crawler can be faced with exploring an infinite graph.
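The following sketch illustrates how dynamically generated URLs can produce an unbounded graph. The server behavior is hypothetical: `dynamic_links` models a site (the invented `calendar.example`) whose every page links to a freshly generated "next" page. The `budget` parameter is an assumption added so that the example terminates; it is not presented as a solution to the problem.

```python
from collections import deque
from urllib.parse import urlsplit, parse_qs


def dynamic_links(url):
    """Hypothetical server: each page links to a newly generated 'next' page."""
    query = parse_qs(urlsplit(url).query)
    n = int(query.get("page", ["0"])[0])
    return [f"http://calendar.example/view?page={n + 1}"]


def crawl_bounded(seed, budget):
    """Crawl until the page budget is spent; return visited URLs and queue length."""
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < budget:
        url = queue.popleft()
        visited.append(url)
        for link in dynamic_links(url):
            # Every generated URL is new, so duplicate detection never helps here.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited, len(queue)


if __name__ == "__main__":
    pages, pending = crawl_bounded("http://calendar.example/view?page=0", budget=5)
    print(pages[-1], pending)
```

However large the budget is made, the queue is never empty when the crawl stops, because each visited page contributes a URL that has not been seen before.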
When users or web crawlers make a request for a web page via its URL, the request is sent to a web server responsible for returning the requested HTML page. In the early days of the WWW, these web pages were stored as files on the permanent storage of the web server. The web server was simply a “file server,” and there was a one-to-one mapping between a URL and a specific web page. Today, web servers do not necessarily serve back stored files. Often the page is generated “on the fly” based on a number of parameters (URL with parameter string, cookies, time of day, user information, information in a database, prior history, etc.). These parameters are infinite in their variety and values. Pages created in this manner are commonly referred to as “dynamic content,” as opposed to the early “static content,” which consisted simply of unchanging web files.
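The contrast between static and dynamic content can be sketched as two request handlers. Everything named here is hypothetical and chosen only for illustration: the `STATIC_FILES` table, the `topic` URL parameter, the `user` cookie, and the greeting logic. Only three of the parameters listed above (URL parameter string, cookies, and time of day) are modeled.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical stored files: one URL path maps to exactly one page.
STATIC_FILES = {"/index.html": "<html>Welcome</html>"}


def serve_static(url):
    """Static content: the same URL always returns the same stored file."""
    return STATIC_FILES[urlsplit(url).path]


def serve_dynamic(url, cookies, hour):
    """Dynamic content: the page is built at request time from several inputs."""
    params = parse_qs(urlsplit(url).query)      # URL parameter string
    topic = params.get("topic", ["home"])[0]
    name = cookies.get("user", "guest")         # cookie value
    greeting = "Good morning" if hour < 12 else "Good evening"  # time of day
    return f"<html>{greeting}, {name}. Topic: {topic}</html>"


if __name__ == "__main__":
    url = "http://site.example/page?topic=news"
    print(serve_static("http://site.example/index.html"))
    print(serve_dynamic(url, {"user": "alice"}, hour=9))
    print(serve_dynamic(url, {}, hour=20))
```

In this sketch the same URL yields different pages depending on the cookie and the hour, so the one-to-one mapping between a URL and a page no longer holds.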