The World Wide Web is a rich source of information. Today, there are estimated to be over one trillion unique web pages. Many of these pages are dynamically created (e.g., the home page of the New York Times) and have links to embedded content such as images and videos. To fully index these web pages, an indexing process must render them as a web browser would, i.e., as they exist when they are created and served. While it is relatively straightforward for a web browser to render a single web page or a small number of web pages in real time (i.e., as they are created), it is much more difficult for a web page indexing process to render a large number of pages in real time, such as all of the pages on the World Wide Web (1 trillion pages) or even just the top 1% of those pages (10 billion pages).
To completely render a received web page, the content of all of the external resources embedded in the web page must first be obtained. Such resources may include, but are not limited to, external images, JavaScript code, and style sheets. Often, the same external resource is embedded in many different web pages. For example, the Urchin JavaScript code, available from Google, Inc., is embedded in tens of millions of different web pages. Whenever any one of these web pages is rendered, the Urchin JavaScript code is downloaded from a Google server. While it is efficient for a single user's web browser to request an external web page resource such as the Urchin JavaScript code in real time (i.e., when the page in which the resource is embedded is rendered), it is neither feasible nor efficient for the rendering engine of a web page image indexing process to do so. Such a rendering engine is designed to render a large number of web pages at a time, and to do so continually, in order to build a large index or repository of imaged web pages. If the rendering engine attempted to render thousands or tens of thousands of web pages that embed the same external resource at the same time or close together in time, the server on which the external resource resides would be flooded with near-simultaneous requests for the same object. To avoid such problems, the rendering engine of a web page image indexing process should ideally crawl each embedded resource exactly once, regardless of how many web pages embed the resource, and should render web pages in a way that does not require the external resources to be gathered in real time.
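The crawl-once goal described above can be illustrated with a minimal Python sketch. It is not the method of any particular indexing system: the names `ResourceStore`, `render_pages`, and `fake_fetch`, and the `example` URLs, are hypothetical, and "rendering" is reduced to summing the stored bytes of each page's embedded resources. The sketch assumes the set of embedded resource URLs for each page is already known, and it separates gathering resources (one fetch per distinct URL) from rendering (reads from the store only).

```python
from typing import Callable, Dict, Iterable, List


class ResourceStore:
    """Holds the content of each external resource, fetched at most once."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch  # hypothetical fetcher, e.g. a wrapper around an HTTP GET
        self._content: Dict[str, bytes] = {}
        self.fetch_count: Dict[str, int] = {}  # bookkeeping for this illustration

    def crawl(self, urls: Iterable[str]) -> None:
        """Fetch every URL that is not already stored; skip the rest."""
        for url in urls:
            if url in self._content:
                continue
            self._content[url] = self._fetch(url)
            self.fetch_count[url] = self.fetch_count.get(url, 0) + 1

    def get(self, url: str) -> bytes:
        """Return stored content without issuing any request."""
        return self._content[url]


def render_pages(pages: Dict[str, List[str]], store: ResourceStore) -> Dict[str, int]:
    """Stand-in for rendering. `pages` maps a page URL to its embedded resource URLs.

    Phase 1 crawls the distinct resources; phase 2 processes each page using
    only the store. Returns the total embedded bytes per page.
    """
    distinct = {url for resources in pages.values() for url in resources}
    store.crawl(sorted(distinct))
    return {
        page: sum(len(store.get(url)) for url in resources)
        for page, resources in pages.items()
    }


def fake_fetch(url: str) -> bytes:
    """Assumed stand-in for a network request; returns the URL itself as bytes."""
    return url.encode("utf-8")


# 1,000 hypothetical pages that all embed the same script.
SHARED = "http://resources.example/shared.js"
pages = {f"http://site.example/page{i}": [SHARED] for i in range(1000)}
store = ResourceStore(fake_fetch)
rendered = render_pages(pages, store)
```

In this sketch the shared script is requested once even though 1,000 pages embed it, and the second phase never calls the fetcher. A production design would also have to consider resource freshness and concurrent access, which are outside the scope of this illustration.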