The evolution of computers and networking technologies from high-cost, low-performance data processing systems to low-cost, high-performance communication, problem-solving and entertainment systems has provided a cost-effective and time-saving means to lessen the burden of performing everyday tasks such as correspondence, bill paying, shopping, budgeting and information gathering. For example, a computing system interfaced to the Internet, via wired or wireless technology, can provide a user with a channel for nearly instantaneous access to a wealth of information from a repository of web sites and servers located around the world.
Typically, the information available via web sites and servers is accessed via a web browser executing on a web client (e.g., a computer). For example, a web user can launch a web browser and access a web site by entering the web site's Uniform Resource Locator (URL) (e.g., a web address and/or an Internet address) into an address bar of the web browser and pressing the enter key on a keyboard or clicking a “go” button with a mouse. The URL typically includes four pieces of information that facilitate access: a protocol (a language for computers to communicate with each other) that indicates a set of rules and standards for the exchange of information, a location of the web site, a name of an organization that maintains the web site, and a suffix (e.g., com, org, net, gov and edu) that identifies the type of organization.
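The four pieces of information described above can be illustrated with a minimal Python sketch. The function name `describe_url` and the address `http://www.example.com/index.html` are hypothetical, chosen only for illustration. The sketch also assumes a simple host name in which the last label is the suffix and the label before it is the organization name, which does not hold for every real-world address.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the four pieces of information described in the text.

    Assumption: the host name has the simple form location.organization.suffix;
    multi-part suffixes (e.g. co.uk) are not handled by this sketch.
    """
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        "protocol": parts.scheme,                      # e.g. "http"
        "location": ".".join(labels[:-2]),             # e.g. "www"
        "organization": labels[-2] if len(labels) >= 2 else "",
        "suffix": labels[-1] if labels else "",        # e.g. "com"
    }

info = describe_url("http://www.example.com/index.html")
# info == {"protocol": "http", "location": "www",
#          "organization": "example", "suffix": "com"}
```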
In some instances, the user knows, a priori, the name of the site or server, and/or the URL of the site or server that the user desires to access. In such situations, the user can access the site, as described above, by entering the URL in the address bar and connecting to the site. However, in most instances, the user does not know the URL or the site name. Instead, the user employs a search engine to facilitate locating a site based on keywords provided by the user. In general, the search engine comprises executable applications or programs that search the contents of web sites and servers for keywords, and return a list of links to web sites and servers where the keywords are found. The search engine incorporates a web “crawler” (also known as a “spider” or a “robot”) that retrieves as many documents as possible from their associated URLs. The retrieved documents are then stored such that an indexer can process them. The indexer reads the documents and builds an inverted index that maps each word to the documents in which it appears. Respective search engines generally employ a proprietary algorithm to create indices such that meaningful results are returned for a query.
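The indexing step described above can be sketched as follows. This is a minimal illustration, not the proprietary algorithm of any search engine: it assumes the crawler has already stored the retrieved documents in a dictionary keyed by URL, and the names `tokenize`, `build_inverted_index` and `search`, along with the sample documents, are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each word to the set of URLs whose document contains it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Return the sorted URLs of documents containing every query keyword."""
    words = tokenize(query)
    if not words:
        return []
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return sorted(results)

# Hypothetical documents standing in for pages a crawler has retrieved.
documents = {
    "http://www.example.com/a": "Web crawlers retrieve documents",
    "http://www.example.com/b": "An indexer builds an inverted index",
    "http://www.example.com/c": "Crawlers and the indexer cooperate",
}
index = build_inverted_index(documents)
hits = search(index, "indexer crawlers")  # only page "c" contains both words
```

A set per word keeps the lookup for a multi-keyword query to a sequence of set intersections. A production index would typically also store word positions and ranking data, which this sketch omits.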
Thus, a web crawler is crucial to the operation of search engines. In order to provide current and up-to-date search results, the crawler must constantly search the web to find new web pages, to update old web page information, and to remove deleted pages. The number of web pages found on the Internet is astronomical, so a web crawler must be extremely fast. Since most web crawlers gather their data by polling the servers that provide the web pages, a crawler must also be as unobtrusive as possible when accessing a particular server. Otherwise, the crawler can quickly consume all of the server's resources and cause the server to shut down. Generally, a crawler identifies itself to a server and seeks permission before accessing the server's web pages. At this point, a server can deny access to an abusive crawler that consumes all of the server's resources. A server that hosts web pages typically benefits from search engines, because search engines allow users to find the server's web pages more easily. Thus, most servers welcome crawlers, as long as the crawlers do not drain all of the server's resources, so that the server's contents can be better exploited by users.
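One way a crawler can seek permission and remain unobtrusive is sketched below. This is an assumption for illustration rather than the mechanism the text necessarily has in mind: it uses the widely adopted robots.txt convention, parsed with Python's standard `urllib.robotparser`, together with a minimum delay between requests to the same server. The class name `PoliteGate`, the user agent `ExampleCrawler`, and the sample rules are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

class PoliteGate:
    """Decide whether, and how soon, a crawler may request a page from one server."""

    def __init__(self, robots_txt, user_agent, default_delay=1.0):
        self.user_agent = user_agent
        self.parser = RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        # Honor the server's Crawl-delay if it states one; otherwise use a default.
        delay = self.parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self.last_request = None

    def allowed(self, url):
        """True if the server's rules permit this crawler to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, now=None):
        """Time remaining before the next request would respect the delay."""
        now = time.monotonic() if now is None else now
        if self.last_request is None:
            return 0.0
        return max(0.0, self.delay - (now - self.last_request))

    def record_request(self, now=None):
        """Note that a request was just sent to the server."""
        self.last_request = time.monotonic() if now is None else now

# Hypothetical rules a server might publish for all crawlers.
robots_txt = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
gate = PoliteGate(robots_txt, "ExampleCrawler")
```

In use, the crawler would call `allowed` before each fetch, sleep for `seconds_to_wait()`, send the request, and then call `record_request`. The optional `now` argument exists only so the timing logic can be exercised without real waiting.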