The evolution of computers and networking technologies from high-cost, low-performance data processing systems to low-cost, high-performance communication, problem-solving and entertainment systems has provided a cost-effective and time-saving means to lessen the burden of performing everyday tasks such as correspondence, bill paying, shopping, budgeting and information gathering. For example, a computing system interfaced to the Internet, via wired or wireless technology, can provide a user with a channel for nearly instantaneous access to a wealth of information from a repository of web sites and servers located around the world.
Typically, the information available via web sites and servers is accessed via a web browser executing on a web client (e.g., a computer). For example, a web user can deploy a web browser and access a web site by entering the web site Uniform Resource Locator (URL) (e.g., a web address and/or an Internet address) into an address bar of the web browser and pressing the enter key on a keyboard or clicking a “go” button with a mouse. The URL typically includes four pieces of information that facilitate access: a protocol (a language for computers to communicate with each other) that indicates a set of rules and standards for the exchange of information, a location of the web site, a name of an organization that maintains the web site, and a suffix (e.g., com, org, net, gov, and edu) that identifies the type of organization.
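As an illustration of the URL components described above, the following minimal sketch splits a URL into its protocol, location, organization name and suffix using the Python standard library. The function name describe_url and the address http://www.example.org/index.html are hypothetical and are used only for illustration; the sketch also assumes a simple host name whose last two labels are the organization name and the suffix, which does not hold for every address.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the pieces discussed in the text (illustrative only)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    return {
        # Protocol: the rules used for the exchange of information.
        "protocol": parts.scheme,
        # Location: the host that serves the web site.
        "host": host,
        # Assumption: the last label is the suffix (com, org, net, ...).
        "suffix": labels[-1] if len(labels) > 1 else "",
        # Assumption: the label before the suffix names the organization.
        "organization": labels[-2] if len(labels) > 1 else host,
        # Path of the requested document on that host.
        "path": parts.path,
    }

# Hypothetical address used only for demonstration.
info = describe_url("http://www.example.org/index.html")
```

Here info maps "protocol" to "http", "organization" to "example" and "suffix" to "org".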
In some instances, the user knows, a priori, the name of the site or server, and/or the URL to the site or server that the user desires to access. In such situations, the user can access the site, as described above, by entering the URL in the address bar and connecting to the site. However, in many instances, the user does not know the URL or the site name. Instead, the user employs a search engine to facilitate locating a site based on keywords provided by the user. In general, the search engine comprises executable applications or programs that search the contents of web sites and servers for keywords, and return a list of links to web sites and servers where the keywords are found. The search engine incorporates a web “crawler” (also known as a “spider” or a “robot”) that retrieves as many documents as possible (e.g., via retrieving URLs associated with the documents). This information is then stored such that an indexer can manipulate the retrieved data. The indexer reads the documents, and creates a prioritized index based on the keywords contained in each document and other attributes of the document. Respective search engines generally employ a proprietary algorithm to create indices such that meaningful results are returned for a query.
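The crawler and indexer arrangement described above can be sketched, in greatly simplified form, as follows. This is not any search engine's actual algorithm, since those are proprietary: the sketch assumes a hypothetical in-memory set of pages (PAGES) in place of real network retrieval, and it prioritizes results by a plain keyword count. The names crawl, build_index and search are likewise illustrative.

```python
from collections import Counter, deque
import re

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://a.example/": ("web crawler basics", ["http://b.example/"]),
    "http://b.example/": ("crawler crawler index",
                          ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("cooking recipes", []),
}

def crawl(seed, fetch):
    """Retrieve every document reachable from seed, breadth first."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # page could not be retrieved
            continue
        text, links = page
        docs[url] = text
        for link in links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return docs

def build_index(docs):
    """Map each keyword to the documents containing it, with counts."""
    index = {}
    for url, text in docs.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for word, count in Counter(words).items():
            index.setdefault(word, {})[url] = count
    return index

def search(index, keyword):
    """Return matching URLs, highest keyword count first."""
    postings = index.get(keyword.lower(), {})
    return sorted(postings, key=lambda u: (-postings[u], u))

docs = crawl("http://a.example/", PAGES.get)
index = build_index(docs)
results = search(index, "crawler")
```

With the sample pages, results lists http://b.example/ ahead of http://a.example/ because the keyword appears twice in the former and once in the latter.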
Thus, a web-crawler is crucial to the operation of search engines. In order to provide current and up-to-date search results, the crawler must constantly search the web to find new web pages, to update old web page information, and to remove deleted pages. The number of web pages found on the Internet is astronomical. A web-crawler must therefore be extremely fast. Since most web-crawlers gather their data by polling servers that provide the web pages, a crawler must also be as unobtrusive as possible when accessing a particular server. In the extreme, the crawler can absorb all of the server's resources very quickly and cause the server to shut down. Generally, a crawler identifies itself to a server and seeks permission before accessing a server's web pages. At this point, a server can deny access to an abusive crawler that consumes all of the server's resources. A web page hosting server typically benefits from search engines, because search engines allow users to find the server's web pages more easily. Thus, most servers welcome crawlers, as long as the crawlers do not drain too much of the server's resources, which can impede users' ability to exploit server contents.
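The text above does not name a particular mechanism by which a crawler identifies itself and seeks permission. One widely used convention is a robots.txt file published by the server, and the following sketch assumes that convention. The crawler name ExampleBot, the rules in ROBOTS_TXT and the URLs are hypothetical; the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules a server might publish for crawlers.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 2

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(agent, url):
    """Ask whether the named crawler is permitted to retrieve url."""
    return parser.can_fetch(agent, url)

# ExampleBot is admitted everywhere except /private/.
allowed = may_fetch("ExampleBot", "http://www.example.org/index.html")
blocked = may_fetch("ExampleBot", "http://www.example.org/private/data.html")

# Any other crawler is denied access to the whole server.
stranger = may_fetch("OtherBot", "http://www.example.org/index.html")

# Seconds the server asks ExampleBot to wait between requests.
delay = parser.crawl_delay("ExampleBot")
```

A crawler that honors the returned delay between successive requests to the same server reduces the risk of draining that server's resources.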
The sheer volume of information on the Internet today presents a seemingly insurmountable obstacle to efficient web-crawling. For example, a typical web-crawler attempting to catalogue every page on the Internet can take weeks or even months to plod through them. A page that is updated a moment after it has been crawled might not be recrawled for months, in which case the information associated with the page is not accurately catalogued, which in turn reduces the efficiency with which a user can receive information relevant to a search. Thus, there is an unmet need in the art for systems and methods that improve web-crawling speed and efficiency.