The evolution of computers and networking technologies from high-cost, low-performance data processing systems to low-cost, high-performance communication, problem-solving and entertainment systems has provided a cost-effective and time-saving means to lessen the burden of performing everyday tasks such as correspondence, bill paying, shopping, budgeting and information gathering. For example, a computing system interfaced to the Internet, via wired or wireless technology, can provide a user with nearly instantaneous access to a wealth of information from a repository of web sites and servers located around the world.
Typically, the information available via web sites and servers is accessed with a web browser executing on a web client (e.g., a computer). For example, a web user can launch a web browser and access a web site by entering the web site's Uniform Resource Locator (URL) (e.g., a web address and/or an Internet address) into the address bar of the web browser and pressing the enter key on a keyboard or clicking a “go” button with a mouse. The URL typically includes four pieces of information that facilitate access: a protocol that indicates the set of rules and standards for the exchange of information (i.e., a language for computers to communicate with each other), the name of the organization that maintains the web site, a suffix (e.g., com, org, net, gov or edu) that identifies the type of organization, and a location of the desired resource within the web site.
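By way of illustration only, the following sketch uses Python's standard urllib.parse module to decompose a hypothetical URL into the pieces described above; the example address and the simple suffix-splitting logic are assumptions for illustration, not the behavior of any particular browser.

    from urllib.parse import urlparse

    url = "http://www.example.org/index.html"   # hypothetical address
    parts = urlparse(url)

    protocol = parts.scheme              # "http": rules for exchanging information
    host = parts.hostname                # "www.example.org"
    organization = host.split(".")[-2]   # "example": the maintaining organization
    suffix = host.split(".")[-1]         # "org": the type of organization
    location = parts.path                # "/index.html": location within the site

    print(protocol, organization, suffix, location)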
In some instances, the user knows, a priori, the name of the site or server and/or the URL of the site or server that the user desires to access. In such situations, the user can access the site, as described above, by entering the URL in the address bar and connecting to the site. However, in most instances, the user does not know the URL or the site name. Instead, the user employs a search engine to facilitate locating a site based on keywords provided by the user. In general, the search engine comprises executable applications or programs that search the contents of web sites and servers for keywords, and return a list of links to web sites and servers where the keywords are found. Basically, the search engine incorporates a web “crawler” (also known as a “spider” or a “robot”) that retrieves as many documents as possible, along with their associated URLs. This information is then stored such that an indexer can manipulate the retrieved data. The indexer reads the documents and creates a prioritized index based on the keywords contained in each document and other attributes of the document. Respective search engines generally employ proprietary algorithms to create indices such that meaningful results are returned for a query.
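By way of illustration only, the following is a minimal, hypothetical sketch of the crawl-and-index pipeline described above: fetch documents, extract links and words, and build a keyword-to-URL index. The seed URL, page limit and flat indexing scheme are illustrative assumptions and do not reflect any search engine's proprietary prioritization algorithm.

    from collections import defaultdict, deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkAndTextParser(HTMLParser):
        """Collects hyperlinks and visible text from one HTML document."""

        def __init__(self):
            super().__init__()
            self.links = []
            self.words = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

        def handle_data(self, data):
            self.words.extend(data.lower().split())

    def crawl_and_index(seed_url, max_pages=10):
        index = defaultdict(set)            # keyword -> set of URLs (the index)
        frontier, seen = deque([seed_url]), set()
        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except (OSError, ValueError):
                continue                    # unreachable or malformed URL; skip it
            parser = LinkAndTextParser()
            parser.feed(html)
            for word in parser.words:
                index[word].add(url)        # indexer step: keyword -> documents
            for link in parser.links:
                frontier.append(urljoin(url, link))  # crawler step: new URLs
        return index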
Thus, a web crawler is crucial to the operation of search engines. In order to provide current and up-to-date search results, the crawler must constantly traverse the web to find new web pages, to update old web page information, and to remove deleted pages. Because the number of web pages found on the Internet is astronomical, a web crawler must be extremely fast. Since most web crawlers gather their data by polling the servers that provide the web pages, a crawler must also be as unobtrusive as possible when accessing a particular server; otherwise, the crawler can quickly consume all of the server's resources and cause the server to shut down. Generally, a crawler identifies itself to a server and seeks permission before accessing the server's web pages. At this point, the server can deny access to an abusive crawler that would otherwise monopolize its resources. A web page hosting server typically benefits from search engines because they allow users to find its web pages more easily. Thus, most servers welcome crawlers, so long as the crawlers do not drain the server's resources, because the crawlers make the server's contents more discoverable to users.
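The identify-and-seek-permission step described above is commonly implemented with the robots exclusion protocol, i.e., a robots.txt file hosted by the server. By way of illustration only, the following sketch uses Python's standard urllib.robotparser module to honor such a file; the crawler name, server URL and one-second fallback delay are illustrative assumptions.

    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser("http://www.example.org/robots.txt")  # hypothetical server
    robots.read()                  # fetch and parse the server's declared access rules

    agent = "ExampleCrawler"       # the name under which the crawler identifies itself
    page = "http://www.example.org/private/index.html"

    if robots.can_fetch(agent, page):
        delay = robots.crawl_delay(agent) or 1.0   # assumed default pause between requests
        print(f"access permitted; waiting {delay}s between requests")
    else:
        print("the server has placed this area off limits to crawlers")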
One of the downsides to a crawler identifying itself to a server is that the server can then “spoof” the crawler. Servers usually have protected areas that they do not want exposed to the general Internet. When a crawler identifies itself, it is also told which areas it cannot access. If the crawler wants to maintain a working relationship with that particular server, it abides by the server's requests. However, if a server desires to spoof or disguise its true contents, it can refer the crawler to an area of pages that mimic true URLs of that server but that contain “alternate” contents. Thus, a server that normally provides information only about cats can set up its URLs with information about dogs in a section that only web crawlers access. This is done so that when a user searches for “dogs,” the search engine will list the server's web pages, which are actually about cats. Typically, spoofing is utilized when a server's content is deemed objectionable by society but the server desires to proliferate its contents beyond its normal “keywords.” In this manner, objectionable material can be returned in a search engine's results list through the use of common words such as flowers, dogs, cats and weather. Spoofing diminishes both the accuracy and the reputation of search engines that utilize the spoofed web crawler data.
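Because such spoofing keys off the identity the crawler presents, one illustrative way to surface it is to fetch the same URL while identifying as a crawler and as an ordinary browser, and then compare the results. The following sketch assumes the server differentiates on the User-Agent header; the agent strings and URL are hypothetical, and this is a minimal illustration rather than a definitive detection method.

    from urllib.request import Request, urlopen

    def fetch_as(url, user_agent):
        request = Request(url, headers={"User-Agent": user_agent})
        return urlopen(request, timeout=5).read()

    url = "http://www.example.org/pets.html"          # hypothetical page
    as_crawler = fetch_as(url, "ExampleCrawler/1.0")  # identify as a crawler
    as_browser = fetch_as(url, "Mozilla/5.0")         # identify as a browser

    if as_crawler != as_browser:
        print("server returned different content to the crawler: possible spoofing")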