1. Field
Aspects disclosed herein relate to information searching, and more particularly to systems and methods relating to detecting Internet spiders and web crawlers.
2. Related Art
Internet users are increasingly finding navigating document collections to be difficult because of the increasing size of such collections. Likewise, companies, individuals and other organizations wishing to be found by Internet users face growing challenges with maintaining their online visibility. For example, it is estimated that the World Wide Web on the Internet includes more than 11 billion pages in the publicly indexable Web across more than 110 million web sites. Consequently, finding desired information in such a large collection, unless the identity, location, or characteristics of a specific document or search target are well known, can be much like looking for a needle in a haystack. The World Wide Web is a loosely interlinked collection of documents (mostly text and images) located on servers distributed over the Internet. Generally speaking, each document has an address, or Uniform Resource Locator (URL), in the exemplary form “http://www.server.net/directory/file.html”. In that notation, the “http:” specifies the protocol by which the document is to be delivered, in this case the “HyperText Transfer Protocol.” The “www.server.net” specifies the name of a computer, or server, on which the document resides; “directory” refers to a directory or folder on the server in which the document resides; and “file.html” specifies the name of the file. URLs can be extremely long, complex strings of machine-readable code.
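By way of non-limiting illustration, the components of the exemplary URL described above can be separated with a standard URL parser. The following is a minimal sketch using Python's standard library; the helper name split_url is hypothetical and is introduced here only for illustration.

```python
from urllib.parse import urlsplit
import posixpath


def split_url(url):
    """Return (protocol, server, directory, file) for a simple URL.

    split_url is a hypothetical helper used only for illustration.
    """
    parts = urlsplit(url)
    # Separate the path into its directory portion and its final file name.
    directory, filename = posixpath.split(parts.path)
    return parts.scheme, parts.netloc, directory.lstrip("/"), filename


protocol, server, directory, filename = split_url(
    "http://www.server.net/directory/file.html"
)
print(protocol, server, directory, filename)
```

For the exemplary URL, the sketch yields the protocol “http”, the server “www.server.net”, the directory “directory”, and the file “file.html”.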
Many documents on the Web are in markup language (e.g., HTML), which allows for formatting to be applied to the document, external content (such as images and other multimedia data types) to be introduced within the document, and “hotlinks” or “links” to other documents to be placed within the document, among other things. “Hotlinking” allows a user to navigate between documents on the Web simply by selecting an item of interest within a page. For example, a Web page about reprographic technology might have a hotlink to the Xerox corporate web site. By selecting the hotlink (often by clicking a marked word, image, or area with a pointing device, such as a mouse), the user's Web browser is instructed to follow the hotlink (usually via a URL, frequently invisible to the user, associated with the hotlink) and read a different document. A user cannot be expected to know or remember a URL for each and every document on the Internet, or even URLs for those documents in a smaller collection of preferred documents.
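As a rough illustration of how the URL associated with a hotlink may be recovered from a markup-language document, the following sketch collects the href attribute of each anchor element using Python's standard html.parser module. The class name LinkCollector, the function name extract_links, and the sample markup are assumptions made for this example only.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the target URL of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the hotlink URLs found in the given markup, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Sample markup (illustrative only), reusing the exemplary URL form.
PAGE = (
    '<p>See <a href="http://www.server.net/directory/file.html">this page</a> '
    'and <a href="/directory/other.html">another page</a>.</p>'
)
links = extract_links(PAGE)
```

The URLs themselves are not displayed in the rendered text of the sample page, which is consistent with the observation that a hotlink's URL is frequently invisible to the user.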
Accordingly, navigation assistance is not only helpful, but important for practical Internet usage. Such navigation assistance is typically provided via an Internet-based search engine, such as Google®, Microsoft's Bing®, Yahoo!® or the like. When an Internet user desires to find information about a company, individual or organization, the Internet user will frequently turn to a “search engine” to locate the information. A search engine serves as an index into the content stored on the Internet.
“Google” (www.google.com) is an example of a search engine. It operates in a similar manner to traditional keyword-based search engines, in that a search begins by the user's entry of one or more search terms used in a pattern-matching analysis of documents on the Web. It differs from traditional keyword-based search engines in that search results are ranked based on a metric of page “importance,” which differs from the number of occurrences of the desired search terms (and simple variations upon that theme). Regardless of the proprietary nature of any given search engine's approach, Internet users searching for companies, individuals or organizations with similar characteristics (e.g., name, industry, etc.) often receive search results that are inaccurate, or relate to entities other than the intended search target. For example, a search for “John Smith”, with the intention of obtaining information about a particular person by that name, will return many results about different “John Smiths”, such that the desired John Smith may not have any relevant results. In these instances, the Internet user may build more complex search queries to generate more relevant results, which is only possible if the Internet user possesses information that can be used as a basis for such queries.
It is understood and well documented that it is desirable for companies, individuals and organizations to appear early in search results for personal, financial and other reasons. Prominence in search results for a given term or terms in search engines is a form of third party validation, at least in that Internet users place a higher value on entries in top search results because of their perceived relevance, success, and size. Consequently, viewership of, and click-throughs on, search results appearing on subsequent results pages decline precipitously.
Search Engine Optimization (SEO) has emerged as a category of services available to operators of web sites. SEO provides for deliberately engineering prominent placement in search results by tailoring web sites to the algorithms employed by a given search engine. In addition to SEO, ‘paid search’ may be utilized to display an advertisement on the top pages of search results for a given search term or terms. SEO, paid search and other optimization strategies are typically engaged only by organizations due to their complexity and cost. Individuals have fewer options to achieve optimal placement in search results.
Google Profiles is one example of a mechanism individuals can utilize to offer information specific to themselves. Google Profiles does not influence search results, however, and individuals with even slightly common names often find themselves in a long list with others, eliminating the value of the feature. SEO, paid search, Google Profiles, and other similar optimization strategies are reactive in that they only influence but do not control what is returned in search results. These strategies are necessary because the natural search behavior of Internet users favors less sophisticated search queries, or because the Internet user simply does not possess the information necessary to build a complex search query that will allow the return of appropriately focused results. When companies, individuals or organizations with similar characteristics engage like optimization strategies, however, the differentiation gained from them diminishes and the value declines for them and Internet users alike.
Search engines employ machines (known as spiders or crawlers) that traverse Internet-accessible directories, web pages, and other information in order to determine the location and content of, and otherwise index, resources that are available electronically. One way that machines traverse these electronic resources is by following links from one resource to another. In some cases, it may be desirable to differentiate requests for resources generated by spiders from requests for resources generated by humans.
When a spider reaches a web site, it “crawls” through the links available at the site, following one link to another. For example, a home page or index can present a page that loads when a site's domain name (e.g., www.vizibility.com) is requested, and the content of that page can be crawled by following all the links present on that home page, and continuing to recurse further into subpages until all the linked pages have been viewed by the spider.
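The crawling behavior described above can be sketched, under simplifying assumptions, as a traversal of a link graph. In the following example the web site is modeled as an in-memory dictionary mapping each page to the links it contains; the names crawl and SITE and the page paths are hypothetical. The sketch uses a queue (breadth-first order) rather than literal recursion, and it tracks pages already seen so that circular links do not cause repeated visits.

```python
from collections import deque


def crawl(site, start):
    """Return the pages reachable from start, in the order they are visited.

    site maps each page to the list of pages it links to (an assumed,
    simplified stand-in for fetching and parsing real documents).
    """
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in site.get(page, []):
            # Follow each link only once, even if many pages point to it.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site structure: the home page "/" links to two subpages.
SITE = {
    "/": ["/about", "/products"],
    "/about": ["/", "/contact"],
    "/products": ["/products/a", "/"],
    "/products/a": [],
    "/contact": ["/"],
}
order = crawl(SITE, "/")
```

Starting from the home page, every linked page in this model is visited exactly once, which mirrors a spider continuing until all linked pages have been viewed.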
As such, a web site can attempt to detect a spider by observing how one or more machines associated with a given IP address interact with the links on the web site. For example, by detecting how quickly links are requested by the same IP address, non-human site navigation can be inferred. Given the heuristic nature of determining whether a given IP address is used by a spider, or shared by a group of people, or the like, further improvements to spider detection remain desirable.
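One possible form of the request-rate heuristic described above is sketched below. The function name flag_fast_clients, the thresholds max_requests and window_seconds, and the sample request log are all assumptions chosen for illustration; they are not values taken from any particular system. The sketch flags an IP address if it issues more than max_requests requests within any span of window_seconds.

```python
from collections import defaultdict


def flag_fast_clients(requests, max_requests=5, window_seconds=10.0):
    """Return the set of IP addresses whose request rate looks non-human.

    requests is an iterable of (ip_address, timestamp_in_seconds) pairs.
    The default thresholds are illustrative assumptions only.
    """
    by_ip = defaultdict(list)
    for ip, timestamp in requests:
        by_ip[ip].append(timestamp)

    flagged = set()
    for ip, times in by_ip.items():
        times.sort()
        start = 0
        for end, timestamp in enumerate(times):
            # Shrink the window until it spans at most window_seconds.
            while timestamp - times[start] > window_seconds:
                start += 1
            if end - start + 1 > max_requests:
                flagged.add(ip)
                break
    return flagged


# Hypothetical log: one address requests a link every half second,
# while another requests three links over forty-five seconds.
LOG = [("203.0.113.7", i * 0.5) for i in range(8)]
LOG += [("198.51.100.2", 0.0), ("198.51.100.2", 20.0), ("198.51.100.2", 45.0)]
suspects = flag_fast_clients(LOG)
```

Because the sketch keys only on the IP address, it is subject to the same limitation noted above: several people sharing one address could collectively exceed the threshold and be flagged as a spider.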