Many networks, including wide area and local area networks, include one or more servers which provide access to web pages.
The Internet, for example, is a vast wide area network including a large number of servers which host a massive number of web pages. Various different services are provided which catalogue the information which is available on these web pages. For example, a search engine service must gather information from web pages in order to respond to search requests from a user.
Web crawlers are commonly used in many different systems to gather information from web pages and to deliver this information to a cataloguing module which records information about the content of the web pages in association with one or more identifiers for the web page. The one or more identifiers may include, for example, a title, a website identifier (the web page being associated with a website), or a URL (uniform resource locator).
One or more of the web pages which are accessed by the web crawler may be an illegitimate web page. An illegitimate web page may be a web page which seeks to improve its own standing, or the standing of another web page, in search results of a search engine service. In other words, the illegitimate web page may attempt to take advantage of the mechanisms used by search engine services in the ranking of web pages in sets of search results.
This is often referred to as ‘link spam’ and may include the use of ‘link farms’ in which the illegitimate web page is linked to or includes links to one or more other illegitimate or legitimate web pages with the main purpose of boosting a particular website or web page in the results produced by a search engine.
A web page may be considered to be an illegitimate web page for other reasons too.
For example, the web page may be attempting to mimic another web page with a view to tricking a user into entering a password, a username, or the like, which the operator of the illegitimate web page will gather and then use to access the corresponding legitimate web page illicitly. This is commonly known as ‘phishing’.
Other illegitimate web pages may be configured to upload one or more illicit computer programs to the user's computer when the user accesses the web page using their computer.
Other illegitimate web pages may, for example, include information which is illegal or allows a user to infringe the intellectual property rights of another.
There is a desire for web crawlers to be able to identify such web pages. In the example of a search engine service, the service providers may want to avoid the listing of an illegitimate web page in the search results which are provided as a result of a user search request or may want to relegate the listing of a potentially illegitimate web page in such search results relative to the listings of other web pages.
A web crawler will typically establish a communication link with a server which hosts a website. The web crawler may issue, over the communication link, a request for a response file which is, by convention, hosted by the server with the filename “robots.txt” (each website hosted by a server may have a different such response file). This file includes information which directs the web crawler regarding how the web crawler should access the one or more web pages hosted by the server as part of the website. The inventors have devised methods and systems by which such a file can be used advantageously in the detection of potentially illegitimate websites.
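By way of illustration only, the conventional handling of such a response file by a web crawler may be sketched as follows. The sketch uses the Python standard library module urllib.robotparser; the sample file contents, the crawler name "ExampleCrawler", the host example.com, and the helper names robots_url and load_rules are assumptions made for the example and do not form part of the description above. The sketch shows only how access directives may be read from the file, not the detection methods devised by the inventors.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(site_root: str) -> str:
    """Return the conventional location of the response file for a website."""
    return urljoin(site_root, "/robots.txt")


def load_rules(robots_text: str, site_root: str) -> RobotFileParser:
    """Parse the text of a response file that has already been retrieved.

    A live crawler could instead call parser.read() to fetch the file
    over the network; parsing supplied text keeps this sketch offline.
    """
    parser = RobotFileParser(robots_url(site_root))
    parser.parse(robots_text.splitlines())
    return parser


# Hypothetical response file contents, assumed for illustration.
SAMPLE = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = load_rules(SAMPLE, "https://example.com/")

# Ask whether the (hypothetical) crawler may access particular web pages.
allowed_public = rules.can_fetch("ExampleCrawler", "https://example.com/index.html")
allowed_private = rules.can_fetch("ExampleCrawler", "https://example.com/private/a.html")

# Ask how long the crawler is asked to wait between successive requests.
delay = rules.crawl_delay("ExampleCrawler")

print(allowed_public, allowed_private, delay)
```

In this sketch, the directives in the file apply to every crawler (the "*" user agent), so the hypothetical crawler is permitted to access the index page, is directed away from pages under /private/, and is asked to wait between successive requests.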
Embodiments of the present invention seek to ameliorate one or more problems associated with the prior art.