One of the great advantages of the internet is its open architecture, which allows connections to be formed between nodes in the network that have no previous relationship. In this way, for example, a retailer may provide a web site that is accessible to all potential customers globally, without first having to subject the customer to any verification process or establish any physical connection.
The openness of the internet, however, inevitably provides opportunities for malicious agents to conceal their true nature in order to subvert the legitimate use of web resources. Where this relates to the efforts by unauthorised parties to gain access to sensitive information, it is well known to hide that information behind passwords and other security measures. However, there are circumstances in which providers do not wish to conceal information or functionality from the public in general, but only wish to stop misuse.
For example, one technique for monetising user traffic on the internet is CPM (cost per thousand impressions) display advertising. In this arrangement, an advertiser's display advertisement is placed in a web page, and each time a visitor requests that page, a fee is paid to the website owner. As such, the more times the web page comprising a display advertisement is requested, the greater the fee that is paid to the website owner.
The premise of the CPM advertising model is that each webpage request is legitimately that of a potential customer of the advertiser. The model breaks down if the requests for webpages comprising display advertisements are carried out with any other purpose. Nevertheless, there may be an incentive for the website owner, or indeed the advertiser's competitors, to perform requests for the purpose of causing the advertiser to pay a fee. Requesting an advertiser's advertisements for this purpose is known as “impression fraud”.
The most common manner in which impression fraud is carried out is by the use of automated programs, often called online robots or “bots”. These bots are malicious agents designed to automatically carry out tasks, such as requesting webpages comprising advertisers' display advertisements, so as to frustrate the intended purpose of the CPM advertising system.
Bots are also used to carry out a number of other malicious activities. For example, bots can be used to overwhelm web sites which provide information to legitimate users by submitting excessive requests. In other circumstances, bots have been used to inject marketing spam into comment sections of web pages. Bots have also been used for content gathering, such as content theft for unauthorised republication or content retrieval for competitive intelligence (for example, retrieving information such as product lists, client lists, pricing details and so on).
A notable feature of bots is that they are not tied to particular devices, and may therefore operate from a variety of sources. For example, while a bot may operate on a device owned by the bot owner, it may also operate on machines rented by the bot owner or even on the machines of legitimate users who are unaware of the bot's presence. In this latter example, the bot may spread to legitimate users' devices in the manner of a computer virus. This variety of sources for bot activities adds to the difficulty of detecting and isolating them, and can directly inconvenience legitimate users, who may find their devices operating sub-optimally due to the presence of a bot.
In order to counter the problems associated with bots, they must first be identified. There is therefore a need to distinguish between the activities of bots and those of legitimate users in a reliable way.
United States Patent Application US 2007/0255821 proposes three techniques for identifying bots, particularly in the context of click fraud. According to this document, one approach is to check at the end of each 24-hour period whether the number of occurrences of a particular logged parameter (for example, the IP address reported in an HTTP header) associated with a resource request over the 24-hour period exceeds some threshold. A second approach is to pick out at the end of each 24-hour period those resource requests for which no client-side events (e.g. a JavaScript-tracked mouse movement) have been logged. A third approach is to check whether particular parameters associated with a resource request (IP address, referrer URL and User Agent) may be found in a database of previously detected fraudulent requests (where this database is updated once every 24 hours using the previous two methods).
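By way of illustration only, the three approaches summarised above might be sketched as follows. This is a minimal sketch and not the implementation disclosed in the cited application: the `Request` record, the function names, the threshold value and the use of an in-memory set as the "database" are all assumptions made for the example, as is the choice to match on all three parameters together in the third approach.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """Hypothetical log record for one resource request."""
    ip: str
    referrer: str
    user_agent: str
    client_events: int  # count of client-side events logged (e.g. mouse moves)


def flag_by_frequency(requests, threshold):
    """Approach 1: flag requests whose IP occurs more than `threshold` times."""
    counts = Counter(r.ip for r in requests)
    return [r for r in requests if counts[r.ip] > threshold]


def flag_without_client_events(requests):
    """Approach 2: flag requests for which no client-side event was logged."""
    return [r for r in requests if r.client_events == 0]


def signature(r):
    """Parameters compared against the database (assumed: all three together)."""
    return (r.ip, r.referrer, r.user_agent)


def end_of_day_update(requests, threshold, known_fraud):
    """Run approaches 1 and 2 over a 24-hour log and update the database."""
    flagged = flag_by_frequency(requests, threshold)
    flagged += flag_without_client_events(requests)
    known_fraud.update(signature(r) for r in flagged)
    return known_fraud


def is_known_fraud(r, known_fraud):
    """Approach 3: look the request's parameters up in the database."""
    return signature(r) in known_fraud


# Hypothetical 24-hour log: three identical requests with no client-side
# events, and one request accompanied by client-side events.
day_log = [
    Request("10.0.0.1", "http://a.example", "UA-1", 0),
    Request("10.0.0.1", "http://a.example", "UA-1", 0),
    Request("10.0.0.1", "http://a.example", "UA-1", 0),
    Request("10.0.0.2", "http://b.example", "UA-2", 5),
]
known = end_of_day_update(day_log, threshold=2, known_fraud=set())
```

In this sketch the database is only populated when `end_of_day_update` runs, so a request made earlier in the same period cannot be matched by `is_known_fraud` until the period has ended.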
Although the techniques described in US 2007/0255821 and others like it have some efficacy, they are unable to provide accurate classifications of client devices at sufficient speed. For example, the first two approaches described in US 2007/0255821 only make classifications at the end of each 24-hour period, meaning that until that point a bot may continue its activities unhindered. On the other hand, if the period described in US 2007/0255821 were reduced, this would reduce the accuracy of the classification.