Bots, also known as web robots, spiders or web crawlers, are software applications that run automated tasks over the Internet. Typically, bots perform tasks that are both simple and structurally repetitive, at a much higher rate than would be possible for a human being. The largest use of bots is web crawling, in which an automated script fetches, analyzes and files information from web servers. Bots are used for many purposes, mainly browsing, mapping and indexing data; monitoring the behavior of sites; advertising; and commercial or academic research. In addition to the uses outlined above, bots may also be implemented where a response speed faster than that of a human is required (for example, gaming bots and auction-site robots) or, less commonly, where the emulation of human activity is required (for example, chat bots). Unfortunately, there are also malicious bots, such as spam bots that harvest email addresses from contact forms or guestbook pages; downloader programs that consume bandwidth by downloading entire web sites; web site scrapers that copy the content of web sites and re-use it without permission on automatically generated doorway pages; and custom crawlers tailored to specific websites in order to steal information (typically from index sites, classifieds and large database sites) or to post spam (typically on forums, web mail and social networks).
From a technical standpoint, bots can be divided into three main types. The first type is protocol-based bots. These bots continuously generate requests using a certain protocol (for example, HTTP or FTP) and receive responses, which are typically sent to a parser for analysis. These bots are simple and usually operate fast. They do not render the content they receive and hence have no browser capabilities. The second type is application bots, which are built on protocol-based bots but have more sophisticated parsing tools that render and interpret portions of the response (typically by having JavaScript capabilities). The third type is browser bots, which are browsers (for example, Internet Explorer or Firefox) or browser platforms (for example, WebKit) controlled by an automation script. Browser bots are thus operated mechanically rather than by a human user.
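The parsing stage of a protocol-based bot can be illustrated with a minimal sketch. It assumes the response body has already been fetched and simply extracts links from the raw HTML. The class name LinkExtractor, the function extract_links and the sample response are hypothetical names introduced only for this illustration. Because nothing is rendered and no script is executed, a link that would exist only after a browser runs the embedded JavaScript is not seen by the bot.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in raw HTML, without rendering."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags present in the raw markup are considered.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the links a non-rendering parser finds in html_text."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical response body: one static link, and one link that would be
# created only if a browser executed the script.
SAMPLE_RESPONSE = """
<html><body>
<a href="/static-page">Static</a>
<script>
  var a = document.createElement("a");
  a.href = "/script-only-page";
  document.body.appendChild(a);
</script>
</body></html>
"""

print(extract_links(SAMPLE_RESPONSE))
```

In this sketch only "/static-page" is found. An application bot or a browser bot, having JavaScript capabilities, could also discover the script-generated link.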
There have been many attempts to identify and filter out malicious bots, for example by analyzing log files, by analyzing the frequency of HTTP requests per IP address, or by using a CAPTCHA. A CAPTCHA (“Completely Automated Public Turing test to tell Computers and Humans Apart”) is a type of challenge-response test used in computing to ensure that the response is not generated by a computer. The process usually involves one computer (a server) asking a user to complete a simple test which the computer is able to generate and grade. Because other computers are assumed to be unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. Thus, it is sometimes described as a reverse Turing test, because it is administered by a machine and targeted at a human, in contrast to the standard Turing test, which is typically administered by a human and targeted at a machine. A common type of CAPTCHA requires the user to type letters or digits from a distorted image that appears on the screen.
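By way of illustration, the analysis of HTTP request frequency per IP address might be sketched as follows. The log format (IP address, timestamp in seconds, path), the function name flag_high_rate_ips and the threshold values are assumptions made for this example and are not taken from any particular known system. The sketch also assumes that the entries for each IP address appear in chronological order.

```python
from collections import defaultdict, deque


def flag_high_rate_ips(log_lines, max_requests=5, window_seconds=10.0):
    """Return the IPs that send more than max_requests within window_seconds."""
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for line in log_lines:
        # Hypothetical log format: "<ip> <timestamp_seconds> <path>"
        ip, timestamp, _path = line.split(maxsplit=2)
        ts = float(timestamp)
        window = recent[ip]
        window.append(ts)
        # Discard timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_requests:
            flagged.add(ip)
    return flagged


# Hypothetical log: one client sends 8 requests in under 4 seconds,
# another sends 3 requests spread over 40 seconds.
SAMPLE_LOG = [f"10.0.0.1 {i * 0.5} /item/{i}" for i in range(8)] + [
    "10.0.0.2 0.0 /",
    "10.0.0.2 20.0 /about",
    "10.0.0.2 40.0 /contact",
]

print(flag_high_rate_ips(SAMPLE_LOG))
```

A fixed threshold of this kind is simple to apply, but it flags an IP address as a whole and cannot tell apart several clients that share that address.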
CAPTCHAs are vulnerable to hackers, who can defeat them either with sophisticated custom-made OCR systems that recognize the distorted text or with a simple relay hack (a bot displays the CAPTCHA to a human user, who fills it in so that the bot can carry on its crawling activity). CAPTCHAs are typically presented to users only when a form is filled in, in order to avoid interrupting the flow of the web application; thus, any activity before or after filling in the form can easily be driven by a bot. Another approach to identifying bots is the use of honey pots or spider traps, which are normally web pages accessible only through links that are invisible to human users (e.g. white text on a white background). Such honey pots assume that anyone who browses these hidden pages is a bot. Honey pots are only useful for identifying generic bots such as email harvesters.
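A honey pot of the kind described above might be sketched as follows. The trap path, the inline style and the function names render_page and find_trapped_clients are hypothetical and serve only to illustrate the idea: a link that a human user would not see is added to each page, and any client that requests the trap URL is presumed to be a bot.

```python
# Hypothetical hidden URL that no human user is expected to visit.
TRAP_PATH = "/trap-7f3a"


def render_page(body_html):
    """Wrap body_html in a page that also carries an invisible trap link."""
    # White text on a white background: present in the markup that a
    # crawler parses, but not visible to a human user viewing the page.
    hidden_link = (
        f'<a href="{TRAP_PATH}" style="color:#fff;background:#fff">.</a>'
    )
    return f"<html><body>{body_html}{hidden_link}</body></html>"


def find_trapped_clients(requests):
    """Return the clients that requested the trap URL.

    requests is an iterable of (client_id, path) pairs.
    """
    return {client for client, path in requests if path == TRAP_PATH}


# Hypothetical request stream: the crawler follows every link it parses,
# including the hidden one; the human user follows only visible links.
SAMPLE_REQUESTS = [
    ("human-1", "/"),
    ("human-1", "/products"),
    ("crawler-1", "/"),
    ("crawler-1", "/products"),
    ("crawler-1", TRAP_PATH),
]

print(find_trapped_clients(SAMPLE_REQUESTS))
```

A custom crawler written for a specific site can simply be told to skip the trap URL, which is consistent with the observation that honey pots are useful only against generic bots.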
Unfortunately, known bot identification methods can identify suspicious activity only at the user IP level or at the session ID level. If a session is blocked, the bot can easily start another session (typically by deleting a cookie file), while if an IP address is blocked, legitimate users who try to access the site from the same IP address are blocked as well. Furthermore, these methods tend to yield too many false positives (human users falsely identified as bots) or, if applied too cautiously, too many false negatives (bots that go undetected).
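The trade-off between the two blocking levels can be illustrated with a small simulation. It is a sketch under stated assumptions only: the request records, the "suspicious" flag (standing in for the output of some detection method) and the function name replay are all hypothetical. In the sample data, a bot and a human user share one IP address, and the bot opens a new session each time its current session is blocked.

```python
from collections import Counter


def replay(requests, block_by):
    """Replay requests, blocking by the given key ("session" or "ip").

    Returns (served, denied) counters keyed by actor.
    """
    blocked = set()
    served = Counter()
    denied = Counter()
    for req in requests:
        key = req[block_by]
        if key in blocked:
            denied[req["actor"]] += 1
        elif req["suspicious"]:
            # The detector fires: block this key from now on.
            blocked.add(key)
            denied[req["actor"]] += 1
        else:
            served[req["actor"]] += 1
    return served, denied


def request(session, actor, suspicious=False):
    # All clients in this hypothetical example share the same IP address.
    return {"ip": "203.0.113.7", "session": session,
            "actor": actor, "suspicious": suspicious}


SAMPLE_REQUESTS = [
    request("b1", "bot"),
    request("b1", "bot", suspicious=True),
    request("b2", "bot"),                   # new session after the block
    request("h1", "human"),
    request("b2", "bot", suspicious=True),
    request("b3", "bot"),                   # new session again
    request("h1", "human"),
]

for level in ("session", "ip"):
    served, denied = replay(SAMPLE_REQUESTS, level)
    print(level, "served:", dict(served), "denied:", dict(denied))
```

With blocking by session, the bot in this example keeps being served after each new session, while with blocking by IP address the human user sharing that address is denied as well.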
There is thus a need in the art for a more efficient and reliable method for identifying bots and blocking them with less interruption to genuine human users.