With the popularity of the Internet, Web services such as e-commerce, e-mail, and resource downloads, many of which are free to access, are increasingly becoming part of everyday life. However, these services, which are intended for human users, are often abused or attacked by unauthorized users and malicious computer programs. These unauthorized activities take up service resources, generate large amounts of Web junk and spam, degrade the user experience, and threaten the security of Web services.
Techniques exist for distinguishing humans from machines in order to reduce attacks by computers. An example of such techniques is CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart), a security measure that uses challenge-response tests to tell a human from a computer. In CAPTCHA, a server computer generates the challenges (tests) and evaluates the responses. When a user attempts to use a Web service that requires authentication, the server provides the user with a challenge; the user responds to the challenge by submitting a response to the server; and the server assesses, according to the response, whether the user has met the challenge.
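The challenge-response exchange described above can be sketched as follows. This is a minimal illustration under stated assumptions, not any particular service's implementation: the class name CaptchaServer, its method names, and the arithmetic question used as a stand-in challenge are all hypothetical, and a real CAPTCHA would use a test that is hard for machines.

```python
import hmac
import secrets


class CaptchaServer:
    """Toy in-memory challenge-response server (hypothetical, for illustration)."""

    def __init__(self):
        # Maps a challenge id to the answer the server expects.
        self._pending = {}

    def issue_challenge(self):
        """Generate a challenge and remember its expected answer."""
        a, b = secrets.randbelow(10), secrets.randbelow(10)
        challenge_id = secrets.token_hex(8)
        self._pending[challenge_id] = str(a + b)
        # Placeholder challenge: a real CAPTCHA would be hard for a machine.
        return challenge_id, f"What is {a} + {b}?"

    def verify(self, challenge_id, response):
        """Assess the response; each challenge can be answered only once."""
        expected = self._pending.pop(challenge_id, None)
        if expected is None:
            return False  # unknown or already used challenge
        return hmac.compare_digest(expected.encode(), response.strip().encode())
```

Removing the challenge from the pending table on the first attempt means a program cannot keep guessing against the same challenge.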
The current CAPTCHA techniques primarily include text CAPTCHA, image CAPTCHA, and sound CAPTCHA. These techniques are based on different issues identified in the field of AI, and have different characteristics.
Text CAPTCHA takes advantage of the differences between human and machine recognition of textual characters, and uses verification codes to distinguish between human and machine. A verification code used in text CAPTCHA may be a distorted picture created from a randomly generated string of numbers or symbols. Interference pixels are often added to the distorted picture to prevent optical character recognition. A human user visually recognizes the verification code contained in the distorted picture and submits it, and is allowed to use the service if the submitted verification code is correct. Examples of text CAPTCHA include the website of CSDN (Chinese Software Developer Network), which uses GIF format+numbers for user logon; the website of QQ, which uses randomly generated characters for website registration and PNG format images with random numbers+random capital letters for logon; MSN and Hotmail, which use BMP format with random numbers+random capital letters+random interference+random bits for new account registration; Google Gmail, which uses JPG format with random numbers+random colors+random lengths+random positions for new account registration; and certain large forums, which have adopted XBM format with random content.
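Two of the steps described above, generating a random verification code and adding interference pixels, can be sketched as follows. This is a simplified illustration, not the method of any site named above: the function names, the alphabet of digits and capital letters, and the representation of the picture as a grid of 0/1 pixels are assumptions, and rendering and distorting the characters is omitted.

```python
import random
import string

# Assumed alphabet: random numbers plus random capital letters.
ALPHABET = string.digits + string.ascii_uppercase


def generate_code(length=5, rng=None):
    """Return a randomly generated verification code."""
    rng = rng or random.SystemRandom()
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def add_interference(bitmap, density=0.05, rng=None):
    """Return a copy of a 0/1 bitmap with roughly `density` of its pixels flipped."""
    rng = rng or random.SystemRandom()
    return [
        [1 - pixel if rng.random() < density else pixel for pixel in row]
        for row in bitmap
    ]
```

A higher density makes optical character recognition harder, but it also makes the picture harder for a human to read.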
Image CAPTCHA takes advantage of the differences between humans and machines in image classification, object identification, commonsense understanding and other aspects. Image CAPTCHA is usually language-independent, requires no text input from the user, and is harder to crack. One example of image CAPTCHA is CAPTCHA BONGO, designed by Carnegie Mellon University, which uses two types of visual model (such as lines, circles, boxes, etc.) and asks the user to determine the type of a new model. However, because the user selects one of only two options, this design cannot guarantee security. Another example is CAPTCHA using an annotated image database. A weakness of this algorithm is that once the database is leaked, the algorithm's security collapses. Google's What's Up app uses a CAPTCHA algorithm that is based on image orientation recognition. An image is rotated perpendicular to its original orientation. What's Up requires no image annotation. It continues to add candidate images during the tests, and uses user feedback to correct initial annotations. Furthermore, What's Up has an automatic image orientation filter that is trained to detect and filter out images that are machine recognizable, and also to detect and filter out images that are difficult for humans to recognize, to ensure that the test can be passed by most human users but not by machines. Compared with CAPTCHA based on image classification, What's Up challenges users with more difficult image-understanding questions and requires the user to analyze the content of the image. The technique has a very large base of usable images that are not limited to specific items, and its automatic image annotation based on user feedback also reduces tedious human intervention.
Sound CAPTCHA takes advantage of the differences between human and machine speech recognition. The technique plays, at random intervals, one or more human-spoken numbers, letters or words, and adds background noise to resist ASR (Automatic Speech Recognition) attacks. For example, in sound CAPTCHA BYAN-I and BYAN-II, the user is prompted to select a preferred language, while the computer randomly selects six numbers, generates test audio accordingly, and adds another speaker's voice as background noise. The user is prompted to enter the six numbers recognized in the audio. In BYAN-I, the background noise is the same six numbers spoken in a different language, while in BYAN-II the background noise is the sound of randomly selected numbers or words.
The current mainstream CAPTCHA techniques, while capable of preventing some abuse of Web services by malicious computer programs, are vulnerable to a variety of attacks, are easy to crack, and result in poor user experience.
More specifically, although text CAPTCHA that distorts the text does to a degree prevent a malicious computer program from registering or logging on, advances in character segmentation and optical character recognition (OCR) have cracked most text CAPTCHA algorithms. CAPTCHA algorithms based on simple character recognition challenges are no longer able to stop computer programs. In addition, distorted text images are difficult for humans to recognize, and result in poor user experience.
Image CAPTCHA takes advantage of the differences between humans and machines in image classification, object identification, commonsense understanding and other aspects. This technique is usually language-independent, requires no text input from the user, and is more difficult to crack than text CAPTCHA. However, image CAPTCHA requires extensive database support, is difficult to produce on a large scale, and is also vulnerable to attacks by machine learning algorithms. For example, Golle designed an SVM (Support Vector Machine) classifier that combines color and texture features to classify images of cats and dogs, and achieved a success rate of 82.7% on single images and a success rate of up to 10.3% on Asirra challenges, each of which contains 12 images.
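The two success rates quoted above are related by simple arithmetic. If each of the 12 images is assumed to be classified independently with the same 82.7% accuracy (a simplifying assumption made here, not a statement of how the reported figure was measured), the chance of getting the whole challenge right is the single-image accuracy raised to the 12th power. The variable names below are illustrative.

```python
# Figures taken from the text above.
single_image_accuracy = 0.827
images_per_challenge = 12

# Assumption: the 12 images are classified independently of one another.
challenge_success = single_image_accuracy ** images_per_challenge

# Baseline: guessing cat or dog at random for every image.
random_guess_success = 0.5 ** images_per_challenge

print(f"classifier:   {challenge_success:.1%}")
print(f"random guess: {random_guess_success:.3%}")
```

Under this assumption the estimate comes out at roughly 10%, in line with the reported figure, whereas random guessing succeeds about once in 4096 attempts.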
Sound CAPTCHA is equally susceptible to attacks by machine learning algorithms. Tam et al. use a fixed-length audio search window to identify energy peaks for recognition, extract three kinds of audio features therefrom, namely Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP), and Relative Spectral Transform Perceptual Linear Prediction (RASTA-PLP), and train three machine learning algorithms (AdaBoost, SVM, and k-NN) on these features. The method achieved success rates of 67%, 71% and 45% on Google, Digg and ReCAPTCHA, respectively. Similar methods were also used to crack eBay's voice CAPTCHA, achieving a success rate of up to 75%.