There are many Internet or web-based services that need to distinguish between a human and a computer program interacting with the service. For example, many free e-mail services allow a user to create an e-mail account by merely entering some basic information. The user is then able to use the e-mail account to send and receive e-mails. This ease of establishing e-mail accounts has allowed spammers to produce computer programs that automatically create e-mail accounts with randomly generated account information and then employ the accounts to send out thousands of spam e-mails. Web services have increasingly employed Turing test challenges (commonly known as a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA™) or Human Interactive Proof (HIP)) in order to distinguish between a human and a computer as the user of the web service. The web service will only allow the user to employ the service after the user has passed the HIP.
The HIP is designed so that a computer program would have difficulty passing the test, but a human can more easily pass the test. All HIPs rely on some secret information that is known to the challenger but not to the user being challenged. HIPs or CAPTCHAs™ can be divided into two classes depending on the scope of this secret. In Class I CAPTCHAs™, the secret is merely a random number, which is fed into a publicly known algorithm to yield a challenge. Class II CAPTCHAs™ employ both a secret random input and a secret high-entropy database. A critical problem in building a Class II CAPTCHA™ is populating the database with a sufficiently large set of classified, high-entropy entries.
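The distinction between the two classes can be illustrated with a minimal sketch. It is not taken from any particular CAPTCHA™ implementation; the function names, the text-string challenge, and the dictionary standing in for the secret database are all assumptions made for illustration only.

```python
import random
import string


def class1_challenge(secret_seed: int, length: int = 6) -> str:
    """Class I (illustrative): the only secret is the random seed.

    The generation algorithm itself is public, so anyone who learned the
    seed could reproduce the expected answer.
    """
    rng = random.Random(secret_seed)
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def class2_challenge(secret_seed: int, secret_db: dict, k: int = 4):
    """Class II (illustrative): a secret seed AND a secret labeled database.

    secret_db is a hypothetical mapping of item identifier -> category.
    Returns the items shown to the user and the answers kept by the challenger.
    """
    rng = random.Random(secret_seed)
    picked = rng.sample(sorted(secret_db.items()), k)
    shown_items = [item for item, _ in picked]
    expected_labels = [label for _, label in picked]
    return shown_items, expected_labels
```

In this sketch the Class I answer depends only on the seed, whereas the Class II answer also depends on database contents that the user being challenged never sees in labeled form.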
Class I CAPTCHAs™ have many virtues. They can be concisely described in a small amount of software code; they have no long-term secret that requires guarding; and they can generate a practically unbounded set of unique challenges. On the other hand, their most common realization, a challenge to recognize distorted text, evinces a disturbingly narrow gap between human and nonhuman success rates. FIG. 2A shows an example of a simple Class I CAPTCHA™ displaying a random text string. The figure shows clearly segmented characters. Optical character recognition algorithms are competitive with humans in recognizing distinct characters, which has led researchers toward increasing the difficulty of segmenting an image into distinct character regions. FIGS. 2B through 2E show common ways in which Class I CAPTCHAs™ are modified in an attempt to make it more difficult for a computer program to correctly recognize the characters. However, this increase in difficulty affects humans as well. The owners of web services must be careful not to make the challenge so difficult that it discourages real human users from expending the effort to use their service. Even relatively simple challenges can drive away a substantial number of potential customers.
Class II CAPTCHAs™ have the potential to overcome the main weaknesses described above. Because they are not restricted to challenges that can be generated by a low-entropy algorithm, they can exercise a much broader range of human ability, such as recognizing features of photographic images captured from the physical world. Such challenges evince a broad gulf between human and non-human success rates, not only because general machine vision is a much harder problem than text recognition, but also because image-based challenges can be made less bothersome to humans without drastically degrading their efficacy at blocking automatons.
A significant issue in building a Class II CAPTCHA™ is populating the secret database. Existing approaches take one of two directions: (a) mining a public database or (b) providing entertainment as an incentive for manual image categorization. A problem with these approaches is that the public source of categorized images is small or available to attackers. Therefore, a small, fixed amount of manual effort spent reconstructing the private database can return the ability to solve an unbounded number of challenges. There is a need to make available to the CAPTCHA™ a private database of manually categorized images that is both substantially accurate and sufficiently large to make it cost prohibitive for an entity attempting to automate a computer program for passing the challenge to reconstruct all or a significant portion of the categorized image database.
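The economics of this attack can be sketched with simple arithmetic. The cost model below is an assumption made for illustration (a fixed labeling cost per database entry, and an attacker who can answer every challenge once the database has been reconstructed); the numeric values in the usage example are hypothetical and are not measurements.

```python
def amortized_attack_cost(db_entries: int,
                          cost_per_label: float,
                          challenges_solved: int) -> float:
    """Attacker's cost per solved challenge under the assumed model.

    The one-time cost of reconstructing the database
    (db_entries * cost_per_label) is spread over every challenge the
    attacker goes on to solve.
    """
    if challenges_solved <= 0:
        raise ValueError("challenges_solved must be positive")
    reconstruction_cost = db_entries * cost_per_label
    return reconstruction_cost / challenges_solved


# Hypothetical comparison: a small database versus a much larger one,
# each attacked to solve the same number of challenges.
small_db = amortized_attack_cost(1_000, 0.01, 1_000_000)
large_db = amortized_attack_cost(1_000_000, 0.01, 1_000_000)
```

Under this model the per-challenge cost falls toward zero as the number of solved challenges grows, and it rises in proportion to the number of database entries, which is why the size of the private database matters.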