CAPTCHA is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart.” A CAPTCHA is a challenge-response test used to determine whether a user is a human or a computer. Such tests are in common use on the World Wide Web and often take the form of images containing distorted text. CAPTCHAs are used to protect many types of services, including e-mail services, ticket selling services, social networks, wikis, and blogs. They are frequently found at the bottom of Web registration forms and are used, for example, by Hotmail, Yahoo, Gmail, MSN Mail, PayPal, TicketMaster, the United States Patent and Trademark Office, and many other popular Web sites to prevent automated abuse (e.g., programs that are written to obtain many free e-mail accounts every day). CAPTCHAs are effective because computer programs are unable to read distorted text as well as humans can. In general, CAPTCHAs prompt users to prove they are human by typing the letters, numbers, and other symbols corresponding to the wavy characters presented in the image.
However, prior art CAPTCHAs have certain drawbacks. In particular, the images used in prior art CAPTCHAs are artificially created specifically for use as CAPTCHAs, and they are not always well chosen to distinguish between human and non-human users. As a result, spammers and others attempting to circumvent prior art CAPTCHA systems are becoming increasingly efficient at using computers to answer prior art CAPTCHAs correctly. Accordingly, there is a need for a more effective way to produce CAPTCHAs that are difficult for computers to answer and yet reasonably easy for humans to answer.
Humans around the world solve over 60 million CAPTCHAs every day, in each case spending roughly ten seconds to type the distorted characters. In aggregate, this amounts to over 150,000 human hours each day. This work is tremendously valuable and, almost by definition, it cannot be done by computers. At present, however, prior art CAPTCHAs do not put this work to any useful end other than restricting access to human users. As a result, there is a need for making more efficient use of the considerable time that is collectively spent solving CAPTCHAs.
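The aggregate figure above can be checked with a short calculation. The following is a minimal sketch that uses only the two rough figures quoted in the text (60 million CAPTCHAs per day and ten seconds each); the function and constant names are illustrative and are not part of any described system.

```python
# Rough figures quoted in the text; both are approximations.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def daily_human_hours(captchas_per_day: int, seconds_each: float) -> float:
    """Total human time spent per day, converted from seconds to hours."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


hours = daily_human_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} human hours per day")  # prints "166,667 human hours per day"
```

With these inputs the result is roughly 166,667 hours per day, which is consistent with the statement that the total exceeds 150,000 human hours.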
Furthermore, physical books and texts that were written before the computer age are currently being digitized en masse (e.g., by the Google Books Project and the Internet Archive) in order to preserve human knowledge and to make information more accessible to the world. The pages are photographically scanned into image form and then transformed into text using optical character recognition (“OCR”). The transformation from images into text by OCR is useful because images are difficult to store on small devices, are expensive to download, and cannot be easily searched. However, one of the biggest stumbling blocks in this digitization process is that OCR is far from perfect at deciphering the words in images of scanned texts. For older prints, where the ink has faded, the pages have turned yellow, or other imperfections exist on the paper, OCR fails to recognize approximately 20% of the words. In contrast to computers, humans are significantly more accurate at transcribing such print. A single human can achieve over 95% accuracy at the word level. Two humans using the “key and verify” technique, in which each types the text independently and the two transcriptions are then compared to identify any discrepancies, can achieve over 99.5% accuracy at the word level (errors are not fully independent across multiple humans). Unfortunately, human transcribers are expensive, so only documents of extreme importance are manually transcribed.
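The comparison step of the “key and verify” technique described above might be sketched as follows. This is an illustrative sketch only, not a description of any particular transcription service: the function name key_and_verify is hypothetical, and the sketch assumes that the two transcriptions have already been aligned so that they contain the same number of words.

```python
from typing import List, Optional, Tuple


def key_and_verify(
    first: List[str], second: List[str]
) -> Tuple[List[Optional[str]], List[int]]:
    """Compare two independently keyed transcriptions word by word.

    Returns the words on which both typists agree (None where they differ)
    and the positions of the discrepancies that need further review.
    Assumption: both inputs are aligned and have the same word count.
    """
    if len(first) != len(second):
        raise ValueError("transcriptions must contain the same number of words")

    agreed: List[Optional[str]] = []
    discrepancies: List[int] = []
    for index, (a, b) in enumerate(zip(first, second)):
        if a == b:
            agreed.append(a)            # both typists keyed the same word
        else:
            agreed.append(None)         # disagreement: leave unresolved
            discrepancies.append(index)
    return agreed, discrepancies


typist_one = "the quick brown fox".split()
typist_two = "the quick brovvn fox".split()
agreed, discrepancies = key_and_verify(typist_one, typist_two)
print(agreed)         # ['the', 'quick', None, 'fox']
print(discrepancies)  # [2]
```

An error goes undetected in this scheme only when both typists key the same wrong word, which is why correlated errors between humans limit the accuracy that the technique can reach.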
Accordingly, there is a need for improved methods and apparatuses related to CAPTCHAs and, particularly, for methods and apparatuses related to CAPTCHAs that offer advantages beyond controlling access to computer systems, such as for cost-effectively transforming written text into electronic form that can be stored and searched efficiently. Those and other advantages of the present invention will be described in more detail hereinbelow.