The Problem
In his 1950 paper Computing Machinery and Intelligence [1], Alan Turing proposed his now-famous test, in which a computer is said to be thinking if it can win a game in which a human judge attempts to distinguish between human and mechanical interlocutors.
However, it has become apparent over time that the inverse of that question is more pressing: can a machine distinguish between human operators and other machines?
The reason is that commercial and social networking applications on the Internet are increasingly plagued by unscrupulous marketers and opportunists who use software to exploit interfaces intended for human users, flooding websites, online forums and mail servers with unsolicited marketing. Worse yet, criminals exploit weaknesses in human interfaces to capture data for fraudulent purposes.
If a person is limited to interacting with a computer system by physically typing requests, the amount of data he can gather and the amount of damage he can do are limited; but with the aid of malicious software, a single operator can flood a network with millions of spam messages, or make thousands of requests for data in just a few seconds.
It turns out that limiting human interfaces to human operators is a critical task, and a substantial amount of intellectual property has been devoted to this problem—especially in the past few years. The so-called “Reverse Turing Test” has become an important problem for software developers.
The problem is that none of the current technologies is completely effective. Automated programs created by spammers have proven to be as much as 35% [2] effective when deployed against commercial solutions like Microsoft's Live Mail and Google's Gmail service.
Most of the research so far has focused on the mechanical aspects of how human beings recognize images, and a lot of effort has gone into discovering ways to distort images so they are still human-recognizable, but are computationally expensive for machines to resolve.
The standard “Captcha”, or reverse Turing test, uses a sequence of glyphs (letters and numbers) that have been run together, warped, struck through with lines, or otherwise altered to make them difficult to isolate and classify.
For their part, spam marketers and other agents who want to break live person verification systems have been developing technology to break down the job of recognition into three steps: preprocessing and noise reduction; segmentation; and classification.
The problem with using simple glyphs like letters and numbers is that there aren't many of them in regular use by humans (for practical purposes they're pretty much limited to the characters on a typical computer keyboard), and in order to be recognizable at all, they must obey basic rules with regard to silhouette. This means that if you distort the glyphs enough that they can't readily be classified with software, human readers likely won't be able to recognize them either.
Some developers have attempted to use shape or image recognition instead of glyphs as a reverse Turing test. For example, Microsoft's Asirra uses a database of pet images provided in partnership with Petfinder.com. Users are asked to separate cats from dogs in a list of photographs.
Here again, there's a problem. Spam marketers who wish to break image recognition tests have demonstrated that they can simply enlist human agents to collect and classify images from very large databases in a surprisingly short time. From that point on, it's simply a matter of digital “grunt work” to compare known images with those presented by a reverse Turing test. This is the kind of work that computers excel at.
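The “grunt work” described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the byte strings are placeholders for real image files, the names fingerprint, known_images and classify are hypothetical, and the exact-match comparison shown here is the simplest possible one (a real attacker might well use a more tolerant comparison).

```python
import hashlib

def fingerprint(image_bytes: bytes) -> str:
    """Exact-match fingerprint: identical bytes always give an identical digest."""
    return hashlib.sha256(image_bytes).hexdigest()

# Hypothetical data: each image was labeled once by a human agent.
known_images = {
    fingerprint(b"<bytes of cat photo 1>"): "cat",
    fingerprint(b"<bytes of dog photo 7>"): "dog",
}

def classify(image_bytes: bytes):
    """Return the human-supplied label, or None if the image has not been seen."""
    return known_images.get(fingerprint(image_bytes))
```

Once the table is built, every later lookup costs the machine almost nothing, which is why a fixed image database offers little protection.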
Systems that use shape recognition as a reverse Turing test can be broken by a similar process and with even less effort, since you generally have to use a restricted range of simple silhouettes that won't confuse human users.
The fact is, computers have become so powerful and inexpensive that you can't rely on computational expense to protect computer networks from machine agents.
An Epistemological Approach
Curiously, most of the research I have read in this field is related to the mechanical process of how people see—how they isolate shapes from the background, and segment them into individual objects.
There seems to be a surprising lack of epistemological curiosity as to how it is that humans know what a thing is once they have perceived it. Machines can be trained to perceive things. For many academics, the jury is still out as to whether they can ever know things.
For my part, I don't believe they can. A computer is a remarkably simple machine that inhabits an entirely pragmatic and platonic universe: it can only recognize a thing by comparing it against the same thing. Otherwise, it can only compare similarities.
You can use a machine to compare apples to oranges, but to a computer, an apple can only be said to be an apple if it's the same apple you started with. Only human beings can encompass the idea of an apple.
In other words, human beings recognize objects as ideas. More importantly, they can just as quickly grasp a whole host of associations between ideas that are unpredictable, in some cases illogical—and always human.
It is these semantic associations that tell us, for example, that a shabby, comfortable chair belongs at a cheerful fireside, while a sleek plastic office chair does not.
I believe that in the long run, the only truly successful test for a human presence on a computer system will be one that exploits the semantic and symbolic associations that a human being can make, and will always try to make, in any random collection of objects, and that a machine by definition cannot.
To be successful, a reverse Turing test can only be composed or created by a human agent, although it can be administered by a machine.
The Proposed Test
What I propose in this invention is a system where a computer will assemble a visual test out of associations created in advance by human operators. Essentially, there are two variations on the test: one is to find two or more objects in an apparently random collection that should go together. In the other variation, the subject has to find the object that doesn't belong—much like the old association game on the PBS television program, Sesame Street.
Because of the arbitrary fashion in which humans associate things, a relatively small database of images can result in thousands of matches—often incorporating the same objects in different ways. For example, consider the following objects: dog, boy, steak, frying pan, fish, baseball bat, baseball, table, and chair.
The dog is compatible with the boy, the ball, the steak, and possibly the fish, but not the table or the frying pan. The steak and the fish are compatible with the frying pan, and possibly the table, but the table is more compatible with the chair.
Humans will naturally associate images that have the strongest association, so if they are asked to match the table with any of the other objects, they will almost always choose the chair. After all, you almost always sit on a chair when at a table, but the steak and the fish are confusing. A human being will cast about looking for a plate and possibly a knife and fork.
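As a rough illustration of the example above, the pairings can be written down as a small table of association strengths, from which the expected answer is simply the strongest match. The numeric strengths are invented for this sketch (the text only says which pairings are stronger or weaker), and the names ASSOCIATIONS, strength and strongest_match are hypothetical.

```python
# Illustrative strengths between 0.0 and 1.0, assigned by hand for this sketch.
# In the proposed system these judgments would come from human operators.
ASSOCIATIONS = {
    frozenset({"table", "chair"}): 0.9,
    frozenset({"table", "steak"}): 0.4,
    frozenset({"table", "fish"}): 0.4,
    frozenset({"dog", "boy"}): 0.9,
    frozenset({"dog", "baseball"}): 0.7,
    frozenset({"dog", "steak"}): 0.6,
    frozenset({"dog", "fish"}): 0.3,
    frozenset({"steak", "frying pan"}): 0.8,
    frozenset({"fish", "frying pan"}): 0.8,
}

def strength(a: str, b: str) -> float:
    """Association strength of a pair; unlisted pairs count as unrelated."""
    return ASSOCIATIONS.get(frozenset({a, b}), 0.0)

def strongest_match(target: str, candidates: list) -> str:
    """Pick the candidate most strongly associated with the target."""
    return max(candidates, key=lambda c: strength(target, c))

# The table could go with the steak or the fish, but the chair wins.
answer = strongest_match("table", ["steak", "fish", "chair", "dog"])
```

The point of the sketch is that the machine only looks up judgments a human has already made; it does not produce them.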
This is because humans instinctively organize objects in collections. A machine has no way of making the arbitrary associations that allow humans to collect objects that often have no immediate and discernible qualities in common.
Subtle differences in objects can affect their association as well. It makes sense to associate a boy and his dog, but it makes more sense to the person taking the test if the dog is a beagle than it does if the dog is a pit bull terrier.
How it Would Work
We can create a test that can be assembled and administered by a machine, but only if the essential semantic associations that it is based on are first created by human operators. The test would be assembled from photo objects, each of which would be associated with metadata recorded by human operators.
Semantically, we tend to classify an object in three ways: qualitatively, in terms of its own properties (is it soft, or hard, or shiny?); functionally, in terms of what it does; and emotively, in terms of its emotive context (how does it make you feel?).
Each image would be represented in a database with three sets of metadata, consisting of keyword tags that describe the emotive, qualitative, and functional properties of the object. And (this is the important part) the metadata would have to be created by human operators, who would describe the objects in the images in human terms.
The test could then be assembled by an artificial intelligence methodology that simply weighted sets of images based on the correspondence of metadata in each of the three categories. The test would be effectively tunable in terms of “fuzziness” (based on the broadness of the correspondence of keywords over the categories) and difficulty (by simply forcing users to differentiate between matches where there are points of correspondence between all of the images).
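One way the weighting described above might look is sketched below. Everything specific here is an assumption made for illustration: the PhotoObject record, the use of Jaccard overlap as the measure of correspondence, the per-category weights as a tuning knob, and the sample tags, which stand in for metadata that human operators would supply.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class PhotoObject:
    name: str
    qualitative: frozenset  # what the object is like
    functional: frozenset   # what the object does
    emotive: frozenset      # how the object makes a person feel

def jaccard(a: frozenset, b: frozenset) -> float:
    """Share of tags the two sets have in common (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def correspondence(x, y, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted average of tag overlap across the three categories."""
    scores = (jaccard(x.qualitative, y.qualitative),
              jaccard(x.functional, y.functional),
              jaccard(x.emotive, y.emotive))
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def best_pair(objects):
    """'Which two go together?': the pair with the highest correspondence."""
    return max(combinations(objects, 2), key=lambda p: correspondence(*p))

def odd_one_out(objects):
    """'Which one doesn't belong?': the object least related to the rest."""
    return min(objects, key=lambda o: sum(
        correspondence(o, other) for other in objects if other is not o))

# Hypothetical human-authored tags.
armchair = PhotoObject("armchair", frozenset({"soft", "shabby"}),
                       frozenset({"sitting", "resting"}), frozenset({"cozy", "warm"}))
fireplace = PhotoObject("fireplace", frozenset({"hot", "bright"}),
                        frozenset({"heating"}), frozenset({"cozy", "warm"}))
office_chair = PhotoObject("office chair", frozenset({"sleek", "plastic"}),
                           frozenset({"sitting", "working"}), frozenset({"impersonal"}))

objects = [armchair, fireplace, office_chair]
pair = best_pair(objects)        # armchair and fireplace share the emotive tags
outlier = odd_one_out(objects)   # the office chair is least related to the rest
```

Note that the two chairs overlap functionally, yet the armchair still pairs with the fireplace because of the shared emotive tags. Raising or lowering the weight of one category would be one possible way to tune how obvious or how subtle the expected association is.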
Mechanical Improvements
Naturally, I have given thought to increasing the computational expense of collecting photo objects from the test and trying to re-create the relationships that are used in the test. In this case, I believe that the advantage lies with the agency administering the test rather than those who try to break it.
This is because attackers can only program computers to recognize the specific photo objects they encounter. They will need to employ human effort to associate the images and rebuild relationships, which is far more difficult in a fluid system than merely collecting images, especially since they can only solve for relationships amongst images they have already encountered (which means the reverse-engineering effort is not easily distributable).
However, there is a very simple way to make it prohibitively difficult to collect and extract the photo objects used in any given collection: the objects would be overlaid on a photo background with a busy texture, using a soft edge and random variations in rotation and scaling. Once all of the images are assembled, the resulting composite would have a randomly modulated blend texture applied to it. The blend texture would be a regular shape repeated at random intervals and positions, and blended using a variety of additive, multiply or subtractive methods with a varying, low alpha.
Since photo objects are inherently more complex than glyphs, less distortion is required in order to render them useless for comparison and classification, yet it is possible to subject them to more distortion and to completely change their orientation while they still remain recognizable. Because of this, the resulting image would still be highly recognizable to humans, but not easily compared to other instances of the same thing.
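The compositing steps in this section can be sketched without any imaging library by working on a small grayscale grid of values between 0.0 and 1.0. The value ranges for scale and alpha, the square 2x2 stamp, the fixed stamp brightness, and the names placement_plan, blend and apply_blend_texture are all assumptions made for the sketch; soft edges and the actual rotation and scaling of pixels are left out.

```python
import random

BLEND_MODES = ("additive", "multiply", "subtractive")

def placement_plan(object_names, canvas_size, rng):
    """Choose a random position, rotation and scale for each photo object."""
    width, height = canvas_size
    return [
        {
            "name": name,
            "x": rng.randrange(width),
            "y": rng.randrange(height),
            "rotation_deg": rng.uniform(0.0, 360.0),
            "scale": rng.uniform(0.7, 1.3),  # assumed range
        }
        for name in object_names
    ]

def blend(base, overlay, mode, alpha):
    """Blend one texture value into one composite value at the given alpha."""
    if mode == "additive":
        mixed = base + overlay
    elif mode == "multiply":
        mixed = base * overlay
    elif mode == "subtractive":
        mixed = base - overlay
    else:
        raise ValueError(f"unknown blend mode: {mode}")
    mixed = min(1.0, max(0.0, mixed))  # clamp to the valid range
    return (1.0 - alpha) * base + alpha * mixed

def apply_blend_texture(pixels, stamps, rng, size=2, stamp_value=0.5):
    """Stamp a square shape at random positions with a random mode and low alpha."""
    height, width = len(pixels), len(pixels[0])
    out = [row[:] for row in pixels]  # leave the input grid untouched
    for _ in range(stamps):
        mode = rng.choice(BLEND_MODES)
        alpha = rng.uniform(0.05, 0.2)  # assumed "low alpha" range
        top = rng.randrange(height - size + 1)
        left = rng.randrange(width - size + 1)
        for y in range(top, top + size):
            for x in range(left, left + size):
                out[y][x] = blend(out[y][x], stamp_value, mode, alpha)
    return out

rng = random.Random(0)  # seeded only so the sketch is repeatable
plan = placement_plan(["dog", "chair"], (100, 80), rng)
composite = apply_blend_texture([[0.5] * 4 for _ in range(4)], stamps=5, rng=rng)
```

Because the position, rotation, scale, blend mode and alpha are drawn afresh for every test, two renderings of the same photo object would rarely share the same pixels, which is the property this section relies on.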
The Case for Restricting the Embodiments to Images
There are some patents that deal with using audio cues to administer a reverse Turing test. Some involve identifying spoken letters and numbers, and one proposal suggests requiring users to type in rhythm with an audio cue.
By extension, you could require test subjects to identify and associate audio cues. With a little imagination, one can easily see how, for example, a human might associate the melody from “Happy Birthday” with the sound of children laughing, or the sound of tissue paper tearing.
Humans also have an innate ability to recognize melodic structure, so a similar test might involve matching snippets of the same melody recorded a cappella or using different instruments.
However, there are two problems with using audio cues in a reverse Turing test. The first is that audio cues are very easy to collect and match by mechanical means, and humans are far more easily confused by alterations in sound than machines. Reversing an audio clip wouldn't disguise it from an audio matching program, but would make it unrecognizable to most humans.
The second problem is simply one of convenience. Most sighted persons primarily interact with computers in a visual manner. They often have their audio turned off, or if not, they are often listening to music or voice recordings while they work.
Most people would consider it an inconvenience to have to turn off their music or turn up their speakers in order to take a test just to access a resource.
There is still a valid case to be made for developing a reliable reverse Turing test for blind persons that uses audio cues, but that is beyond the scope of this invention.