Humans understand and analyze everyday images with little effort. For example, humans understand what objects are in an image, where objects are located relative to other objects in an image, what portion of the image is the background, and what portion of the image is the foreground. Computers, on the other hand, still have trouble with such basic visual tasks as reading distorted text or finding where in an image a simple object is located. Although researchers have proposed and tested many impressive algorithms for computer vision, none have been made to work reliably and generally.
Most of the best approaches for computer vision (e.g. Barnard, K., and Forsyth, D. A. Learning the Semantics of Words and Pictures. International Conference on Computer Vision, 2001, pages 408-415; Duygulu, P., Barnard, K., de Freitas, N., and Forsyth, D. A. Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. European Conference on Computer Vision, 2002, pages 97-112; Russell, B. C., Torralba, A., Murphy, K. P. and Freeman, W. T. LabelMe: a database and web-based tool for image annotation. MIT AI Lab Memo AIM-2005-025, September, 2005; and Schneiderman, H. and Kanade, T. Object Detection Using the Statistics of Parts. International Journal of Computer Vision, 2002.) rely on machine learning: an algorithm is trained to perform a visual task by showing it example images in which the task has already been performed. For example, training an algorithm to test whether an image contains a dog would involve presenting it with multiple images of dogs, each annotated with the precise location of the dog in the image. After processing enough images, the algorithm learns to find dogs in arbitrary images. A major problem with this approach, however, is the lack of training data, which must be prepared by hand. Databases for training computer vision algorithms currently have hundreds or at best a few thousand images (Torralba, A., Murphy, K. P. and Freeman, W. T. The MIT CSAIL Database of objects and scenes. http://web.mit.edu/torralba/www/database.html), which is orders of magnitude less than what is required.
Prior art methods have attempted to gather useful data about images. The ESP Game (von Ahn, L., and Dabbish, L. Labeling Images with a Computer Game. In ACM Conference on Human Factors in Computing Systems (CHI), 2004, pages 319-326) is a two-player game that collects word labels for arbitrary images. The ESP Game collects images from the Web and outputs word labels describing the contents of the images. The game has already collected millions of labels for arbitrary images. Given an image, the ESP Game can be used to determine what objects are in the image, but cannot be used to determine where in the image each object is located. Such location information is necessary for training and testing computer vision algorithms, so the data collected by the ESP Game is not sufficient for some purposes. The present invention improves on the data collected by the ESP Game. For each object in an image, the present invention can be used to output precise location information and other information useful for training computer vision algorithms.
The Open Mind Initiative (e.g., Stork, D. G. and Lam, C. P. Open Mind Animals: Ensuring the quality of data openly contributed over the World Wide Web. AAAI Workshop on Learning with Imbalanced Data Sets, 2000, pages 4-9; Stork, D. G. The Open Mind Initiative. IEEE Intelligent Systems and Their Applications, 14-3, 1999, pp. 19-20) is a worldwide effort to develop “intelligent” software. Open Mind collects data from regular Internet users (referred to as “netizens”) and feeds it to machine learning algorithms. Volunteers participate by answering questions and teaching concepts to computer programs. However, the Open Mind Initiative does not offer a fun experience for the volunteers who participate. Volunteers are not expected to annotate the needed images in the format used by the Open Mind Initiative because there is insufficient incentive or entertainment for doing so.
LabelMe (Russell, B. C., Torralba, A., Murphy, K. P. and Freeman, W. T. LabelMe: a database and web-based tool for image annotation. MIT AI Lab Memo AIM-2005-025, September, 2005) is a web-based tool for image annotation. Anybody can annotate data using this tool and thus contribute to constructing a large database of annotated objects. The incentive to annotate data is the data itself: a user gains access to the database only after annotating a certain number of images. LabelMe relies on people's desire to help and thus assumes that the entered data is correct.
Another area of related work is that of interactively training machine learning algorithms (e.g., Fails, J. A., and Olsen, D. R. A Design Tool for Camera-Based Interaction. In ACM Conference on Human Factors in Computing Systems (CHI), 2003, pages 449-456). In these systems, a user is given immediate feedback about how well an algorithm is learning from the examples that the user provides. As with other prior art attempts, this prior art fails to provide motivation to participants, and it does not have significant controls to ensure that the data collected is accurate.
U.S. Pat. No. 6,935,945, issued to Orak, describes an Internet game in which players are shown portions of an image and the players guess what the image is. The Orak patent, however, fails to teach how useful information about the images can be captured. In addition, the Orak patent fails to teach creating a database of useful information about the images. As a result, the Orak patent may describe an entertaining Internet game, but it fails to teach how to solve the problems in the prior art.
Accordingly, there is a need for a database and a method, apparatus, and system for creating a database with a large number of images, for example, to train computer vision algorithms to recognize one or many different kinds of images. The database should be annotated with information about what objects are in each image, where each object is located, and how much of the image is necessary to recognize each object.
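By way of illustration only, one record in such a database might be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular embodiment: the class names `ObjectAnnotation` and `ImageRecord`, the rectangular bounding box used to express where an object is located, and the `revealed_fraction` field used to express how much of the image is necessary to recognize an object are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    """One annotated object (hypothetical layout, for illustration only)."""
    label: str  # what the object is, e.g. "dog"
    # Where the object is located. A rectangle in pixel coordinates
    # (left, top, right, bottom) is an assumption made for this sketch.
    bounding_box: Tuple[int, int, int, int]
    # Assumed measure of how much of the image is necessary to
    # recognize the object, as a fraction of the image area (0.0 to 1.0).
    revealed_fraction: float


@dataclass
class ImageRecord:
    """One image together with all of its object annotations."""
    image_id: str
    width: int
    height: int
    objects: List[ObjectAnnotation] = field(default_factory=list)

    def labels(self) -> List[str]:
        """Return the labels of every annotated object in the image."""
        return [obj.label for obj in self.objects]

    def box_area_fraction(self, obj: ObjectAnnotation) -> float:
        """Return the share of the image area covered by the object's box."""
        left, top, right, bottom = obj.bounding_box
        return ((right - left) * (bottom - top)) / (self.width * self.height)


# Example usage with made-up values.
record = ImageRecord(image_id="img-0001", width=200, height=200)
record.objects.append(
    ObjectAnnotation(label="dog", bounding_box=(50, 40, 150, 140),
                     revealed_fraction=0.2)
)
print(record.labels())
print(record.box_area_fraction(record.objects[0]))
```

Keeping the label, the location, and the recognition measure together in one annotation means each object in an image carries all three kinds of information named above, and an image containing several objects simply holds several annotations.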