The scientific field of pattern recognition has developed in recent decades to encompass ever more sophisticated kinds of signal descriptions and manipulations. The convergence of advances in pattern recognition, digital image processing, and high speed computing has led to the evolution of a new field of “image understanding”. The goal of this field can be stated as the extraction of semantic level information from a digital image. “Semantic level” information means information equivalent to the higher-level interpretations that people make of images. For example, a person might look at an image and say “This is a picture of my son when he was a baby” or “This is a picture of our daughter's wedding”. Such information is semantic because it incorporates knowledge of concepts that lie beyond the mathematical description of image signals, but that hold rich meaning for people.
Using mechanisms that are still little understood, but that undoubtedly draw on the massive computing resources of the human brain at both the cellular and molecular levels, the eye/brain system converts incoming visual information, which arrives at the retina of the eye and can be measured in physical units, into rich semantic understanding and responses. Although the present state of image understanding technology falls far short of such a sophisticated level of processing, it is still possible to derive a certain amount of semantic information using fairly low-level processing in various artificial learning systems.
A pertinent example is presented in M. Szummer and R. Picard, “Indoor-Outdoor Image Classification”, Proc. IEEE Int'l. Workshop on Content Based Access of Image and Video Databases, January 1998, where the authors describe a system for performing semantic labeling of digital images as indoor or outdoor scenes. From each digitized image, three sets of features were computed. The features consisted of (1) three one-dimensional color histograms with 32 bins per channel; (2) texture measurements computed from a multi-resolution, simultaneous auto-regressive model (MSAR), using the coefficients of best fit to the second order model; and (3) frequency information computed from the 2D Discrete Fourier Transform and Discrete Cosine Transform. These features were extracted from the entire image, and from each 4×4 or 8×8 pixel sub-block. Nearest neighbor classifiers were then trained to label each image sub-block as indoor or outdoor. A global classification for the entire image was then performed by one of a variety of arbitration strategies.
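The block-wise pipeline described above can be illustrated with a minimal sketch. It covers only the color histogram feature (32 bins per channel), a one-nearest-neighbor classifier on each sub-block, and simple majority voting as the arbitration strategy. The function names, the Euclidean distance, the majority-vote rule, and the tie-breaking choice are assumptions made for illustration; they are not taken from the cited paper, and the texture and frequency features are omitted.

```python
import numpy as np

def color_histogram(block, bins=32):
    """Concatenate three 1-D histograms (one per color channel).

    Each histogram is normalized to sum to 1, giving a 3 * bins feature vector.
    """
    feats = []
    for c in range(3):
        hist, _ = np.histogram(block[..., c], bins=bins, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))
    return np.concatenate(feats)

def split_blocks(image, size=8):
    """Cut an H x W x 3 image into non-overlapping size x size pixel tiles.

    Partial tiles at the right and bottom edges are dropped (an assumption).
    """
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def nearest_neighbor_label(feature, train_feats, train_labels):
    """Return the label of the training vector closest in Euclidean distance."""
    dists = np.linalg.norm(train_feats - feature, axis=1)
    return train_labels[int(np.argmin(dists))]

def classify_image(image, train_feats, train_labels, size=8):
    """Label every sub-block, then arbitrate by majority vote.

    Ties are resolved as "indoor" (an arbitrary, hypothetical choice).
    """
    votes = [nearest_neighbor_label(color_histogram(b), train_feats, train_labels)
             for b in split_blocks(image, size)]
    outdoor_votes = sum(v == "outdoor" for v in votes)
    return "outdoor" if 2 * outdoor_votes > len(votes) else "indoor"
```

Here `train_feats` is assumed to be an N×96 array of histograms from labeled sub-blocks and `train_labels` the matching sequence of "indoor"/"outdoor" strings. Other arbitration strategies, such as weighting blocks by position, would replace only the last function.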
In another example of semantic image understanding (M. Gorkani and R. Picard, “Texture orientation for sorting photos at a glance”, Proc. Int'l. Conf. on Pattern Recognition, v1, Jerusalem, Israel, pp. 459–464), the authors present a method that determines whether digital images represent city scenes (dominated by artificial buildings with straight edges and regular patterns of texture details) or natural landscapes.
The images that people create are actually a rich source of information about the events of their lives. Facial information is a particularly rich source of semantic content; however, current systems fail to exploit this information adequately for semantic understanding of images. For example, in U.S. Pat. No. 6,035,055, entitled “Digital Image Management System in a Distributed Data Access Network System”, Wang et al. disclose a system that uses semantic understanding of image data bases in the form of face detection and face recognition to permit access to images in the data base by facial similarity. Facial feature data are extracted from images in the data base and organized into an index structure. Retrieval queries against the data base match the facial feature data obtained from a query image with information in the index structure to retrieve images with similar faces. While this patent uses object recognition technology to raise the abstraction level of data base access operations, it fails to use semantic level understanding of the types of events included in the data base images to further assist the user in understanding the semantic setting of the image. For example, it is not possible to ask the system to retrieve all images of a certain person taken during a birthday party or wedding. Furthermore, this system does not attempt to merge information obtained from multiple images to improve the accuracy of its retrieval operations.
The current state of the art in image understanding is not sufficiently advanced to produce reliable semantic labeling of individual images. A new strategy for performance improvement is needed. This strategy could be based on the observation that images are often collected into groups having similar semantic themes. For example, an entire set of images might have as a main subject a newly born baby, or scenes from a wedding. If the assumption can be made that images in a set tend to relate to a common semantic theme, then the classifier results for individual images, each of which might be unreliable, can be combined using an aggregation-of-evidence approach to yield much higher confidence in the classification of the group as a whole.
Furthermore, labeling performance on individual images must itself be improved. One common technique applies multiple learning systems to the same problem and uses voting or arbitration to resolve conflicting results. For this technique to work, the true detection rate of each individual machine must be very high, so that no single machine misses many of the decisions of interest. Such a scheme can then help weed out false detections.
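The requirement for a high individual detection rate can be seen in a small worked example. The sketch below assumes one particular arbitration rule, namely that a detection is accepted only when every machine agrees, and it assumes the machines err independently. Neither assumption comes from the text, and the rates used are hypothetical.

```python
from math import prod

def unanimous_rates(true_positive_rates, false_positive_rates):
    """Rates of a combined detector that fires only if all machines fire.

    Assumes the machines make statistically independent decisions, so the
    combined rates are the products of the individual rates.
    """
    return prod(true_positive_rates), prod(false_positive_rates)

# Three machines with 95% detection and 10% false positives each:
# combined detection stays near 86%, false positives drop to 0.1%.
strong_tpr, strong_fpr = unanimous_rates([0.95] * 3, [0.10] * 3)

# Three machines with only 70% detection each:
# combined detection collapses to about 34%.
weak_tpr, weak_fpr = unanimous_rates([0.70] * 3, [0.10] * 3)
```

The false positive rate falls sharply in both cases, but only the machines with high individual detection rates keep the combined detection rate usable.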
To the extent that the semantic content of images can be interpreted autonomously, businesses could act on that interpretation to provide high-value services to the people who created the images. It would then become possible to provide imaging products and services whose creation and delivery depend critically on the human-level semantic content of groups of digital images. For example, special pictorial albums containing images from a wedding could be automatically created and provided for sale as part of a photo-finishing business process.