Automatically determining the semantic classification (e.g., sunset, picnic, beach) of an arbitrary image is a difficult problem. Much research has been done recently, and a variety of classifiers and feature sets have been proposed. The most common design for such systems has been to use low-level features (e.g., color, texture) and statistical pattern recognition techniques. Such systems are exemplar-based, relying on learning patterns from a training set (see A. Vailaya, M. Figueiredo, A. Jain, and H. J. Zhang, “Content-based hierarchical classification of vacation images”, Proceedings of IEEE International Conference on Multimedia Computing and Systems, 1999). Such exemplar-based systems are in contrast to model-based systems, in which the characteristics of classes are specified directly using human knowledge, or hybrid systems, in which the model is learned.
Semantic scene classification can improve the performance of content-based image organization and retrieval (CBIR). Many current CBIR systems allow a user to specify an image and search for images similar to it, where similarity is often defined only by color or texture properties. This so-called “query by example” has often proven to be inadequate. Knowing the category of a scene a priori helps narrow the search space dramatically. For instance, knowing what constitutes a party scene allows us to consider only party scenes in our search to answer the query “Find pictures of Mary's birthday party”. This way, the search time is reduced, the hit rate is higher, and the false alarm rate is expected to be lower.
Current scene classification systems enjoy limited success on unconstrained image sets. What are the reasons for this? The primary reason appears to be the incredible variety of images found within most semantic classes. Exemplar-based systems must account for such variation in their training sets. Even hundreds of exemplars do not necessarily capture all of the variability inherent in some classes. Take the class of sunset images as an example. Sunset images captured at various stages of the sunset can vary greatly in color, as the colors tend to become more brilliant as the sun approaches the horizon, and then fade as time progresses further. The composition can also vary, due in part to the camera's field of view: does it encompass the horizon or the sky only? Where is the sun relative to the horizon? Is the sun centered or offset to one side?
A second reason for limited success in exemplar-based classification is that images often contain excessive or distracting foreground regions, which cause the scene to look less prototypical and thus not match any of the training exemplars well. For example, FIG. 1 shows four scenes (a)-(d) with distracting foreground regions. The problem is especially pronounced in consumer images, because the typical consumer pays less attention to composition and lighting than a professional photographer would. Consumer images therefore contain greater variability, and many existing systems that perform well on professionally taken stock photo libraries perform worse in this domain.
Consequently, a need exists for a method that overcomes the above-described issues in image classification. These issues are addressed by introducing the concept of spatial image recomposition, designed to minimize the impact of undesirable composition (i.e., foreground objects), and of simulated or effective temporal image recomposition, designed to minimize the effects of color changes occurring over time.
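As a purely illustrative sketch of the two ideas named above, the following Python fragment generates recomposed variants of an image. It assumes that spatial recomposition can be approximated by cropping a margin from one edge at a time, and that simulated temporal recomposition can be approximated by shifting the red and blue channels toward warmer or cooler tones. The function names, the crop scheme, and the color shifts are hypothetical choices made for illustration only; the text above does not specify them.

```python
# Illustrative sketch only. An image is a list of rows, each row a list of
# (r, g, b) tuples with values in 0..255. All names here are hypothetical.

def crop(image, top, bottom, left, right):
    """Return a copy of image with the given number of rows or columns
    removed from each edge."""
    height = len(image)
    width = len(image[0])
    return [row[left:width - right] for row in image[top:height - bottom]]


def spatial_variants(image, margin):
    """Assumed form of spatial recomposition: the original image plus four
    crops, each dropping `margin` pixels from one edge, so that a
    distracting region near that edge is excluded."""
    return [
        image,
        crop(image, margin, 0, 0, 0),   # drop top rows
        crop(image, 0, margin, 0, 0),   # drop bottom rows
        crop(image, 0, 0, margin, 0),   # drop left columns
        crop(image, 0, 0, 0, margin),   # drop right columns
    ]


def shift_color(image, dr, dg, db):
    """Add a per-channel offset to every pixel, clamped to 0..255."""
    def clamp(value):
        return max(0, min(255, value))
    return [
        [(clamp(r + dr), clamp(g + dg), clamp(b + db)) for (r, g, b) in row]
        for row in image
    ]


def temporal_variants(image, step):
    """Assumed form of simulated temporal recomposition: the original image
    plus a warmer and a cooler version, standing in for the same scene
    captured earlier or later."""
    return [
        image,
        shift_color(image, step, 0, -step),   # warmer: more red, less blue
        shift_color(image, -step, 0, step),   # cooler: less red, more blue
    ]


if __name__ == "__main__":
    gray = [[(100, 100, 100)] * 4 for _ in range(4)]
    print(len(spatial_variants(gray, 1)), len(temporal_variants(gray, 20)))
```

Each variant could then be passed to a classifier alongside the original image; how the individual results are combined is left open here.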
This approach is supported by past success in other domains. In face recognition and detection, researchers have used perturbed versions of faces in training in order to handle geometric variation (e.g., see H. Rowley, S. Baluja, and T. Kanade, “Rotation invariant neural network-based face detection”, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1998). This is related to resampling or bootstrapping. In addition, bagging (bootstrap aggregation) draws multiple versions of a training set, uses each version to train a different component classifier, and bases the final classification decision on the vote of each component classifier (see R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, New York, 2001, pp. 475-476).
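The bagging procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation taken from the cited reference: the component classifier is a toy nearest-centroid rule on a single feature, and the feature (a hypothetical "warm color fraction"), the class labels, and all function names are invented for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]


def train_centroid_classifier(samples):
    """Toy component classifier (an assumption for this sketch):
    store the mean feature value of each class."""
    sums, counts = {}, {}
    for feature, label in samples:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, feature):
    """Return the label whose stored mean is closest to the feature."""
    return min(centroids, key=lambda label: abs(centroids[label] - feature))


def train_bagged(data, n_models, seed=0):
    """Train one component classifier per bootstrap version of the data."""
    rng = random.Random(seed)
    return [
        train_centroid_classifier(bootstrap_sample(data, rng))
        for _ in range(n_models)
    ]


def bagged_predict(models, feature):
    """Final decision: majority vote over the component classifiers."""
    votes = Counter(predict(model, feature) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical training set: (warm color fraction, class label).
training_data = (
    [(0.80 + 0.01 * i, "sunset") for i in range(10)]
    + [(0.10 + 0.01 * i, "other") for i in range(10)]
)

if __name__ == "__main__":
    models = train_bagged(training_data, n_models=11)
    print(bagged_predict(models, 0.85), bagged_predict(models, 0.12))
```

An odd number of component classifiers is used so that a two-class vote cannot tie.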