Automatically determining the semantic classification (e.g., mountain, sunset, indoor) of an arbitrary image has many useful applications. It can help consumers to organize their digital photographs into semantic categories. It can also make camera- and minilab-based digital enhancement and manipulation more powerful. Rather than applying generic enhancement algorithms (e.g., color balancing) to all images, scene knowledge can allow us to use customized, scene-specific algorithms.
Semantic scene classification has been studied extensively in recent years (see, for example, A. Vailaya, M. Figueiredo, A. Jain, and H. J. Zhang, “Content-based hierarchical classification of vacation images”, Proceedings of IEEE International Conference on Multimedia Computing and Systems, 1999). Most current classifiers use only the low-level content (e.g., colors, textures, edges) of the image and have achieved some success on constrained image sets (e.g., the Corel stock photo collection). However, on unconstrained consumer images, scene classification remains very much an open problem, especially when only image (e.g., pixel) information is used.
Information beyond pure scene content has only recently started to be exploited to help scene classification. One untapped source of context is temporal: the images surrounding the image being classified. Consider human behavior as an example: when humans classify a sequence of images, they tend to assume that neighboring images are related, unless the scene content changes dramatically. The reason behind this subconscious assumption is that real-world events occur consecutively and sequentially in terms of subject, time, and location, and are recorded accordingly by cameras. In applications involving image collections where images are clustered sequentially, surrounding images can be used as context. This is the case for indoor/outdoor and sunset scene classification, as well as for image orientation detection.
Time and date information, if accurate, could be used to derive seasonal variations that could prime content-based object and scene detectors (e.g., sunrise, night, or snow detectors). However, to be accurate, this would also need to be coupled with the geographic location in which the image was captured (e.g., time of sunrise is primarily a function of degrees longitude). While this may be possible in the future, as GPS, cellular-phone, and digital camera technologies continue to merge, it is not currently available. Furthermore, many amateur photographers do not set the clocks on their cameras correctly, so absolute time information appears too unreliable to use.
Relative time information (elapsed time between photographs) has been used successfully to cluster or group photographs by events (see, for example, J. Platt, “AutoAlbum: Clustering digital photographs using probabilistic model merging”, in IEEE Workshop on Content-based Access of Image and Video Libraries, 2000, and J. Platt, M. Czerwinski, and B. Field, “PhotoTOC: Automatic clustering for browsing personal photographs”, Microsoft Research Technical Report MSR-TR-2002-17, February 2002), complementing content-based clustering strategies. Loui and Savakis, in “Automatic image event segmentation and quality screening for albuming applications”, Proceedings of IEEE International Conference on Multimedia and Expo, New York, July 2000, assume that time metadata is available and that intra-event time differences are smaller than inter-event differences. This leads naturally to their event segmentation algorithm: perform 2-means clustering on the time-difference histogram. The histogram is appropriately scaled to perform meaningful clustering.
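The idea of separating small intra-event gaps from large inter-event gaps with 2-means clustering can be illustrated with a minimal sketch. This is not the Loui and Savakis implementation: it clusters the raw list of time gaps rather than a histogram, the logarithmic scaling is an assumed stand-in for the "appropriate scaling" mentioned above, and the function names `two_means_1d` and `segment_events` are hypothetical.

```python
import math

def two_means_1d(values, iters=100):
    """Plain 1-D 2-means; returns (low_center, high_center)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all gaps look alike: no second cluster
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return lo, hi

def segment_events(timestamps):
    """Split sorted capture times (in seconds) into events at large gaps."""
    if len(timestamps) < 2:
        return [list(timestamps)]
    # Assumption: log scaling compresses the wide range of elapsed times.
    gaps = [math.log1p(b - a) for a, b in zip(timestamps, timestamps[1:])]
    lo, hi = two_means_1d(gaps)
    events = [[timestamps[0]]]
    for t, g in zip(timestamps[1:], gaps):
        if lo != hi and abs(g - hi) < abs(g - lo):
            events.append([t])      # gap belongs to the "inter-event" cluster
        else:
            events[-1].append(t)    # gap belongs to the "intra-event" cluster
    return events

# Three photos a minute apart, three more two hours later, one the next day.
print(segment_events([0, 60, 120, 7200, 7260, 7320, 90000]))
```

Note that any clustering error made at this stage is a hard decision: a misplaced boundary is passed on unchanged to whatever processing follows.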
Using elapsed time is becoming more popular in related fields as well; for example, Mulhem and Lim recently used the classification of images within a cluster to improve image retrieval (“Home photo retrieval: time matters”, Lecture Notes in Computer Science, 2728:321-330, 2003). Their metric for relevance between a query and a database image D incorporates both the match between the query and D and the match between the query and the best-matching image in the same temporal cluster as D.
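A cluster-aware relevance score of this general kind can be sketched as follows. The text above does not say how the two matches are combined, so the weighted sum and the weight `alpha` are assumptions made purely for illustration, and `cluster_aware_relevance` is a hypothetical name, not a function from the cited work.

```python
def cluster_aware_relevance(query_sim, cluster_of, alpha=0.5):
    """Blend each image's own match with the best match in its temporal cluster.

    query_sim:  dict mapping image id -> similarity to the query (higher is better)
    cluster_of: dict mapping image id -> temporal cluster id
    alpha:      assumed weight on the image's own match (illustrative only)
    """
    # Best query match found within each temporal cluster.
    best_in_cluster = {}
    for image, sim in query_sim.items():
        c = cluster_of[image]
        best_in_cluster[c] = max(best_in_cluster.get(c, sim), sim)
    # Combine the image's own match with its cluster's best match.
    return {
        image: alpha * sim + (1 - alpha) * best_in_cluster[cluster_of[image]]
        for image, sim in query_sim.items()
    }

sims = {"a": 0.2, "b": 0.9, "c": 0.4}
clusters = {"a": 0, "b": 0, "c": 1}
scores = cluster_aware_relevance(sims, clusters)
```

In this example, image "a" matches the query poorly on its own but is ranked above "c" because it shares a temporal cluster with the strong match "b".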
However, in contrast to image clustering (e.g., Loui and Savakis) and image retrieval (e.g., Mulhem and Lim), there has not been any known attempt at using temporal context in image classification, where an image is assigned to a semantic scene category. While one could use clustering as a precursor to classification, this is not necessarily the best approach, since clustering errors would propagate to the classification stage, degrading performance. Operating without clustering also avoids the computational overhead of performing clustering in advance. In addition, it is advantageous to use a probabilistic framework for modeling and enforcing temporal context, as opposed to a handcrafted rule-based system such as that of Mulhem and Lim.
Consequently, a need exists for a method that takes advantage of temporal context to improve image classification and to overcome the issues described above. These issues are addressed by first classifying images in isolation using a content-based classifier, and then imposing a proper temporal context model (e.g., a Markov chain) over entire sequences of images, thereby correcting mistakes made by the content-based classifier.
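One way to picture such a two-stage scheme is a first-order Markov chain decoded with the Viterbi algorithm, sketched below under stated assumptions. The text names a Markov chain only as an example of a temporal context model, so the use of Viterbi decoding, the treatment of per-image classifier outputs as observation scores, the function name `viterbi`, and all numeric values are illustrative assumptions. All probabilities must be nonzero because the sketch works in log space.

```python
import math

def viterbi(beliefs, transition, prior):
    """Most likely class sequence under a first-order Markov chain.

    beliefs[t][s]:    content-based score for class s on image t (assumed > 0)
    transition[r][s]: P(class s at image t | class r at image t-1)
    prior[s]:         P(class s) for the first image
    """
    n = len(prior)
    score = [math.log(prior[s]) + math.log(beliefs[0][s]) for s in range(n)]
    back = []  # back[t-1][s] = best predecessor of class s at image t
    for obs in beliefs[1:]:
        prev, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev[r] + math.log(transition[r][s]))
            ptr.append(best)
            score.append(prev[best] + math.log(transition[best][s]) + math.log(obs[s]))
        back.append(ptr)
    # Trace the best path backwards from the best final class.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Classes: 0 = outdoor, 1 = indoor. Classified in isolation, the third image
# would be labeled indoor (0.6 > 0.4) although its neighbors are clearly outdoor.
beliefs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1], [0.85, 0.15]]
transition = [[0.9, 0.1], [0.1, 0.9]]  # assumed: neighboring images tend to agree
prior = [0.5, 0.5]
print(viterbi(beliefs, transition, prior))
```

Here the weak indoor vote on the third image is outweighed by the cost of switching class twice, so the sequence-level decision overrides the isolated one without any prior clustering step.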