Image classification systems (i.e. systems in which the content of a single image or photograph is analyzed to determine an appropriate label or descriptor for the image) are known in the art. Such systems are generally used to label or classify images according to predefined textual descriptors. Typically, an image classification system analyzes an image via the use of one or more “classifier” algorithms (described in greater detail below) that identify a predefined label that matches or partially matches an image based on the image content and associate the identified label with the image. For example, an image of a horse on a farm may be labeled “horse,” or “farm,” or both. In some systems, an image or photo may be labeled according to broad categories of image content (e.g. indoor or outdoor, city or landscape, etc.), whereas other systems utilize more narrow categories (e.g. desert, ocean, forest, car, person, etc.). Some systems even classify images based on identified persons in the image (e.g. celebrities, political figures, etc.), objects in the image, etc. These labels or classifications are useful for a variety of purposes, such as association with metadata tags or other identification mechanisms for use in image indexing and retrieval systems, surveillance and security systems, and other similar image recognition purposes.
Such image classification systems utilize a variety of methods to classify images, with varying results. One such technique involves examining the power spectrum of an image in conjunction with Principal Components Analysis (PCA) to identify the type of content in the image, as described in A. Torralba and A. Oliva, Statistics of Natural Image Categories, Network: Computation in Neural Systems, vol. 14, pp. 391-412 (2003). Other approaches include using a “bag of words” with Scale Invariant Feature Transform (SIFT) descriptors (see P. Quelhas and J. Odobez, Natural Scene Image Modeling Using Color and Texture Visterms, Conference on Image and Video Retrieval (CIVR), Phoenix Ariz. (2006)) in combination with Latent Dirichlet Allocation (see L. Fei-Fei and P. Perona, A Bayesian Hierarchical Model for Learning Natural Scene Categories, IEEE Conference on Computer Vision and Pattern Recognition (2005)), probabilistic Latent Semantic Analysis (see A. Bosch et al., Scene Classification Via pLSA, ECCV (4), pp. 517-30 (2006)), or a spatial pyramid (see S. Lazebnik et al., Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2169-78 (2006)).
Additional approaches to image classification include using a Two-Dimensional (2D) hidden Markov model (see J. Li and J. Z. Wang, Automatic Linguistic Indexing of Pictures by a Statistical Model Approach, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 10 (2003)), as well as a wavelet coefficients representation of features with hierarchical Dirichlet process hidden Markov trees (see J. J. Kivinen et al., Learning Multiscale Representations of Natural Scenes Using Dirichlet Processes, IEEE 11th International Conference on Computer Vision (2007)). Still further image classification systems divide an image into a rectangular grid and classify the proportion of “material” (i.e. category of content) in each grid cell (see, e.g., J. Shotton et al., Semantic Texton Forests for Image Categorization and Segmentation, IEEE Computer Vision and Pattern Recognition (2008); J. Vogel and B. Schiele, Natural Scene Retrieval Based on a Semantic Modeling Step, Conference on Image and Video Retrieval (CIVR) (2004); etc.). In these systems, the occurrence of each material over the image is computed and the image is classified based on the resulting material occurrence vector.
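By way of illustration only, the material-occurrence approach described above can be sketched as follows. This is a minimal sketch and not the implementation of any cited system: it assumes each grid cell has already been assigned a single dominant material, and the material set, the prototype vectors, and the nearest-prototype rule are hypothetical choices made for the example.

```python
from collections import Counter
import math

# Hypothetical material set; the cited systems define their own categories.
MATERIALS = ["sky", "water", "sand", "foliage"]

def occurrence_vector(cell_labels):
    """Fraction of grid cells assigned to each material, in MATERIALS order."""
    flat = [label for row in cell_labels for label in row]
    counts = Counter(flat)
    return [counts[m] / len(flat) for m in MATERIALS]

def classify(vector, prototypes):
    """Return the scene class whose prototype vector is nearest (Euclidean)."""
    return min(prototypes, key=lambda name: math.dist(vector, prototypes[name]))

# Hypothetical per-class prototype occurrence vectors (assumed, not learned).
PROTOTYPES = {
    "beach": [0.4, 0.3, 0.3, 0.0],
    "forest": [0.2, 0.0, 0.0, 0.8],
}

# A 3x3 grid of per-cell material labels for one image.
grid = [
    ["sky", "sky", "sky"],
    ["water", "water", "sky"],
    ["sand", "sand", "sand"],
]

vec = occurrence_vector(grid)
label = classify(vec, PROTOTYPES)
```

Note that the occurrence vector discards the position of each material within the grid, so two images with the same proportions of materials receive the same vector regardless of layout.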
Regardless of the specific approach, conventional image classification systems are ill-equipped to classify videos or portions of videos. Conventional systems are designed to analyze individual images in which the subject of the image (i.e. the scene) is carefully and clearly framed, whereas videos typically include a variety of types of images or frames, many of which are blurry or contain occluded portions. Additionally, the features used in single-image classification systems are often designed for narrow and particular purposes, and are unable to identify and classify the wide array of content present in most videos. Further, even if conventional systems were able to classify images from a video, these systems include no defined mechanism to account for the presence of a multitude of scene types across a video or portion of video (i.e. identification or classification of a single image or frame in a video does not necessarily indicate that the entire shot within the video from which the frame was extracted corresponds to the identified image class). As used herein, a “shot” defines a unit of action in a video filmed without interruption and comprising a single camera view.
In addition to the difficulties mentioned above, classification of video, or shots within video, presents further challenges because of the variation in content and quality of the images present in most videos. In most video sequences, only part of the scene is visible in most frames. As used herein, “scene” refers to the setting or content of the image or video to be classified (i.e. the context or environment of a video shot) (e.g. desert, mountainous, sky, ocean, etc.). In many videos, wide-angle shots are interspersed with close-up shots. During the close-up shots, the camera is typically focused on the subject of interest, often resulting in a blurred background, thus obscuring any part of the scene type that is visible. Most videos also include shots in which either the camera or objects within the scene are moving, again causing blurring of the images within the shot.
Additionally, scene content in videos often varies immensely in appearance, resulting in difficulty in identification of such content. For example, images of buildings vary in size, color, shape, materials from which they are made, etc.; trees change appearance depending on the season (i.e. leaves change color in the fall, branches become bare during the winter, etc.); snow may be present in any type of outdoor scene; etc. In addition, the subject of a video shot may be filmed from different angles within the shot, causing the subject to appear differently across frames in the shot. Thus, because video often represents wide varieties of content and subjects, even within a particular content type, identification of that content is exceedingly difficult.
Further, raw or basic features, which are sufficient for some conventional image classification systems, are insufficient for a video classification system because videos typically include a multiplicity of image types. For example, the color distribution may be the same for a beach shot with white sand as for a snow-covered prairie, or for an ocean shot as for a sky shot, etc. Additionally, the mere detection or identification of a color or type of material in a scene does not necessarily enable classification of the scene. For example, a snow-tipped mountain covered with forest has a distribution of materials and colors similar to that of a close-up view of evergreen trees emerging from a snow-blanketed base. Accordingly, the use of strong features, as well as the spatial arrangement of materials identified by those features, is helpful in labeling the wide variety of images in video to enable accurate classification of shots within video.
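The role of spatial arrangement can be illustrated with a toy example. The following is a minimal sketch under stated assumptions: each "image" is a small grid of already-identified material labels, and the helper names `global_histogram` and `band_histograms` are hypothetical. It shows that two scenes with identical overall material distributions become distinguishable once the distribution is computed separately for an upper and a lower band of the image.

```python
from collections import Counter

def global_histogram(image):
    """Count of each material label over the whole image, ignoring position."""
    return Counter(label for row in image for label in row)

def band_histograms(image, bands=2):
    """One histogram per horizontal band, preserving coarse vertical layout."""
    rows_per_band = len(image) // bands
    return [
        global_histogram(image[i * rows_per_band:(i + 1) * rows_per_band])
        for i in range(bands)
    ]

# Snow-tipped mountain covered with forest: snow above, trees below.
mountain = [
    ["snow", "snow"],
    ["tree", "tree"],
]

# Close-up of evergreens emerging from a snow-blanketed base: trees above, snow below.
closeup = [
    ["tree", "tree"],
    ["snow", "snow"],
]

# The global distributions match, so they cannot separate the two scenes.
same_globally = global_histogram(mountain) == global_histogram(closeup)

# The per-band distributions differ, so the layout separates them.
same_by_band = band_histograms(mountain) == band_histograms(closeup)
```

Here `same_globally` evaluates to `True` and `same_by_band` to `False`, which is the distinction the paragraph above draws between material detection alone and material detection combined with spatial arrangement.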
One system that attempts to overcome the previously-described hurdles in order to classify videos is the “Vicar” system, described in M. Israel et al., Automating the Construction of Scene Classifiers for Content-Based Video Retrieval, MDM/KDD'04 (2004). The Vicar system selects one or more representative or “key” frames from a video, and divides each of the key frames into a grid. Each grid cell is further divided into rectangular “patches,” and each patch is classified into a general category (e.g. sky, grass, tree, sand, building, etc.) using color and texture features and a k-Nearest Neighbor classifier. The frequency of occurrence of each category in each grid cell is computed and used to classify the overall image. This system infers that if a representative frame or frames comprise a certain type of image, then the entire shot or video likely corresponds to the same type, and is thus labeled accordingly.
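The patch-classification and frequency-counting steps described above can be sketched as follows. This is an illustrative sketch only and is not the Vicar implementation: the two-dimensional feature vectors, the training samples, the value of k, and the function names are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_label(feature, training, k=3):
    """Majority label among the k training samples nearest to `feature`."""
    nearest = sorted(training, key=lambda s: math.dist(feature, s[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cell_frequencies(patch_features, training, k=3):
    """Relative frequency of each category among the patches of one grid cell."""
    labels = [knn_label(f, training, k) for f in patch_features]
    counts = Counter(labels)
    return {category: n / len(labels) for category, n in counts.items()}

# Hypothetical (color, texture) feature pairs with hand-assigned categories.
TRAINING = [
    ((0.10, 0.10), "sky"), ((0.20, 0.10), "sky"), ((0.15, 0.20), "sky"),
    ((0.80, 0.90), "grass"), ((0.90, 0.80), "grass"), ((0.85, 0.85), "grass"),
]

# Feature vectors for the four patches of a single grid cell.
cell = [(0.12, 0.15), (0.18, 0.12), (0.88, 0.90), (0.10, 0.20)]

freqs = cell_frequencies(cell, TRAINING)
```

In this toy cell, three patches fall nearest the "sky" samples and one falls nearest the "grass" samples, so the cell contains more than one category. A classifier for the overall image would then operate on the per-cell frequencies collected from every grid cell of the key frame.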
The Vicar system, however, has many drawbacks that produce inconsistent results. For example, selection of key frames is a relatively arbitrary process, and an easily-classifiable frame (i.e. clear, non-occluded, etc.) is not necessarily representative of the scene type(s) associated with a shot or video from which the frame was selected. Further, the key frames are partitioned based on a predetermined grid, such that resulting grid cells may (and often do) contain more than one category, thus leading to confusion of scene types. Also, the color and texture features used in the system are relatively weak features, which are inadequate for classifying many categories of images. Additionally, the inference that a key frame or frames adequately and accurately represent an entire sequence of frames does not take into account variations in shots, especially for long or extended shots in videos.
Video classification has many practical uses. For example, accurate and efficient classification of video, or shots within video, enables content-based video indexing and retrieval. Such indexing and retrieval is useful for cataloguing and searching large databases of videos and video clips for use in promotional advertisements, movie and television trailers, newscasts, etc. Additionally, by classifying videos and thus narrowing the scope of videos that contain certain subject matter, the processing times and accuracy of other, related image or video analysis algorithms are improved. Further, identification and classification of disparate shots within a video enables shot boundary detection and indexing associated with the video.
For these and many other reasons, there is a continuing need for a system or method that accurately and efficiently classifies shots in video based on a plurality of images or frames associated with the shot. There is a further need for a system that is able to classify shots as belonging to multiple classes of scene types, and identify particular timecodes within the video shot at which scene classes vary.