Image/video classification involves categorizing a collection of unlabeled images into a set of predefined classes for semantic-level image retrieval. In some approaches, an image is modeled by segmenting it into patches, and the patches are then compared to a reference image based on attributes of each patch, such as color and texture. An additional factor that may be considered in image classification is the spatial context between the local patches of an image. Spatial-contextual models attempt to depict the spatial structures of the images in a class by constructing one common model for each image category.
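The patch-based comparison described above can be illustrated with a minimal sketch. It assumes non-overlapping square patches, uses mean color as the only patch attribute, and measures similarity by Euclidean distance; the function names, the patch size, and the toy images are hypothetical and are not drawn from any particular system.

```python
from math import sqrt


def split_into_patches(image, patch_size):
    """Split a grid of (r, g, b) pixels into non-overlapping square patches.

    Returns a dict mapping (patch_row, patch_col) to that patch's pixel list.
    """
    height, width = len(image), len(image[0])
    patches = {}
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            pixels = [
                image[y][x]
                for y in range(top, top + patch_size)
                for x in range(left, left + patch_size)
            ]
            patches[(top // patch_size, left // patch_size)] = pixels
    return patches


def mean_color(pixels):
    """Average each color channel over the pixels of one patch."""
    count = len(pixels)
    return tuple(sum(p[c] for p in pixels) / count for c in range(3))


def patch_distances(image, reference, patch_size):
    """Euclidean distance between mean colors of corresponding patches."""
    image_patches = split_into_patches(image, patch_size)
    reference_patches = split_into_patches(reference, patch_size)
    distances = {}
    for position, pixels in image_patches.items():
        if position not in reference_patches:
            continue
        a = mean_color(pixels)
        b = mean_color(reference_patches[position])
        distances[position] = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return distances


# Toy 4x4 images: the test image is red on top and blue below,
# while the reference image is entirely red.
RED, BLUE = (255, 0, 0), (0, 0, 255)
test_image = [[RED] * 4, [RED] * 4, [BLUE] * 4, [BLUE] * 4]
reference_image = [[RED] * 4 for _ in range(4)]

distances = patch_distances(test_image, reference_image, patch_size=2)
```

In this toy case the two upper patches match the reference exactly, while the two lower patches differ. A comparison of this kind treats each patch independently, which is why the spatial context between patches is considered as an additional factor.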
In one example, a two-dimensional Hidden Markov Model (2D HMM) may be used for image categorization by generating a learned model from a training set of images for each image class. The learned model is then used to score the probability that an unlabeled image belongs to a given class of images. However, a subject image category may have a large intra-class variance, making it difficult to represent the various spatial contexts of different images using a single model. For example, the images for a specific category may differ by view, such as top view, side view, front view, and back view. Each view may have a different spatial context related to its respective local patches. These differences may reduce the ability of a single model to capture a large intra-class variance between images.
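The one-model-per-class scheme described above can be sketched as follows. For brevity, this sketch substitutes a diagonal Gaussian over a small feature vector for the 2D HMM, so it illustrates only the train-then-score structure and not spatial-context modeling; the class labels, feature values, and function names are hypothetical.

```python
from math import log, pi


def fit_gaussian(samples):
    """Fit a diagonal Gaussian (per-dimension mean and variance) to samples."""
    count = len(samples)
    dims = len(samples[0])
    means = [sum(s[d] for s in samples) / count for d in range(dims)]
    variances = [
        # Floor the variance so a constant feature cannot cause division by zero.
        max(sum((s[d] - means[d]) ** 2 for s in samples) / count, 1e-6)
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(model, features):
    """Log-probability density of a feature vector under one class model."""
    means, variances = model
    return sum(
        -0.5 * (log(2 * pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(features, means, variances)
    )


def classify(models, features):
    """Return the label whose model assigns the highest log-likelihood."""
    return max(models, key=lambda label: log_likelihood(models[label], features))


# Hypothetical training features (for example, two color statistics per image).
training_set = {
    "sky": [(0.10, 0.20), (0.20, 0.10), (0.15, 0.15)],
    "grass": [(0.80, 0.90), (0.90, 0.80), (0.85, 0.85)],
}

# One learned model per image class.
models = {label: fit_gaussian(samples) for label, samples in training_set.items()}

predicted = classify(models, (0.12, 0.18))
```

Because each class is summarized by a single model, training images of one class that differ widely, such as different views of the same object, would be averaged together, which corresponds to the limitation noted above.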