Image scene recognition refers to automatically processing and analyzing an image by using its visual information, so as to determine and recognize a particular scene (for example, a kitchen, a street, or a mountain) in the image. Determining the scene in an image not only contributes to understanding the overall semantic content of the image but also provides a basis for recognizing specific targets and events in the image. Therefore, scene recognition plays an important role in automatic image understanding by a computer. Scene recognition technology may be applied to many practical problems, such as intelligent image management and retrieval.
In an existing scene recognition technology, visual information of an image is first described; this process is also referred to as visual feature extraction. Then, the extracted visual features are matched against templates (or classified by classifiers) that have been acquired in advance for different scenes, and a final scene recognition result is obtained.
A general method for extracting visual features is to calculate statistics that represent low-level visual information in an image. These visual features include features that describe color information, texture information, shape information, and the like. After the low-level visual features are obtained, they are classified by using a classifier trained beforehand, and a final recognition result is obtained. A main drawback of this method is that low-level visual features have a limited capability to distinguish different scenes: scenes with similar color and texture information (for example, a study room and a library) cannot be effectively distinguished or recognized, thereby affecting scene recognition performance.
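The low-level approach described above can be illustrated with a minimal sketch. It assumes a joint RGB color histogram as the low-level statistic and a nearest-centroid rule as the pre-trained classifier; neither choice is specified by the existing technology, and the function names, the synthetic images, and the scene labels are hypothetical. Because the histogram discards spatial layout, two different images with the same color distribution yield the same feature, so the classifier cannot tell them apart.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Joint RGB histogram of a uint8 image (H, W, 3), normalized to sum to 1."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / len(pixels)

def nearest_centroid(feature, centroids):
    """Return the label whose stored feature is closest (Euclidean distance)."""
    labels = list(centroids)
    distances = [np.linalg.norm(feature - centroids[label]) for label in labels]
    return labels[int(np.argmin(distances))]

# Synthetic stand-ins for real photographs (assumption for illustration only).
rng = np.random.default_rng(0)
study_room = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
# Same pixels in a different spatial arrangement: a different image
# with an identical color distribution.
library = rng.permutation(study_room.reshape(-1, 3)).reshape(32, 32, 3)
street = np.zeros((32, 32, 3), dtype=np.uint8)

f_study = color_histogram(study_room)
f_library = color_histogram(library)

# Hypothetical "trained" classifier: one stored feature per known scene.
centroids = {"study room": f_study, "street": color_histogram(street)}
predicted = nearest_centroid(f_library, centroids)
print(predicted)
```

In this sketch the "library" image is assigned the "study room" label because its color statistics match exactly, which mirrors the drawback stated above.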
In another existing method, a middle-level feature representation (also referred to as an “attribute”) is used to perform scene recognition. In such a method, a large number of visual concept detectors need to be designed first. The detection results of the visual concept detectors are concatenated to form a middle-level feature representation. Finally, this feature is classified by using a classifier, and a final recognition result is obtained. Main drawbacks of this method include: 1. a detection result for a whole labeled object (for example, an “athlete” or a “football”) is used as a middle-level feature, which has a limited description capability, and if only a part of an object is present in a scene (for example, only a leg of an athlete is shown), the object cannot be detected; and 2. the detector set may be redundant, that is, one detector is trained for each type of object labeled in each training image set, and because some object types may be similar in meaning (for example, a “referee” and an “athlete”), the detectors trained for these types are repetitive or highly similar; on the one hand, this causes the curse of dimensionality in the feature information, and on the other hand, results that are detected repeatedly tend to suppress detection results that are rarely present, thereby affecting scene recognition performance.
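The middle-level representation and its redundancy problem can be sketched as follows. The sketch assumes that each visual concept detector is a simple linear scorer followed by a sigmoid; this is an illustrative assumption, not the detector design of the existing method. The weights, the detector names, and the similarity threshold are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attribute_vector(low_level_feature, detectors):
    """Concatenate the scores of all concept detectors into one feature."""
    return np.array([sigmoid(w @ low_level_feature)
                     for w in detectors.values()])

def redundant_pairs(detectors, threshold=0.95):
    """List detector pairs whose weight vectors are nearly parallel."""
    names = list(detectors)
    weights = np.stack([detectors[name] for name in names])
    weights = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    similarity = weights @ weights.T  # pairwise cosine similarity
    return [(names[i], names[j])
            for i in range(len(names))
            for j in range(i + 1, len(names))
            if similarity[i, j] > threshold]

# Hypothetical detectors: "referee" is almost a copy of "athlete".
detectors = {
    "athlete":  np.array([1.00, 0.00, 0.0, 0.0]),
    "referee":  np.array([0.99, 0.05, 0.0, 0.0]),
    "football": np.array([0.00, 0.00, 1.0, 0.0]),
}

x = np.array([1.0, 0.0, 0.0, 0.0])  # low-level feature of one image
attributes = attribute_vector(x, detectors)
pairs = redundant_pairs(detectors)
print(attributes, pairs)
```

Here the "athlete" and "referee" entries of the attribute vector carry almost the same information, so the feature grows in dimension without gaining description capability, and the duplicated response outweighs the single "football" response.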