Semantic understanding of visual media, such as an image or a video capture of a scene, requires recognition of various semantic concepts associated with the scene. In such applications, recognition refers to any one of a detection, classification, categorisation or localisation operation, while a semantic concept refers to an object, action, interaction or event associated with the scene. The categorisation of the scene, such as home, office, street view, beach or park, is also a semantic concept.
Semantic concepts associated with a scene may be related. For example, the types of objects and actions expected to appear in an indoor home scene are different to the types of objects and actions expected for an outdoor beach scene or a train station. Thus, it is desirable to exploit the relationship between various concepts associated with a scene. One type of approach to exploiting such relationships is to use a probabilistic model of a scene and jointly recognise the various concepts associated with the scene.
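By way of illustration only, joint recognition with a probabilistic model may be sketched as follows. The scene and action labels, the probability tables and the function name `joint_recognition` are hypothetical values chosen for the example and are not taken from any particular system. The sketch scores every (scene, action) pair by combining per-concept evidence with a conditional probability linking the two concepts, and selects the most probable pair.

```python
from itertools import product

# Hypothetical prior over scene categories (illustrative values only).
SCENE_PRIOR = {"home": 0.5, "beach": 0.5}

# Hypothetical conditional probabilities P(action | scene).
ACTION_GIVEN_SCENE = {
    "home": {"cooking": 0.8, "surfing": 0.2},
    "beach": {"cooking": 0.1, "surfing": 0.9},
}
ACTIONS = ("cooking", "surfing")


def joint_recognition(scene_likelihood, action_likelihood):
    """Return the most probable (scene, action) pair and the joint posterior.

    Each likelihood maps a label to the evidence that per-concept features
    provide for that label, considered in isolation.
    """
    scores = {}
    for scene, action in product(SCENE_PRIOR, ACTIONS):
        scores[(scene, action)] = (
            SCENE_PRIOR[scene]
            * ACTION_GIVEN_SCENE[scene][action]
            * scene_likelihood[scene]
            * action_likelihood[action]
        )
    total = sum(scores.values())
    posterior = {pair: score / total for pair, score in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# The action evidence alone slightly favours "surfing", but strong evidence
# for a home scene changes the jointly recognised action.
scene_evidence = {"home": 0.9, "beach": 0.1}
action_evidence = {"cooking": 0.45, "surfing": 0.55}
best, posterior = joint_recognition(scene_evidence, action_evidence)
print(best)  # ('home', 'cooking')
```

In this example, recognising the action independently would select "surfing", whereas the joint model selects "cooking" because that action is far more probable in a home scene.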
Recognition of various semantic concepts in an image requires different feature types, such as Space-Time Interest Points (STIP), Histogram of Oriented Gradients (HOG), Scale Invariant Feature Transform (SIFT), and colour descriptors such as red, green, blue (RGB) histograms, colour moments and HSV SIFT (SIFT computed in the Hue, Saturation, Value (HSV) cylindrical colour space). This is because different feature types are designed to be more discriminative for different attributes of image data. Further, different feature types may provide different levels of invariance to variations in the image data. Using a combination of different feature types allows a system to take advantage of the differences in discriminative and invariant properties of the feature types when inferring semantic concepts. As a result, it is often advantageous to use multiple feature types even when recognising a single semantic concept in an image.
One recognition approach is to determine all the desired features and build a complete model before starting the inferencing process. Determining all the features, however, may be computationally expensive, as feature calculation may consume a significant portion of available computation resources. Determining all desired features may also significantly delay the recognition of semantic concepts in the image. When computation resources are limited, or fast recognition is required, recognition needs to be performed with a limited selection of feature types. Dynamically selecting and determining the feature types used for joint recognition of various semantic concepts may also be beneficial when the available recognition time or computation budget is not known in advance.
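By way of illustration only, recognition with a limited selection of feature types may be sketched as a simple greedy selection under a computation budget. The cost and gain values, the budget units and the function name `select_feature_types` are hypothetical and are assumed for the example. The sketch ranks each feature type once by its estimated gain per unit cost, so it is a naive baseline that does not account for interdependencies between concepts.

```python
def select_feature_types(candidates, budget):
    """Greedily choose feature types by estimated gain per unit cost.

    candidates: list of (name, cost, gain) tuples, where cost is an assumed
        computation cost (for example, milliseconds) and gain is an assumed
        estimate of the benefit to recognition accuracy.
    budget: total computation cost available, in the same units as cost.
    """
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    chosen = []
    remaining = budget
    for name, cost, _gain in ranked:
        if cost <= remaining:  # skip any feature type that no longer fits
            chosen.append(name)
            remaining -= cost
    return chosen


# Hypothetical costs and gains for four feature types.
FEATURE_TYPES = [
    ("HOG", 20.0, 0.30),
    ("SIFT", 50.0, 0.40),
    ("STIP", 120.0, 0.50),
    ("RGB histogram", 5.0, 0.10),
]

print(select_feature_types(FEATURE_TYPES, budget=80.0))
# ['RGB histogram', 'HOG', 'SIFT']
```

Because the ranking is fixed before any feature is determined, the selection cannot adapt to the concepts that are actually present in the scene.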
Another approach for selecting feature types for classification is to use a decision tree. Decision tree classifiers may be built with consideration of a feature cost, which measures the computational complexity of calculating a feature of a selected feature type. Decision trees, however, are not appropriate for exploiting the dependencies between different concepts. As a result, probabilistic joint classifiers often have better classification performance than decision trees.
An ensemble of classifiers built for a classification task may also be modified to include a feature cost which provides an estimate of the time required to calculate features of a particular type. Using an ensemble classifier, it is possible to select the classification feature types at inference time. However, similar to decision trees, ensemble classifiers do not exploit correlation between different concepts, and therefore often have inferior classification performance compared to probabilistic joint classifiers.
Due to interdependencies between different associated concepts, selecting and prioritising the features which are the most beneficial for a given joint recognition task is a difficult problem. For instance, content interdependency could affect the classification results differently, depending on the concepts which exist in a scene.
Thus, there exists a need for a method for dynamically selecting recognition features while also exploiting the interdependencies between different concepts.