Recognition of semantic information from visual context has been an important goal for research in image and video indexing. In recent years, the NIST TRECVID (National Institute of Standards and Technology, Text Retrieval Conference, Video; see “Trec video retrieval evaluation”, online at http://www-nlpir.nist.gov/projects/trecvid/) video retrieval evaluation has included a task on detecting high-level semantic features, such as locations, objects, people, and events, from the image content of videos. Such high-level semantic features, termed “concepts” in this application, have been found to be very useful in improving the quality of retrieval results in searching broadcast news videos.
A problem remains of improving concept detection accuracy. Semantic concepts usually do not occur in isolation: knowing the context of an image (e.g., outdoor) is expected to help in detecting other concepts (e.g., cars). For example, it is usually very hard to build a robust independent detector for “government leader”, because the concept covers different persons, seen from different views and against different backgrounds. However, it is relatively easy to determine whether the image contains a face or whether it shows an outdoor or indoor scene. These detection results form important context information that can be used to help detect “government leader” through a context-based model. Context-based concept fusion is such a framework: it incorporates inter-conceptual relationships to help detect individual concepts, and it proceeds in two steps. Given an input image, in the first step, independent detectors are applied to obtain initial estimates of the posterior probabilities of the concept labels. In the second step, these initial estimates are fed as features into a context-based model, which incorporates inter-conceptual relationships to refine the detection results.
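The two-step framework described above can be illustrated with a minimal sketch. It assumes that the first-step detector outputs are already available as posterior estimates, and it uses a small logistic-regression model as a stand-in for the context-based model; the concept ordering, the toy scores, and the function names `train_context_model` and `refine` are hypothetical and are not taken from any of the cited methods.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_context_model(context_vectors, labels, epochs=2000, lr=0.5):
    """Fit a logistic-regression context model by batch gradient descent.

    context_vectors: one tuple of first-step posteriors per image.
    labels: 1 if the target concept is truly present in the image, else 0.
    """
    dim = len(context_vectors[0])
    n = len(context_vectors)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(context_vectors, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = pred - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def refine(context_vector, weights, bias):
    """Second step: refine the target concept from all first-step posteriors."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, context_vector)) + bias)

# Step 1 (assumed already done): posteriors from independent detectors,
# ordered as (face, outdoor, government_leader). Toy values for illustration.
TRAIN_SCORES = [
    (0.90, 0.20, 0.50), (0.80, 0.30, 0.40), (0.85, 0.10, 0.45),
    (0.20, 0.90, 0.50), (0.10, 0.80, 0.45), (0.30, 0.70, 0.40),
]
TRAIN_LABELS = [1, 1, 1, 0, 0, 0]  # ground truth for "government leader"

# Step 2: learn how the context scores relate to the target concept.
WEIGHTS, BIAS = train_context_model(TRAIN_SCORES, TRAIN_LABELS)
```

In this toy data the independent “government leader” score is uninformative on its own, so the refined estimate from `refine` is driven by the face and outdoor scores, which is the kind of contextual help the framework is meant to provide.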
Based on this idea, several Context-Based Concept Fusion (CBCF) methods have been proposed. The Multinet approach [M. R. Naphade, et al., “A factor graph framework for semantic video indexing”, IEEE Trans on CSVT, 12(1): 40-52, 2002] models the correlation between concepts with a factor graph and uses loopy probability propagation to modify the detection of each concept based on the detection confidence of other concepts. Because the joint probabilities of pairwise concepts are used as the functions on the function nodes, and a large amount of data is needed to estimate them, performance suffers when training samples are limited. In [S. Paek and S.-F. Chang, “Experiments in Constructing Belief Networks for Image Classification Systems”, Proc. ICIP, Vancouver, Canada, September 2000], models based on Bayesian Networks are used to capture the statistical interdependence among concepts present in consumer photographs. The Discriminative Model Fusion (DMF) method [J. Smith et al., “Multimedia semantic indexing using model vectors”, Proc. ICME, vol. 3, pp. 445-448, 2003] generates a model vector from the detection scores of the individual detectors, and a support vector machine (also referred to herein as an “SVM”) is then trained to refine the detection of the original concepts. This method needs an extra training set to train the context-based classifiers, and its performance also suffers with limited training samples.
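The data requirement noted above for pairwise joint probabilities can be made concrete with a short sketch. It estimates a 2x2 joint table for every pair of binary concept labels from co-occurrence counts. This is not the Multinet implementation; the function name `pairwise_joint_tables` and the additive `smoothing` pseudo-count are assumptions made for illustration only.

```python
from itertools import combinations

def pairwise_joint_tables(label_rows, smoothing=1.0):
    """Estimate P(c_i = a, c_j = b) for every concept pair (i, j), i < j.

    label_rows: one tuple of 0/1 ground-truth labels per training image.
    smoothing: additive pseudo-count per table cell (an assumption of
    this sketch, used to avoid zero probabilities with few samples).
    """
    num_concepts = len(label_rows[0])
    total = len(label_rows) + 4 * smoothing
    tables = {}
    for i, j in combinations(range(num_concepts), 2):
        counts = {(a, b): smoothing for a in (0, 1) for b in (0, 1)}
        for row in label_rows:
            counts[(row[i], row[j])] += 1
        tables[(i, j)] = {cell: c / total for cell, c in counts.items()}
    return tables
```

For K concepts there are K(K-1)/2 such tables with four cells each, so every cell must be estimated from the same limited pool of labeled images. With few samples, many cells receive zero or near-zero counts, which is one way to see why estimates of this kind become unreliable.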
However, the results reported so far indicate that not all concepts benefit from the CBCF strategy. As reported in A. Amir, et al., “IBM research TRECVID-2003 video retrieval system”, Proc. NIST Text Retrieval Conf. (TREC), 2003, no more than 8 out of 17 concepts gain a performance improvement from CBCF. The lack of consistent performance gain could be attributed to several causes: (1) insufficient data for learning reliable relations among concepts, (2) unreliable detectors, and (3) the scale and complexity of the concept relations. Interestingly, results in Paek and Chang, supra, suggest that user-provided labels are much more effective than automatically detected labels in helping to infer other concepts.
It would thus be desirable to provide a method for effectively improving the accuracy of semantic concept detection in images and videos.