Video stories, such as those found in the video news domain, are typically reported as separate episodes over time and across different channels, with each episode comprising a sequence of related video segments and each video segment comprising video imagery taken from a particular vantage point. A first step toward the effective indexing and retrieval of all related episodes across all times and all channels is the automatic annotation of their individual video segments with feature labels that describe the visual features shown in the video segments. A number of efforts have been made to derive and evaluate visual feature ontologies for use in labeling video segments. Perhaps the most well-developed ontology is the Large Scale Concept Ontology for Multimedia Understanding (LSCOM) described in, for example, A. G. Hauptmann, “Towards a Large Scale Concept Ontology for Broadcast Video,” Proceedings of International Conference on Image and Video Retrieval, July 2004, pp. 674-675.
Nevertheless, the automatic annotation of video segments with feature labels remains inexact. One method of measuring the precision of such labeling is Average Precision (AP). AP is defined as the average of the instantaneous precisions of a sequence of experiments. Each experiment retrieves new candidate video segments one by one until a new correctly labeled segment is found. What is considered correct is determined by reference to feature labels manually assigned to the video segments by one or more persons who have previously viewed the video segments. Instantaneous precision is then defined as the number of correctly labeled video segments found so far (which increases by exactly one with each experiment) divided by the total number of retrievals in all experiments so far (which includes all the errors of this and all prior experiments). Early errors of retrieval therefore continue to severely penalize subsequent experiments. Some visual features, such as “Person,” “Face,” and “Outdoor,” can be detected in isolated video segments with an AP much greater than 90%. However, AP quickly drops as features become less common, in part because less training data is available. For example, the AP for “Building” is typically less than 50%, and most of the rarer visual features, such as “Police-Security” or “Prisoner,” typically have an AP in the low single digits.
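By way of illustration only, the AP computation described above may be sketched as follows. The function name average_precision and the representation of a retrieval run as a ranked list of true/false values (true where the retrieved segment's label agrees with the manual annotation) are assumptions made for this example and are not taken from any particular system. The sketch also assumes that the average is taken over the correctly labeled segments actually retrieved in the list.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average Precision over a ranked retrieval list (illustrative sketch).

    relevance[i] is True when the (i+1)-th retrieved segment is correctly
    labeled according to the manual annotation (hypothetical input format).
    """
    correct = 0
    precisions = []
    for rank, is_correct in enumerate(relevance, start=1):
        if is_correct:
            # One "experiment" ends here: a new correct segment was found.
            correct += 1
            # Instantaneous precision: correct so far / all retrievals so far.
            precisions.append(correct / rank)
    # Assumption: a run with no correct segments is given an AP of 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


if __name__ == "__main__":
    # Errors at the end of the list do not affect AP.
    print(average_precision([True, True, True, False, False]))   # 1.0
    # The same two errors at the start depress every later precision.
    print(average_precision([False, False, True, True, True]))   # about 0.478
```

The two runs in the usage example contain the same number of correct and incorrect retrievals. Only the position of the errors differs, which shows how early retrieval errors continue to penalize all subsequent experiments.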
As a result, there is a need for methods and apparatus for improving precision in the automatic annotation of video segments with feature labels that indicate the visual features shown in the video segments.