1. Field of the Invention
The present invention relates to a method for automatic annotation of digital media.
2. Description of Related Art
Much current multimedia analysis research focuses on information retrieval for video content. Companies such as Yahoo! and Google are extending their text-based search capabilities to video data. For content indexing, these systems are often limited to video with available time-aligned text, or rely on the link structure and text of the web pages containing the videos. Video retrieval has also been the focus of the highly successful TRECVID workshops (P. Over, T. Ianeva, W. Kraaij, A. Smeaton, TRECVID 2005, “An Overview”, Proc. TRECVID 2005, http://www.nlpir.nist.gov/projects/tvpubs/tv5.papers/tv5overview.pdf last visited Nov. 1, 2006). In the TRECVID evaluations, the use of visual information is emphasized; however, extracting semantics from visual data in the absence of textual descriptors remains a major open problem.
Recent work to address this semantic gap has concentrated on ontology-based approaches to semantic feature extraction (A. Hauptmann, “Towards a large scale concept ontology for broadcast video”, in Proc. of the Third Conf. on Image and Video Retrieval, ser. Lecture Notes in Computer Science, vol. 3115, Springer, pp. 674-675, 2004; and L. Hollink, M. Worring, and G. Schreiber, “Building a visual ontology for video retrieval”, in MULTIMEDIA '05: Proceedings of the 13th annual ACM international conference on Multimedia, ACM Press, 2005). In the ontology-based approaches, a “basis” set of binary classifiers is built to determine whether a video shot exhibits a specific semantic feature. The classification outputs are combined statistically to provide higher-level analysis and to enhance indexing and retrieval. Many of these approaches operate at the shot level following an initial segmentation. This is desirable for computational efficiency, for dynamic analysis of local sets of frames, and for extraction of semantic features that exhibit some temporal duration.
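By way of illustration only, the statistical combination of basis classifier outputs described above may be sketched as follows. The sketch assumes a logistic (weighted-sum) combination, which is one common choice and is not asserted to be the method of the cited works. The function names, the feature names, the weights, and the bias are all hypothetical.

```python
import math


def combine_basis_scores(scores, weights, bias=0.0):
    """Combine per-shot basis classifier scores into one higher-level score.

    scores  -- dict mapping a semantic feature name to a detector confidence
               in [0, 1] for a single video shot (hypothetical detector outputs)
    weights -- dict mapping a feature name to an assumed, pre-learned weight
    bias    -- assumed offset of the logistic model
    """
    # Weighted sum over the basis features; a feature missing from the
    # shot's scores contributes 0.0.
    z = bias + sum(w * scores.get(name, 0.0) for name, w in weights.items())
    # Logistic function maps the sum to a score in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def rank_shots(shot_scores, weights, bias=0.0):
    """Return shot identifiers ordered from highest to lowest combined score."""
    return sorted(
        shot_scores,
        key=lambda shot_id: combine_basis_scores(shot_scores[shot_id], weights, bias),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical weights for a higher-level concept such as "beach scene".
    weights = {"outdoor": 2.0, "water": 3.0, "face": -1.0}
    # Hypothetical basis classifier outputs for three shots.
    shots = {
        "shot_1": {"outdoor": 0.9, "water": 0.8, "face": 0.1},
        "shot_2": {"outdoor": 0.2, "water": 0.1, "face": 0.9},
        "shot_3": {"outdoor": 0.7, "water": 0.3, "face": 0.4},
    }
    print(rank_shots(shots, weights, bias=-2.0))
```

Because the combination operates on one shot at a time, it fits the shot-level processing noted above: the basis classifiers run once per shot after segmentation, and any number of higher-level concepts can then be scored from the same stored outputs.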
Manual tags are now proliferating on various shared video and image data sites such as Flickr and YouTube. While this information is of tremendous value for video indexing, including for refining and training automatic systems, it also exhibits a number of shortcomings. For example, lengthy videos can have tags that apply only to a small (sometimes unidentified) portion of the video. Also, the classic problems of polysemy and synonymy described in the text categorization context are inherited when aggregating tag data for multimedia categorization (M. W. Berry, S. T. Dumais, and G. W. O'Brien, “Using linear algebra for intelligent information retrieval”, SIAM Review 37(4):573-595, 1995).