Managing unstructured data content and information increasingly requires some type of semantic detection and indexing capability. Consequently, typical state of the art content management systems rely more and more on machine learning and classification techniques. These techniques rely to varying degrees on human intervention to construct the detector (i.e., to teach the system how to classify) prior to use and, sometimes, during use. Machine learning and classification techniques may be categorized as supervised, semi-supervised or unsupervised. Supervised machine learning and classification begins, for example, by iteratively classifying known examples or labeled exemplars. Semi-supervised machine learning and classification uses both labeled exemplars and unlabeled exemplars. Unsupervised machine learning and classification uses unlabeled exemplars. However, whether supervised or unsupervised, such techniques typically rely on human intervention or feedback to train the classifier to arrive at an acceptable result.
Whether supervised or unsupervised, learning and classification techniques may require considerable supervision while the semantic detector is being constructed, but may not need a learning component during detection. Well known relevance feedback type techniques may be characterized as non-persistent lightweight binary classifiers that use incremental training to improve classification/retrieval performance. The accuracy of a relevance feedback classifier depends upon the number of exemplars provided, the level of feedback the classifier receives and the amount of time expended in training. Statistical semantic modeling, for example, has significantly reduced the level of manual supervision needed relative to older relevance feedback techniques, moving from lightweight classifiers to heavyweight classifiers. Unfortunately, with these prior art techniques, training the classifier can be a time consuming and expensive proposition. These techniques consume large amounts of precious annotation time and require considerable annotation effort during training to achieve acceptable annotation quality. As a result, it has become increasingly important to reduce human intervention in machine learning and classification, especially for state of the art media indexing and retrieval.
Consequently, to reduce human intervention time, disambiguation has been widely applied during annotation. Further, active learning, in which the system takes a pro-active role in selecting samples during annotation, has maximized disambiguation and reduced the number of samples that need to be annotated by an order of magnitude. See, e.g., M. Naphade et al., “Learning to Annotate Video Databases,” Proc. IS&T/SPIE Symp. on Electronic Imaging: Science and Technology - Storage & Retrieval for Image and Video Databases X, San Jose, Calif., January 2002. An orthogonal approach for concepts with regional support, known as multiple instance learning, accepts annotations at coarser granularity. For example, a user can build a model for a regional concept (e.g., the sky) by selecting the region in an image that corresponds to the regional label. Once the regional concepts have been selected, the system learns, from several possible positively and negatively annotated examples, how to represent the concept using regional features. See, e.g., A. L. Ratan, O. Maron, W. E. L. Grimson, and T. Lozano-Pérez, “A framework for learning query concepts in image classification,” Proc. CVPR, pp. 423-429, 1999.
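The sample selection idea behind active learning, as described above, can be illustrated with a minimal sketch. It is not the method of the cited reference; it assumes a hypothetical one-dimensional feature, a simple threshold classifier, and an `oracle` function standing in for the human annotator. All function and parameter names are illustrative assumptions. In each round the system asks the annotator to label only the sample closest to the current decision boundary, i.e., the most ambiguous one.

```python
def fit_threshold(labeled):
    """Place the boundary midway between the largest negative and the
    smallest positive labeled value (needs one example of each class)."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learn(pool, oracle, budget):
    """Uncertainty sampling over a pool of unlabeled 1-D samples.

    Assumption for this sketch: the smallest pool value is a negative
    and the largest is a positive, so the two seeds cover both classes.
    Returns the learned threshold and the number of labels requested.
    """
    pool = sorted(pool)
    labeled = [(pool[0], oracle(pool[0])), (pool[-1], oracle(pool[-1]))]
    unlabeled = pool[1:-1]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # Query the sample the current classifier is least sure about.
        query = min(unlabeled, key=lambda v: abs(v - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)


if __name__ == "__main__":
    pool = [i / 100 for i in range(101)]
    annotator = lambda x: int(x >= 0.37)  # hypothetical ground truth
    threshold, labels_used = active_learn(pool, annotator, budget=10)
    print(threshold, labels_used)
```

In this toy setting the annotator is consulted for 12 of the 101 pool samples (two seeds plus ten queries), yet the learned threshold classifies every pool sample as the annotator would. Real descriptors are high-dimensional and noisy, so the saving in practice depends on the data and the classifier.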
Other useful tools include cross descriptor learning with multimodal semantic concept detection. See, e.g., Naphade et al., “Probabilistic Multimedia Objects (Multijects): A Novel approach to Indexing and Retrieval in Multimedia Systems,” Proceedings of IEEE International Conference on Image Processing, vol. 3, pp. 536-540, October 1998, Chicago, Ill. For a semi-supervised example, where unlabeled exemplars are used in conjunction with labeled exemplars for classification, see Naphade et al., “Classification using a Set of Labeled and Unlabeled Images,” SPIE Photonics East, Internet Multimedia Management Systems, vol. 4210, pp. 13-24, Boston, Mass., November 2000. Also, unlabeled exemplars with multiple descriptors have been used with labeled exemplars in what is known as single view sufficiency. Single view sufficiency is useful when each descriptor is sufficient by itself for learning and for representing the metadata model. See, e.g., Blum et al., “Combining labeled and unlabeled data with co-training,” Proceedings of Conference on Computational Learning Theory, pp. 92-100, 1998. Unfortunately, single view sufficiency requires making simplistic and unrealistic assumptions, i.e., that each descriptor in itself sufficiently represents the metadata and that all descriptors agree with each other in terms of the metadata characteristics. Descriptors for unstructured data (such as for reality based exemplars that support multiple descriptors, e.g., video, text and images) seldom satisfy single view sufficiency requirements. So, because of the constraints it imposes, single view sufficiency has not been particularly useful on unstructured data and information. Consequently, these approaches all require some manual intervention in enriching metadata in the unlabeled exemplars, even for unlabeled exemplars that can be described using multiple descriptors.
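To make the single view sufficiency assumption concrete, the following is a minimal co-training style sketch. It is an illustration only, not the algorithm of the cited reference: the nearest-centroid classifier, the two one-dimensional descriptors ("views") and all names are assumptions chosen for brevity. Each view trains its own classifier on the shared labeled set, labels the unlabeled exemplar it is most confident about, and adds it to the labeled set that the other view then trains on.

```python
def centroids(labeled, view):
    """Mean value of one descriptor (view) per class.
    Assumes at least one labeled exemplar of each class."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for features, label in labeled:
        sums[label] += features[view]
        counts[label] += 1
    return {c: sums[c] / counts[c] for c in (0, 1)}


def predict(cents, value):
    """Nearest-centroid label plus a confidence score, taken here as the
    difference between the distances to the two class centroids."""
    d0 = abs(value - cents[0])
    d1 = abs(value - cents[1])
    return (0 if d0 < d1 else 1), abs(d0 - d1)


def co_train(labeled, unlabeled, rounds, per_round=1):
    """Alternate between views; each view labels its most confident
    unlabeled exemplars and adds them to the shared labeled set."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                break
            cents = centroids(labeled, view)
            ranked = sorted(
                unlabeled,
                key=lambda f: predict(cents, f[view])[1],
                reverse=True,
            )
            for features in ranked[:per_round]:
                label, _ = predict(cents, features[view])
                labeled.append((features, label))
                unlabeled.remove(features)
    return labeled


if __name__ == "__main__":
    seed = [((0.0, 0.1), 0), ((1.0, 0.9), 1)]
    pool = [(0.1, 0.0), (0.9, 1.0), (0.2, 0.3), (0.8, 0.7)]
    for features, label in co_train(seed, pool, rounds=2):
        print(features, label)
```

The sketch labels the toy pool correctly only because each descriptor separates the two classes on its own and the two descriptors agree. When one view is not sufficient by itself, its confident but wrong labels are passed to the other view, which is the limitation noted above for unstructured data.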
Thus, there is a need for a system and method that is unconstrained by the restrictions of single view sufficiency and independent of the apparatus used for generating initial labels over the unlabeled exemplars and, further, for a system and method for developing cross feature learning on unlabeled exemplars.