1. Field of the Invention
The invention relates in general to classification of objects based upon information from multiple information sources and, more particularly, to modality identification and classification fusion.
2. Description of the Related Art
Many real-world objects, such as biometric data and video, may be represented by features from multiple modalities. For example, videos traditionally are represented by features embedded in the visual, audio and caption-text tracks. Also, for example, biometric data may be collected from multiple sources including face, fingerprint, voice and DNA to identify a person. These features are extracted and then fused in a complementary way to understand the semantics of a target object.
Unfortunately, there have been shortcomings with prior approaches to modality identification. A first approach to modality identification uses only one dimension, and does not require the fusion step. A second approach to modality identification treats each information source as one modality, and does require a fusion step.
The first approach to modality identification, which uses only one dimension, may suffer from the “curse of dimensionality.” Goh et al., SVM binary classifier ensembles for multi-class image classification, ACM International Conference on Information and Knowledge Management (CIKM), 2001, used the raw color and texture features to form a high-dimensional feature vector for each image. Recently, statistical methods such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA) have been widely used in the Computer Vision, Machine Learning, and Signal Processing communities to denoise data and to identify independent information sources. See M. L. Cascia, et al., Combining textual and visual cues for content-based image retrieval on the world wide web, IEEE Workshop on Content-based Access of Image and Video Libraries, 1998; L. Hansen, et al., On independent component analysis for multimedia signals, Multimedia Image and Video Processing, CRC Press, 2000; T. Kolenda, et al., Independent component analysis for understanding multimedia content, In Proc. of IEEE Workshop on Neural Networks for Signal Processing, 2002; A. Vinokourov, et al., Learning the semantics of multimedia content with application to web image retrieval and classification, In Proceedings of Fourth International Symposium on Independent Component Analysis and Blind Source Separation, 2003; and T. Westerveld, Image retrieval: Content versus context, Content-Based Multimedia Information Access, RIAO, 2000. In the multimedia community, it has been observed that audio and visual data of a video stream exhibit some statistical regularity, and that regularity can be explored for joint processing. See J. Hershey et al., Using audio-visual synchrony to locate sounds, Advances in Neural Information Processing Systems 12, MIT Press, Cambridge Mass., 2001; and J. W. Fisher III, et al., Learning joint statistical models for audio-visual fusion and segregation, Advances in Neural Information Processing Systems 13, MIT Press, Cambridge Mass., 2000. Smaragdis et al., Audio/visual independent components, International Symposium on Independent Component Analysis and Blind Source Separation, 2003, proposed to operate on a fused set of audio/visual features and to look for combined subspace components amenable to interpretation. Vinokourov et al., Inferring a semantic representation of text via cross-language correlation analysis, In Advances of Neural Information Processing, 2002, found a common latent/semantic space from multi-language documents using independent component analysis for cross-language document retrieval. A shortcoming of these prior teachings is that the curse of dimensionality arises, causing ineffective feature-to-semantics mapping and inefficient indexing. See Y. Rui, et al., Image retrieval: Past, present, and future, International Symposium on Multimedia Information Processing, 1997. This phenomenon has been termed the dimensionality curse because it can severely hamper the effectiveness of data analysis. See R. Bellman, Adaptive control processes, Princeton, 1961.
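One commonly cited symptom of the dimensionality curse is that distances between points become nearly indistinguishable as the number of features grows, which weakens distance-based indexing and retrieval. The following is a minimal sketch of that effect, assuming points drawn uniformly at random from the unit cube; the function name `relative_contrast`, the sample size, and the seed are illustrative choices and are not taken from the cited works.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points in the unit cube [0, 1]^dim."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # The contrast between the nearest and farthest neighbor shrinks
    # as the dimension of the single concatenated feature vector grows.
    for dim in (2, 20, 200, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

A small relative contrast means that the nearest and the farthest neighbor of a query lie at almost the same distance, so a single high-dimensional feature vector carries little discriminating information per distance comparison.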
The second approach to modality identification, which treats each information source as one modality, may suffer from inaccuracies due to inter-dependencies between sources. This second approach treats the features as m modalities, with d_i features in the i-th modality (i = 1, ..., m). Much work in image and video retrieval analysis employs this approach. For example, the QBIC system supported image queries based on combining distances from the color and texture modalities. See M. Flickner et al., Query by image and video content: the QBIC system, 1997. Velivelli et al., Detection of documentary scene changes by audio-visual fusion, In Proceedings of International Conference on Image and Video Retrieval, 2003, separated video features into audio and visual modalities. Adams et al., IBM research TREC-2002 video retrieval system, also regarded each media track (visual, audio, textual, etc.) as one modality. For each modality, these works trained a separate classification model, and then used the weighted-sum rule to fuse a class-prediction decision. This modality-decomposition method can alleviate the “curse of dimensionality.” However, since media sources are treated separately, the inter-dependencies between sources may be left unexplored.
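The modality-decomposition approach described above can be written compactly as follows. This is a sketch in notation introduced here for illustration: the symbols x_i, f_i, w_i and d, and the normalization of the weights, are assumptions; the description itself specifies only m, d_i and the weighted-sum rule.

```latex
\[
  x = (x_1, \dots, x_m), \qquad x_i \in \mathbb{R}^{d_i}, \qquad \sum_{i=1}^{m} d_i = d,
\]
\[
  f(x) = \sum_{i=1}^{m} w_i \, f_i(x_i), \qquad w_i \ge 0, \qquad \sum_{i=1}^{m} w_i = 1 .
\]
```

Each per-modality model f_i sees only its own d_i features, which is why the dimensionality problem is eased. It is also why any dependence between x_i and x_j can influence the fused score only through the fixed weights w_i.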
There also have been shortcomings with fusion of classification data for different modalities. Given that D modalities have been obtained, there is a need for D classifiers, one to interpret the data for each modality. The challenge is to fuse the D classifiers to provide an overall classification. The fusion challenge is compounded because the D modalities typically are not entirely independent of each other. For example, PCA and ICA often cannot perfectly identify independent components for at least two reasons. First, well-known ICA algorithms (e.g., fixed-point algorithm, Infomax, kernel canonical analysis, and kernel independent analysis) generally require a good estimate of the number of independent components k to find them effectively. Second, ICA typically performs a best attempt under some error-minimization criteria to find k independent components. Nevertheless, the resulting components, as shown, still may exhibit inter-dependencies.
Various fusion strategies for multimodal information have been presented including product combination, weighted-sum, voting, and min-max aggregation. Among them, product combination and weighted-sum appear to be the most popular fusion methods. Unfortunately, there are significant problems with each of these approaches to fusion.
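The fusion strategies named above can be sketched side by side. The example below assumes each of D modality classifiers outputs a posterior probability per class; the function names, the equal weights, and the sample posteriors are hypothetical, and "min-max aggregation" is read here as taking the per-class minimum or maximum across modalities, which is one common interpretation rather than a definition given in the text.

```python
from collections import Counter
from math import prod


def product_rule(posteriors):
    """Multiply per-modality posteriors class by class, then renormalize."""
    n_classes = len(posteriors[0])
    scores = [prod(p[c] for p in posteriors) for c in range(n_classes)]
    total = sum(scores)
    return [s / total for s in scores]


def weighted_sum_rule(posteriors, weights):
    """Linear combination of per-modality posteriors."""
    n_classes = len(posteriors[0])
    return [sum(w * p[c] for w, p in zip(weights, posteriors))
            for c in range(n_classes)]


def voting_rule(posteriors):
    """Each modality votes for its top class; return the vote share per class."""
    n_classes = len(posteriors[0])
    votes = Counter(p.index(max(p)) for p in posteriors)
    return [votes[c] / len(posteriors) for c in range(n_classes)]


def min_rule(posteriors):
    """Per-class minimum across modalities (most pessimistic modality)."""
    return [min(p[c] for p in posteriors) for c in range(len(posteriors[0]))]


def max_rule(posteriors):
    """Per-class maximum across modalities (most confident modality)."""
    return [max(p[c] for p in posteriors) for c in range(len(posteriors[0]))]


def decide(scores):
    """Index of the class with the highest fused score."""
    return scores.index(max(scores))


if __name__ == "__main__":
    # Hypothetical posteriors from D = 3 modalities over 2 classes.
    posteriors = [[0.60, 0.40], [0.55, 0.45], [0.05, 0.95]]
    weights = [1 / 3, 1 / 3, 1 / 3]
    print("product     ->", decide(product_rule(posteriors)))
    print("weighted sum->", decide(weighted_sum_rule(posteriors, weights)))
    print("voting      ->", decide(voting_rule(posteriors)))
    print("min         ->", decide(min_rule(posteriors)))
    print("max         ->", decide(max_rule(posteriors)))
```

In this toy input two modalities weakly favor class 0 while one strongly favors class 1, so voting selects class 0 and the other rules select class 1. The rules are therefore not interchangeable even on the same classifier outputs.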
The product-combination rule is an optimal fusion model from the Bayesian perspective, assuming that the D modalities are independent of each other and that the posterior probability can be accurately estimated for each modality. Unfortunately, however, the D modalities likely will not be truly independent, and the posterior probability ordinarily cannot be estimated with high accuracy. The work of D. M. Tax et al., Combining multiple classifiers by averaging or by multiplying, Journal of the Pattern Recognition, 33, 2000, concluded that the product-combination rule works well only when the posterior probability of individual classifiers can be accurately estimated.
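The Bayesian argument behind the product-combination rule can be sketched as a short derivation. The notation (class ω_k, per-modality feature vectors x_1, ..., x_D) is introduced here for illustration, and the independence assumption is taken to mean conditional independence of the modalities given the class.

```latex
\begin{align*}
P(\omega_k \mid x_1, \dots, x_D)
  &= \frac{P(\omega_k)\, p(x_1, \dots, x_D \mid \omega_k)}{p(x_1, \dots, x_D)}
   = \frac{P(\omega_k) \prod_{i=1}^{D} p(x_i \mid \omega_k)}{p(x_1, \dots, x_D)}
   && \text{(conditional independence)} \\
  &= \frac{P(\omega_k) \prod_{i=1}^{D} \dfrac{P(\omega_k \mid x_i)\, p(x_i)}{P(\omega_k)}}{p(x_1, \dots, x_D)}
   && \text{(Bayes' rule per modality)} \\
  &\propto P(\omega_k)^{-(D-1)} \prod_{i=1}^{D} P(\omega_k \mid x_i).
\end{align*}
```

With equal class priors the fused posterior is proportional to the plain product of the per-modality posteriors. Because the terms multiply, a single badly estimated P(ω_k | x_i) close to zero suppresses class ω_k regardless of what the other modalities report, which is consistent with the sensitivity to estimation accuracy noted above.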
The weighted-sum strategy is more tolerant to noise because a sum does not magnify noise as severely as a product does. Unfortunately, however, weighted-sum is a linear model, not equipped to explore the inter-dependencies between modalities. Recently, Yan and Hauptmann, The combination limit in multimedia retrieval, ACM Multimedia, 2003, presented a theoretical framework for bounding the average precision of a linear combination function in video retrieval. The framework concluded that linear combination functions have limitations, and suggested that non-linearity and cross-media relationships should be introduced to achieve better performance.
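The limitation of a linear fusion model can be illustrated with a toy case in which the class depends on the interaction between two modalities rather than on either one alone. The sketch below is not the framework of the cited work; the XOR-style data, the grid of weights and thresholds, and the function names are all hypothetical and chosen only to make the point concrete.

```python
from itertools import product

# Four (score_1, score_2) pairs from two modalities with XOR-style labels:
# the class is 1 only when exactly one modality fires.
SAMPLES = [((0.0, 0.0), 0), ((0.0, 1.0), 1), ((1.0, 0.0), 1), ((1.0, 1.0), 0)]


def accuracy(predict):
    """Fraction of SAMPLES that the fusion function labels correctly."""
    return sum(predict(s) == y for s, y in SAMPLES) / len(SAMPLES)


def best_linear_accuracy(steps=21):
    """Grid-search weights and threshold of a weighted-sum decision rule."""
    grid = [-1.0 + 2.0 * i / (steps - 1) for i in range(steps)]
    best = 0.0
    for w1, w2, t in product(grid, repeat=3):
        acc = accuracy(lambda s: int(w1 * s[0] + w2 * s[1] > t))
        best = max(best, acc)
    return best


def nonlinear_fusion(s):
    """Fusion with a cross term s1 * s2 that captures the inter-dependency."""
    return int(s[0] + s[1] - 2.0 * s[0] * s[1] > 0.5)


if __name__ == "__main__":
    print("best weighted-sum accuracy:", best_linear_accuracy())
    print("fusion with cross term:    ", accuracy(nonlinear_fusion))
```

No choice of weights and threshold labels all four samples correctly, because the two classes are not linearly separable in the score space, whereas a single cross-modality term suffices.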
Thus, there has been a need for improvements to modality identification and for improvements in classification fusion. The present invention meets these needs.