The goal of data fusion is to obtain a discriminative representation of observations that are generated from multiple heterogeneous information sources (commonly called multi-view data or multimodal data). Multimodal data provides rich information that may not be obtained through a single input source alone. Data fusion has numerous real-world applications, such as multimodal sentiment analysis, multimedia information retrieval, and enterprise social network analysis. The machine learning community wishes to leverage the richness of information contained in multimodal data in order to understand complex statistical relationships among the variables of interest and to improve performance on various real-world prediction tasks.
One such real-world application of data fusion is the automatic recognition of human emotion, which has numerous applications in intelligent user interfaces and human-computer interaction. Building a system for automatic recognition of human emotion, however, presents several notable sub-challenges, from signal acquisition, to signal processing, to the interpretation of the processed signals. Part of the difficulty comes from the fact that humans express emotions via multiple modalities, such as facial expressions, body gestures, and voice (both via its verbal content and prosody). This raises questions about when and how to combine the data from these multiple modalities, a problem commonly referred to as multimodal data fusion. Dealing with multimodal data is difficult for several reasons. First, each modality may have distinctive statistical properties (e.g., means and variances), while a set of modalities may exhibit shared statistical patterns (e.g., correlation and interaction). Thus, it is important to model modality-private and modality-shared patterns appropriately. Second, modalities can be redundant, complementary, and/or contradictory. Third, only some modalities may carry useful information at a given time, and the structure of relevant modalities may change over time. Thus, techniques are needed to learn some structural information of the modalities and adaptively select only the relevant modalities (or subsets of variables from each modality).
Recently, there has been a surge of interest in using a sparse representation of input signals for multimodal data fusion. This technique assumes that the input space is inherently sparse and represents the input as a linear combination of atoms from an over-complete dictionary, subject to the constraint that the linear coefficients be sparse, i.e., only a few atoms are used to reconstruct the input signal. The dictionary is either manually constructed (e.g., wavelets) or automatically learned from the data. Recent studies have shown that dictionary learning has several advantages over manual construction of a dictionary, including improved performance in real-world problems.
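By way of illustration only, the sparse representation described above can be sketched as an L1-penalized least-squares problem solved by iterative soft-thresholding (ISTA). This is a minimal sketch under stated assumptions, not the method of any cited work: the solver choice, the function names `soft_threshold` and `sparse_code`, the penalty weight `lam`, and the random dictionary are all hypothetical and chosen for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Element-wise shrinkage: the proximal operator of t * ||v||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(D, x, a, lam):
    """0.5 * ||x - D a||^2 + lam * ||a||_1."""
    return 0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a))

def sparse_code(D, x, lam=0.1, n_iter=500):
    """Approximate argmin_a objective(D, x, a, lam) with ISTA.

    D is an (n_features, n_atoms) dictionary with n_atoms > n_features
    (over-complete); x is a single input signal of length n_features.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)            # gradient of the reconstruction term
        a = soft_threshold(a - step * grad, lam * step)
    return a

# Hypothetical data: 50 random unit-norm atoms in a 20-dimensional input space.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [1.5, -2.0, 1.0]      # signal built from only three atoms
x = D @ a_true

a_hat = sparse_code(D, x)
print("non-zero coefficients:", np.count_nonzero(a_hat))
```

The soft-thresholding step sets small coefficients exactly to zero, which is what yields a representation that uses only a few atoms. In dictionary learning, the dictionary `D` would itself be updated from the data rather than fixed as it is here.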
In the unimodal setting, Meinshausen et al., “High Dimensional Graphs and Variable Selection with the Lasso,” The Annals of Statistics 34.3, 1436-1462 (2006) proposed an approach based on LASSO to automatically select relevant variables out of redundant ones. That work, however, did not assume any group structure among the input variables, making it unsuitable for multimodal data fusion. For a discussion of LASSO, see R. Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society Series B, 58(1), 267-288 (1996) (hereinafter “Tibshirani”).
In the multimodal setting, Jia et al., “Factorized Latent Spaces with Structured Sparsity,” Twenty-Fourth Annual Conference on Neural Information Processing Systems, 982-990 (December 2010) (hereinafter “Jia”) apply group sparsity to learn a multimodal dictionary in which some dictionary atoms correspond to the subspace that is shared across views (modality-shared latent subspaces), while other atoms correspond to the subspace private to each view (modality-private latent subspaces). It is proffered in Jia that the sparse representation obtained using their method shows superior performance in multi-view human pose estimation, which led them to suggest that it captures complex interdependencies in multimodal data. However, while the techniques presented in Jia take into account the group structure of the input variables (i.e., view definitions), they cannot incorporate the structural information among the groups (i.e., view-to-view relations).
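To make the notion of group sparsity concrete, the following is a minimal sketch of group soft-thresholding, the proximal operator of a sum of per-group Euclidean norms. It is not taken from Jia; the function name `group_soft_threshold`, the grouping of six coefficients into three hypothetical views, and the threshold value are assumptions made for illustration.

```python
import numpy as np

def group_soft_threshold(a, groups, t):
    """Shrink each group of coefficients toward zero as a unit.

    A group whose Euclidean norm is at most t is zeroed entirely;
    otherwise the whole group is scaled by (1 - t / norm).
    """
    out = np.zeros_like(a, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(a[idx])
        if norm > t:
            out[idx] = (1.0 - t / norm) * a[idx]
    return out

# Hypothetical coefficients for three views, two variables per view.
a = np.array([3.0, 4.0, 0.1, -0.2, 0.0, 2.0])
groups = [[0, 1], [2, 3], [4, 5]]

shrunk = group_soft_threshold(a, groups, t=1.0)
print(shrunk)
```

Unlike element-wise shrinkage, this operator keeps or discards all variables of a view together, which is how a group-sparse penalty respects view definitions. Note that each group is treated independently: nothing in this operator encodes relations between views.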
Therefore, techniques for multimodal data fusion that can leverage this information about the structure among the groups would be desirable, since such information is necessary to achieve sparsity over multiple heterogeneous input sources and thereby enable selection of the few combinations of input views that best explain the task at hand.