The invention relates to the field of machine learning.
Multi-modal representation of data may be beneficial in several machine learning tasks, such as image captioning, visual question answering, multi-lingual data retrieval, and electronic document classification. This is because, in many instances, an amalgamation of multiple views of an input sample is likely to capture more meaningful information than a representation that accounts for only a single modality. For example, in the task of scene recognition in a video, video data generally is comprised of video frames (images) along with audio. Images and audio thus comprise two different representations of the same input sample, each with different representative features. By combining these two modalities into a common subspace, classification of abstract scenes from the video can become more accurate.
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.