This application relates to joint acoustic and visual processing, and more particularly relates to processing related audio and image data for purposes including cross-modality query and retrieval.
Conventional automatic speech recognition (ASR) generally makes use of large amounts of training data and expert knowledge. These resources may take the form of audio with parallel transcriptions for training acoustic models, collections of text for training language models, and linguist-crafted lexicons mapping words to their pronunciations. The cost of accumulating these resources is immense, so it is no surprise that very few of the more than 7,000 languages spoken across the world support ASR (at the time of writing, the Google Speech API supports approximately 80).
Some approaches to speech recognition attempt to make use of speech that has not been transcribed or otherwise annotated according to its content. For example, some approaches attempt to infer the set of acoustic units (e.g., analogous to phonemes). In recent years, there has been much work in the speech community towards developing completely unsupervised techniques that can learn the elements of a language solely from untranscribed audio data. For example, some approaches have enabled the discovery of repetitions of the same word-like units in an untranscribed audio stream.
Completely separately, multimodal modeling of images and text has been addressed in the machine learning field during the past decade, with many approaches focusing on accurately annotating objects and regions within images. For example, some approaches rely on images that have been pre-segmented and labelled by their content to estimate joint distributions over words and objects.
Humans learn to speak before they can read or write, so why can't computers do the same?