The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for performing captioning of multimedia content, e.g., content comprising both audio and video tracks, using acoustic profiles derived from social network sources.
Captioning of audio and/or multimedia content is widely used to generate a text alternative to an audio track. The resulting text alternative can be used to perform various types of analysis, such as classification of the content, searching of the content, and the like. To achieve such captioning, Automatic Speech Recognition (ASR) is often used. ASR, also known as “speech recognition,” “speech to text,” “computer speech recognition,” and the like, utilizes personalized speech profiles, typically obtained through training and configuration of the ASR system, to recognize spoken words in an audio track and correlate those spoken words to a text equivalent. The training of such ASR systems involves an individual speaker reading sections of text into the ASR system. The ASR system captures the speech patterns of the individual speaker and generates a data representation of these speech patterns, which can later be used as a basis for analyzing speech input, for example by performing pattern matching or the like.
The personalized speech profile for a speaker may include a variety of information to configure the ASR system for better quality of results. Such information may include, for example, data representing the voice and speaking style of the speaker (e.g., pronunciations and idiomatic phrases), background noises for a normal voice environment (e.g., a fan, the hum of air conditioning, or other office sounds), region-specific information for local accents and phrases (e.g., English-U.S., English-British, English-Australian, or English-Indian), and a business domain such that the ASR system can use a domain-specific vocabulary (e.g., a vocabulary specializing in medical or legal terminology).
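The categories of profile information listed above can be pictured as a simple record. The following is a minimal illustrative sketch only, not the data format of any particular ASR system; the class name `SpeechProfile`, its field names, the locale codes, and the `vocabulary_name` helper are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SpeechProfile:
    """Hypothetical container for the profile information described above."""

    speaker_id: str
    # Region-specific setting for local accents and phrases,
    # e.g. "en-US", "en-GB", "en-AU", "en-IN" (assumed encoding).
    locale: str
    # Business domain that selects a domain-specific vocabulary.
    domain: str = "general"
    # Voice and speaking style: word -> speaker-specific pronunciation.
    pronunciations: Dict[str, str] = field(default_factory=dict)
    idiomatic_phrases: List[str] = field(default_factory=list)
    # Background noises expected in the speaker's normal voice environment.
    background_noises: List[str] = field(default_factory=list)

    def vocabulary_name(self) -> str:
        """Return an assumed lookup key combining locale and domain."""
        return f"{self.locale}-{self.domain}"


# Example: a U.S. English speaker dictating medical notes in an office.
profile = SpeechProfile(
    speaker_id="speaker-001",
    locale="en-US",
    domain="medical",
    background_noises=["fan", "air conditioning hum"],
)
```

Every field in this sketch is fixed before recognition begins, which is the sense in which such a configuration is static.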
While ASR systems work well in controlled environments, ASR does not work well for audio and/or video captioning where the environment in which the audio track is captured is not known beforehand. Taking a video segment recorded at an outside location as an example, the video segment will include not only the visual data but also the audio tracks corresponding to the visual data. In such a situation, ASR systems cannot be configured using known mechanisms because the quality of the audio track, the speakers involved in the audio track, and the background audio in the video are all unknown beforehand. A video may, for example, contain more than one speaker speaking on different subjects with different background noises, which makes the static configuration of an ASR system unusable, or at best problematic, with regard to the quality of the results that are obtained.