The present disclosure relates generally to audio-based display indicia in media playback and, more specifically, to methods, systems, and processes for multimodal speech recognition for real-time application of audio-based display indicia to media content.
A video or other media item may include audio-based display indicia, such as subtitles or closed captions. The subtitles or closed captions can provide a translation or a transcript of the spoken dialogue and/or sounds in the content of the media being played back, such as a video, and, optionally, the audio-based display indicia may contain other information that provides context and/or cues to a viewer regarding the content of the media. Closed captions may be useful to hearing-impaired viewers. Subtitles may be useful for viewing foreign-language videos or for viewing videos in a noisy environment.
Live captioning may be performed manually, with a person or operator listening to the content, recognizing the spoken words, and typing them in real-time. Other solutions may involve general-purpose automated transcription of speech in real-time. Variations in media types, content, and the like, however, may prevent such automated solutions from being viable or effective.
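The real-time automated captioning pipeline described above can be illustrated with a minimal sketch. The example below is hypothetical and not taken from the disclosure: it processes an audio stream in fixed-length chunks, passes each chunk to a recognizer, and attaches playback timestamps so the resulting text can be displayed as captions. The `fake_recognize` function is a stand-in for a real automatic speech recognition (ASR) engine.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Caption:
    """A single timed caption entry, in seconds of playback time."""
    start_s: float
    end_s: float
    text: str


def caption_stream(chunks: Iterable[bytes],
                   recognize: Callable[[bytes], str],
                   chunk_s: float = 2.0) -> List[Caption]:
    """Run a recognizer over fixed-length audio chunks and timestamp the output.

    Chunks that produce no text (e.g. silence) advance the clock but
    emit no caption.
    """
    captions: List[Caption] = []
    t = 0.0
    for chunk in chunks:
        text = recognize(chunk)
        if text:
            captions.append(Caption(t, t + chunk_s, text))
        t += chunk_s
    return captions


def fake_recognize(chunk: bytes) -> str:
    """Stub recognizer standing in for a real ASR engine (hypothetical)."""
    return chunk.decode("utf-8", errors="ignore").strip()


caps = caption_stream([b"hello world", b"", b"next line"], fake_recognize)
```

In practice, a production system would replace `fake_recognize` with a streaming ASR model and would also have to handle the variations in media type and content noted above, which is where general-purpose transcription tends to fall short.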