1. Technical Field
The present disclosure relates to transcription and more specifically to aligning transcription accurately with speech.
2. Introduction
Many live television or radio shows include closed captioning, or text appearing at the bottom of the screen to assist the hearing impaired in understanding the show. Closed captions can also be of use where people cannot hear the audio clearly, such as on televisions in a bar or a busy airport terminal. Closed captions are typically manually generated in one of two ways.
The first way is to manually transcribe what is said in real time or near real time. This approach introduces a delay of several seconds or longer because even the fastest human transcriptionist must first hear and understand the speech, then type a transcription of it. While faster transcription reduces the delay, the delay still exists. This delay causes problems for searching and indexing because a search that matches a caption returns a point in the media later than the actual matching speech. Further, this delay can confuse or annoy viewers who watch video on the screen while reading captions that are out of sync with it.
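The effect of the delay on searching can be illustrated with a minimal sketch. The `Caption` type, the `search_captions` helper, the sample text, and the 4.5 second delay are all hypothetical values chosen for illustration; they do not come from any particular captioning system.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    text: str
    shown_at: float  # seconds into the media when the caption appeared


def search_captions(captions, query):
    """Return the display time of the first caption containing the query, or None."""
    needle = query.lower()
    for caption in captions:
        if needle in caption.text.lower():
            return caption.shown_at
    return None


# Assumed values for illustration only.
SPOKEN_AT = 12.0       # when the phrase was actually spoken, in seconds
ASSUMED_DELAY = 4.5    # hypothetical transcription delay, in seconds

captions = [
    Caption("good evening and welcome", 3.0 + ASSUMED_DELAY),
    Caption("tonight's top story", SPOKEN_AT + ASSUMED_DELAY),
]

hit = search_captions(captions, "top story")
offset = hit - SPOKEN_AT  # how far past the actual speech the search result lands
```

Under these assumptions, the search lands at the time the caption was displayed rather than the time the words were spoken, so playback starting from the search result would begin after the matching speech has already passed.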
The second way is to manually generate captions offline. This approach can be used in the case of pre-produced television shows, movies, etc. In this case, instantaneous alignment of a transcription with its corresponding position in the offline media is not necessary because a user can pause, rewind, fast forward, etc. through the offline media. This approach is potentially more accurate than transcription in real time, but can also be very time-consuming and expensive.