1. Technical Field
The technical field relates generally to transcription of content and, more particularly, to systems and methods for automated delivery of transcription products.
2. Background Discussion
Media content (for example, video and audio content) is becoming increasingly prevalent on the internet. Much of this content is time sensitive and transient. For example, news and popular culture videos often receive the vast majority of their internet accesses in the first few days (if not hours) after they are posted online. Also, some educational materials, including video and audio recordings of lectures, are posted on the internet and intended to be consumed almost immediately by students.
For the hearing impaired, individuals with attention deficits, and non-native speakers of the language in which the video/audio content is recorded, this content presents significant challenges. Legislation and regulations often mandate that this content be made accessible to this population of consumers. Typically, content providers make transcripts and captions of this content available to assist this population and, more generally, to increase engagement with the online media, even among users without impairments. Time-coded transcriptions of the content also make possible advanced capabilities such as the interactive transcript plugins and archive search plugins provided by 3Play Media of Cambridge, Mass. Additionally, associating transcripts, descriptive summaries, and internet search keywords with the media content can increase the chances that the content will be found by search engines such as the GOOGLE search engine.
However, given the time-sensitive and transient nature of this content, making transcripts, captions, and these related products available in time for them to be of use to this population is difficult and costly. Often, it is desirable to associate some form of transcription with the media within hours. This can be accomplished using fully automated means to produce a transcript, but it is well known that such automated transcripts often contain too many errors to be of much use to the target population. This is particularly true when the automation is real-time automatic speech recognition (ASR), since the accuracy achievable by state-of-the-art real-time ASR is limited by computational constraints (e.g., CPU and memory availability). Moreover, once real-time ASR transcripts have been made publicly available for media, it is impractical to modify them using, for example, confidence thresholds on the ASR quality. In particular, since real-time ASR systems typically output words, phrases, or sentences in synchronization with the receipt of the audio, these systems are not amenable to automated modification of the output at a later time. Further, when real-time ASR systems do provide a confidence metric, this metric is typically used to modify the visual appearance of the text (e.g., by coloring unconfident words or phrases differently than confident regions) and not to prevent display of the unconfident sections of the transcript.
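The distinction drawn above, between using a confidence metric only to change the appearance of text and using it to withhold unconfident text, can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen for illustration: the `Word` record, the two function names, the 0.8 threshold, the bracket notation standing in for coloring, and the `...` placeholder. It does not represent the output format or behavior of any particular ASR system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical ASR output token with a per-word confidence score."""
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def highlight_unconfident(words: List[Word], threshold: float = 0.8) -> str:
    """Display every word; mark low-confidence words with brackets.

    The brackets stand in for a visual change such as coloring.
    """
    return " ".join(
        w.text if w.confidence >= threshold else f"[{w.text}]"
        for w in words
    )


def suppress_unconfident(words: List[Word], threshold: float = 0.8,
                         placeholder: str = "...") -> str:
    """Withhold low-confidence words, collapsing each run into one placeholder."""
    shown: List[str] = []
    for w in words:
        if w.confidence >= threshold:
            shown.append(w.text)
        elif not shown or shown[-1] != placeholder:
            shown.append(placeholder)
    return " ".join(shown)


# Illustrative input only; the scores are invented for this example.
sample = [
    Word("the", 0.95),
    Word("whether", 0.40),
    Word("report", 0.50),
    Word("today", 0.90),
]
highlighted = highlight_unconfident(sample)
suppressed = suppress_unconfident(sample)
```

With the invented sample above, the first function still shows the likely misrecognized words to the reader, merely marked, while the second withholds them. Applying the second approach presupposes that the complete output can be processed after recognition, which is the capability the passage above identifies as lacking in typical real-time ASR systems.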