Exemplary embodiments of this invention relate to editing and particularly to editing errors in transcription and/or translation.
Increased bandwidth availability for web and cell phone applications has resulted in a proliferation of audio and video data. The growing quantity of audio and video information creates a correspondingly growing need for transcription capability. Transcription of audio ensures that multimedia materials are accessible to all users, including users who are deaf or hard of hearing. Transcription also enables users who are “situationally disabled” to gain access to needed information. For example, users with access to only low-bandwidth transmission capability can read text streams even when full-bandwidth video is not an option. Transcription of audio is also a prerequisite for providing a number of other high-value capabilities, such as translation, summarization, and search.
Manual transcription options remain expensive and require highly skilled and scarce labor, such as stenographers. Automated speech transcription is steadily improving, with word error rate reductions of as much as 30% per annum on specific data types. Nonetheless, full transcription availability for unlimited-domain audio materials remains a distant goal. For example, current automated speech transcription accuracy rates for broadcast news presented by a single talker are approximately 80%. Accuracy rates with multiple speakers or under degraded audio conditions are considerably worse.
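For illustration only, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the automated output, divided by the number of reference words. The following is a minimal sketch under that conventional definition. The function name `word_error_rate` and the sample sentences are hypothetical and are not part of the described embodiments. Treating accuracy as one minus the word error rate is likewise a simplifying assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption, and real scoring tools normalize text more carefully.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Sample data (hypothetical): one substituted word and one dropped word
# against a ten-word reference.
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown fax jumps over lazy dog today"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}, approximate accuracy = {1 - wer:.2f}")
```

Under this simplified reading, an accuracy rate of approximately 80% corresponds to roughly one word in five requiring correction.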
There is a gap between speech automation performance and acceptable transcription requirements for captioning. As a result, speech technology is not incorporated in captioning processes, and expensive manual procedures are chosen instead. This situation results in another gap, in which most audio and video information that is generated remains untranscribed, untranslated, unsummarized, and unsearchable.
One standard methodology for enhancing speech-automated outputs is human editing of erroneous results. While promising in principle, this approach has demonstrated limited value. A speech-automated transcript of one hour of audio with an 80% text accuracy rate requires 5 hours of human editing, using current methods, to achieve perfect accuracy. Similar challenges exist for machine translation that is supplemented by human editors. The multiple hours of editing that are demanded reduce the attractiveness of incorporating automatic speech recognition or machine translation into these processes. In order to advance speech recognition and machine translation as viable options, the accuracy of these tools must increase and/or the burden and expense of editing and repair must decrease.
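To make the scale of this editing burden concrete, the arithmetic can be sketched as follows. The one hour of audio, the 80% accuracy rate, and the 5 hours of editing are the figures given above. The speaking rate of 150 words per minute is purely an assumed value for illustration, and the function name `editing_burden` is hypothetical.

```python
def editing_burden(audio_hours: float, error_rate: float,
                   editing_hours: float, words_per_minute: float):
    """Return (estimated word errors, editing seconds spent per error).

    words_per_minute is an assumed speaking rate, not a figure taken
    from the text above.
    """
    total_words = audio_hours * 60 * words_per_minute
    errors = total_words * error_rate
    seconds_per_error = editing_hours * 3600 / errors
    return errors, seconds_per_error


# One hour of audio, 20% of words in error (80% accuracy), 5 hours of
# editing, and an ASSUMED rate of 150 spoken words per minute.
errors, seconds_per_error = editing_burden(1, 0.2, 5, 150)
print(f"about {errors:.0f} errors, {seconds_per_error:.0f} s of editing per error")
```

Under these assumptions, the 5:1 ratio of editing time to audio time implies several seconds of editor effort for every erroneous word, which is the burden that must decrease for human-edited automation to be attractive.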
It would be desirable to have a bridge between what speech automation technology can currently handle and what can best be handled through human mediation. It would also be desirable to exploit the human component most efficiently and most cost-effectively, while simultaneously enhancing the speech automation technologies.