Rapidly and accurately transcribing long audio recordings, such as producing same-day transcripts of multi-hour legal depositions, is a challenging task. One approach to this challenge is hybrid transcription, in which an automatic speech recognition (ASR) system generates an initial transcription that human transcribers then review and correct. However, despite advances in ASR algorithms, transcriptions still typically contain many errors, especially when the audio includes low-probability words (e.g., uncommon names or technical terms), when a speaker does not speak clearly or speaks with an unfamiliar accent, or when the audio quality is poor. In such scenarios, it falls to the human transcribers to correct the many deficiencies in the automatically generated transcriptions.
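The hybrid workflow described above can be sketched in code. The following is an illustrative sketch only, not an implementation from the source: the word-level output format (words with timestamps and confidence scores) and the 0.80 confidence threshold are assumptions chosen for the example.

```python
# Hypothetical hybrid-transcription helper: flag low-confidence ASR words so a
# human transcriber can review only those segments instead of the whole audio.

CONFIDENCE_THRESHOLD = 0.80  # assumed cutoff; a real system would tune this


def flag_for_review(asr_words, threshold=CONFIDENCE_THRESHOLD):
    """Return (word, start_time) pairs whose ASR confidence falls below threshold."""
    return [(w["word"], w["start"]) for w in asr_words if w["confidence"] < threshold]


# Example ASR output: word-level hypotheses with timestamps and confidences.
asr_output = [
    {"word": "the",        "start": 0.0, "confidence": 0.99},
    {"word": "deposition", "start": 0.3, "confidence": 0.95},
    {"word": "of",         "start": 1.1, "confidence": 0.98},
    {"word": "Kowalczyk",  "start": 1.3, "confidence": 0.41},  # uncommon name
]

review_queue = flag_for_review(asr_output)
print(review_queue)  # → [('Kowalczyk', 1.3)]
```

Under this sketch, the human transcriber is directed only to the segments the ASR system was least sure about, such as the uncommon surname above.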
However, even a human transcriber may have difficulty resolving some of the challenging utterances that overwhelmed the ASR system. The transcriber may need to listen repeatedly to a particularly unclear utterance, or spend considerable time researching the correct spelling of a certain name or technical term. This extra effort can greatly increase both the expense of the transcription and the time required to complete it. Thus, there is a need for a way to assist human transcribers in resolving the difficult utterances they encounter, in order to reduce the cost and turnaround time involved in generating transcriptions.