Despite recent improvements in automatic speech recognition (ASR), fully automated transcription with ASR systems still does not achieve accuracy comparable to that of manual transcription. Manual transcription, however, is both time-consuming and expensive, so hybrid transcription approaches have emerged. In hybrid transcription, a human transcriber reviews a transcription generated by an ASR system and corrects the errors found in it. Hybrid transcription can thus combine the speed of automated transcription with the accuracy gained from human review.
Because reviewing a transcription involves listening to the audio, possibly multiple times for certain segments, even hybrid transcription can be slow. This becomes problematic when large audio files must be transcribed quickly, such as same-day transcription of hours-long legal depositions. To reduce the time needed to complete large transcription jobs, multiple transcribers may work simultaneously on different segments of a large audio file. Such parallel work, however, can be inefficient and can even introduce inconsistencies into the resulting transcription.
Transcription errors often occur when uncommon terms (e.g., names or phrases) are mentioned in the audio. Such terms, which occur very rarely in general audio, are often difficult to resolve for both ASR systems and humans. An ASR system will frequently transcribe these terms incorrectly, leaving it to the human transcriber reviewing the ASR output to determine their correct transcription. This resolution, which may also involve determining the correct spelling, can be difficult and time-consuming, and may require research by the transcriber.
Since such uncommon terms may recur throughout a large audio file, when multiple transcribers work in parallel on different segments, many of them are likely to encounter the same problematic terms. Each transcriber may spend time resolving these terms independently and may reach a different result (e.g., different transcribers may spell the same name differently). This wastes work hours that could otherwise be saved, and it also introduces inconsistencies: when the transcriptions of the various segments are combined, different utterances of the same phrase may be transcribed differently, appearing as different terms or as different spellings of the same term. There is therefore a need for a way to share knowledge among transcribers working in parallel on a large audio file.