1. The Field of the Invention
The present invention relates to systems and methods for transcribing speech. More particularly, the present invention relates to systems and methods for recognizing or transcribing very large vocabulary speech in real-time.
2. The Relevant Technology
Closed captioning is a technology used in video or television programming that displays a text equivalent of the speech spoken during the video or television programming on a monitor such as a television screen. By displaying text, closed captioning permits a television program to be understood in situations where the audio cannot be heard. For example, noisy environments such as airports often enable closed captioning because of the difficulty in hearing the program. Closed captioning is also useful for individuals who are hearing impaired.
The closed captioning text is usually included in the video signal and is typically decoded and displayed to users concurrently with the television program or other video. For many television programs, the speech is known in advance and it is relatively easy to generate the text that corresponds to that speech and encode the text in the video signal. Many television programs, for example, are taped before they are broadcast. In this situation, the speech included in the television programs can be converted to text and inserted in the television program before it is aired. Preparing the closed captioning text is not subject, in this situation, to any stringent time constraints.
However, there are many situations where the speech is not known in advance and it is much more difficult to provide closed captioning in these situations. Broadcast news programs, for example, suffer from this problem. Broadcast news programs are often broadcast live and the speech of the broadcasters is therefore not usually known beforehand. The speech must be transcribed as it is spoken before it can be inserted in the video signal and displayed to end users using closed captioning technology. Because of these time constraints, closed captioning text frequently includes errors and there is often a delay between the speech and the corresponding transcribed text.
Sometimes, transcribing the speech to closed captioning text is performed by humans. Alternatively, automatic speech recognition (ASR) systems are used to aid or perform the speech transcription. One goal of these ASR systems is to transcribe the speech in real-time. This is a formidable task for a variety of reasons. For example, each broadcaster has a different voice and training an ASR system with a large number of different voices is time-consuming. Another factor that makes real-time speech transcription difficult is that the speech included in broadcast news programs has a very large vocabulary. These and other factors combine to make real-time speech transcription difficult to achieve. Those systems that have achieved real-time transcription typically have unacceptable error rates or other latency problems not directly related to the speech transcription.
For these reasons, a considerable amount of research has recently been performed in Broadcast News Transcription, of which the DARPA Hub 4 evaluations are examples. The Hub 4 systems, however, were developed to maximize accuracy and usually run in excess of 100 times real-time (100×RT). A 100×RT system requires 100 seconds to process or transcribe 1 second of speech. There are also systems that can run at 10×RT or faster with small degradations in accuracy.
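The ×RT arithmetic above can be illustrated with a minimal sketch. The function names `real_time_factor` and `processing_time` are hypothetical and chosen only for this illustration; the 30-minute program length is likewise an assumed example value.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return the xRT factor: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time(rt_factor: float, audio_seconds: float) -> float:
    """Return the seconds needed to transcribe audio at a given xRT factor."""
    return rt_factor * audio_seconds


# A 100xRT system needs 100 seconds to transcribe 1 second of speech.
hub_seconds = processing_time(100.0, 1.0)

# Assumed example: a 30-minute program transcribed at 10xRT, in hours.
news_hours = processing_time(10.0, 30 * 60) / 3600
```

Under these assumptions, a factor of 1.0 or less corresponds to real-time operation, since the transcription then takes no longer than the speech itself.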
As transcription systems using ASR approach real-time operation, the accuracy begins to degrade and word error rates increase. There is therefore a large margin for improvement in real-time speech transcription. In fact, much progress needs to be made before automatic speech recognition systems can replace humans for real-time closed captioning, where the human error rate is approximately 10%.
Another problem that impacts the ability of a transcription system to transcribe speech in real-time is latency. Latency is often defined as the delay between the input and output of the speech recognition system, and it is distinct from the speed at which the system processes speech. Thus, real-time performance of speech recognition does not guarantee low latency. For instance, a speech recognition system may have a preprocessing component that operates on the whole input. In this case, the latency is related to the length of the input, regardless of how fast the speech recognition system can transcribe the speech.
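The difference between processing speed and latency can be sketched with a simplified model. The model is an assumption made for illustration only: it ignores recognition time entirely and gives a lower bound on how long a spoken word must wait before its text can be output. The function names and the 2-second chunk size are hypothetical.

```python
import math


def min_latency_whole_input(input_seconds: float, word_time: float) -> float:
    """Lower bound on latency for a word spoken at word_time when a
    preprocessing stage must see the entire input before any output."""
    return input_seconds - word_time


def min_latency_chunked(chunk_seconds: float, word_time: float) -> float:
    """Lower bound on latency when preprocessing only needs the chunk
    that contains the word (chunks start at time 0)."""
    chunk_end = (math.floor(word_time / chunk_seconds) + 1) * chunk_seconds
    return chunk_end - word_time


# First word of an assumed 30-minute (1800 s) program:
whole = min_latency_whole_input(1800.0, 0.0)   # waits for the whole input
chunked = min_latency_chunked(2.0, 0.0)        # waits for one 2 s chunk
```

In this sketch the whole-input bound grows with the length of the input, while the chunked bound never exceeds one chunk length, no matter how fast the recognizer itself runs.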
In another instance, latency can be introduced by segmentation of the speech being recognized and transcribed. A segmenter, for example, generates small segments that are fed to a front-end. If a segmenter operates on an entire show, then the latency is the length of the show. Often, segmenters use a distance metric between two adjacent speech windows. In this case, the latency is at least the length of the speech windows over which the distance metric is computed, which is typically 2 to 4 seconds.
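A segmenter of this kind can be sketched as follows. This is an illustrative assumption, not the method of any particular system: each frame is reduced to a single number, the distance metric is the absolute difference of the window means, and the names `window_distance`, `find_boundaries`, and `decision_delay_seconds` are hypothetical.

```python
def window_distance(frames, boundary, window):
    """Distance between the windows just before and just after a boundary.
    Assumed metric: absolute difference of the two window means."""
    left = frames[boundary - window:boundary]
    right = frames[boundary:boundary + window]
    return abs(sum(left) / len(left) - sum(right) / len(right))


def find_boundaries(frames, window, threshold):
    """Return frame indices where adjacent windows differ by more than threshold."""
    boundaries = []
    # A boundary at index b needs frames up to b + window - 1, so it cannot
    # be decided until a full window of later speech has arrived.
    for b in range(window, len(frames) - window + 1):
        if window_distance(frames, b, window) > threshold:
            boundaries.append(b)
    return boundaries


def decision_delay_seconds(window_frames, frame_seconds):
    """Minimum delay before a boundary decision: one full window of speech."""
    return window_frames * frame_seconds


# Toy signal with a single change after six frames.
frames = [0.0] * 6 + [1.0] * 6
boundaries = find_boundaries(frames, window=3, threshold=0.9)

# Assumed framing: 200 frames of 10 ms each give a window of about 2 seconds.
delay = decision_delay_seconds(200, 0.01)
```

Because the right-hand window must be complete before the distance can be computed, the delay in this sketch is never less than one window length, which matches the 2 to 4 second figure when windows of that length are used.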