In a computerized information retrieval application, a user may desire to locate portions of an audio file, e.g., a taped radio program, that have specific content. If the retrieval application has a text transcript that is aligned with the audio file, then the text can be searched using conventional text query techniques to locate the corresponding portion of the audio file. In effect, the alignment enables direct access into the audio file by words. Audio-to-text alignment can also be used to search a video file (video) when the video includes an audio stream aligned with a text transcript, e.g., the video signal is closed-captioned.
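As an illustration of access into an audio file by words, the following minimal sketch assumes that an aligned transcript is available as a list of words, each carrying a start and end time in seconds. The names AlignedWord and find_phrase, and the sample timings, are hypothetical and chosen only for this example; they are not taken from any particular retrieval system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedWord:
    """One transcript word with its assumed position in the audio, in seconds."""
    text: str
    start: float
    end: float


def find_phrase(transcript, phrase):
    """Return (start, end) time spans where the phrase occurs in the transcript.

    Matching is a plain case-insensitive comparison of consecutive words,
    standing in for whatever text query technique the application uses.
    """
    target = [w.lower() for w in phrase.split()]
    if not target:
        return []
    words = [w.text.lower() for w in transcript]
    n = len(target)
    spans = []
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            # The span runs from the first matched word's start
            # to the last matched word's end.
            spans.append((transcript[i].start, transcript[i + n - 1].end))
    return spans


# Hypothetical aligned transcript; the timings are made up for illustration.
transcript = [
    AlignedWord("good", 0.0, 0.3),
    AlignedWord("evening", 0.3, 0.8),
    AlignedWord("the", 0.9, 1.0),
    AlignedWord("weather", 1.0, 1.4),
    AlignedWord("today", 1.4, 1.9),
]

for start, end in find_phrase(transcript, "the weather"):
    print(f"play audio from {start:.1f}s to {end:.1f}s")
```

The text query itself never touches the audio signal. The returned time span is what allows playback to begin at the matching portion, which is why the quality of the alignment determines the quality of the retrieval.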
Most known alignment methods are extensions of conventional computerized speech recognizers that operate in a very restricted mode to force recognition of the target text. Typically, the alignment is done in a left-to-right manner by moving a recognition window forward in time over the audio signals. The width of the window, measured in time, can be large enough to allow the recognizer to recover from local errors. This type of alignment is probably better characterized as forced recognition.
The problem of forced alignment is different from the problem of recognition. In the case of recognition, the spoken words are unknown and the task is to identify them. With alignment, the text is known, but the time alignment of the text with the spoken words of the audio stream is unknown.
Because they treat alignment as a recognition problem, methods based on forced recognition have several drawbacks and limitations. For example, those methods perform poorly with noisy or otherwise difficult audio streams, as in the case where non-speech audio signals are overlaid on the spoken words. In addition, when the audio stream is long, e.g., an hour or more, there is a very high probability of a gross error in the alignment. Because these methods are based on a single left-to-right pass over the audio stream, a single error early in the pass can cause the remainder of the stream to be misaligned. Furthermore, such methods might not work at all if the text represents only part of the audio stream rather than its entire duration.