1. Field of the Invention
The present invention relates to the field of speech processing and, more particularly, to a noise playback enhancement of prerecorded audio for speech recognition operations.
2. Description of the Related Art
Speech recognition engines convert audio containing speech into textual equivalents of the speech. Accuracy in performing the speech-to-text conversions is crucial to the success of a speech recognition engine. Accuracy of a speech recognition engine is typically evaluated by feeding prerecorded audio into the engine.
Behavior of a speech recognition engine when fed the prerecorded audio can depend upon a listening mode of the engine. Common listening modes include a push-to-talk mode, a push-to-activate mode, and an always-listening mode. In a push-to-talk mode, a user explicitly notifies a speech recognition engine when to start and stop listening to speech. Speech provided between the “start” and “stop” points is speech-to-text converted, while other speech is ignored by the speech recognition engine. In a push-to-activate mode, a user notifies a speech recognition engine when to start listening to speech, but the speech recognition engine is responsible for detecting the end of speech. In an always-listening mode, a speech recognition engine is responsible for automatically detecting when to start listening and when to stop listening to speech.
A speech recognition engine operating in a push-to-activate or an always-listening mode typically relies upon some amount of nonspeech audio, referred to as “noise,” to detect the end of speech or the end of an utterance. The amount of “noise” that must follow an utterance in order for an end of utterance detection to occur is not deterministic.
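For illustration only, the dependence of end of utterance detection on trailing “noise” can be sketched as follows. The sketch assumes a simple energy-based detector with a fixed threshold and a fixed count of required nonspeech frames; the function name, the parameters, and the per-frame energy representation are hypothetical and are not taken from any particular speech recognition engine. Real engines may use criteria that vary with the audio, which is why the required amount of “noise” is not deterministic.

```python
def detect_end_of_utterance(frame_energies, speech_threshold, required_noise_frames):
    """Return the index of the frame at which an end of utterance is declared.

    Returns None when the trailing nonspeech audio is too short to trigger
    the event. All parameters are illustrative assumptions.
    """
    in_speech = False   # becomes True once a speech frame has been seen
    noise_run = 0       # consecutive nonspeech frames since the last speech frame
    for index, energy in enumerate(frame_energies):
        if energy >= speech_threshold:
            in_speech = True
            noise_run = 0
        elif in_speech:
            noise_run += 1
            if noise_run >= required_noise_frames:
                return index
    return None
```

Under these assumptions, an utterance followed by fewer than `required_noise_frames` nonspeech frames never produces an end of utterance event, which is the failure case that short trailing “noise” segments cause.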
When testing, measuring or training a speech recognition engine, audio streams containing leading and trailing noise suitable for each mode must be used. The reason is that when a trailing noise segment is insufficiently long to generate an end of utterance event, a corresponding speech utterance is not properly handled. Thus, a speech recognition engine in a push-to-activate or an always-listening mode is unable to be accurately tested/measured/trained using prerecorded audio that includes an insufficient amount of trailing “noise” after each speech utterance.
To ensure accurate and repeatable results, prerecorded audio files are typically used. Unfortunately, the costs of obtaining, storing and utilizing audio recordings for the purposes of testing, measuring or training a speech recognition engine can be directly proportional to the length of the recordings. Prerecorded audio containing utterances and corresponding trailing “noise” segments that are sufficiently long for one mode may be unnecessarily long for another, and can result in needless delays when transferring the audio streams to and from the devices under test. These delays may be significant when tens of thousands of audio files are used. On the other hand, tailoring prerecorded audio streams for each mode significantly increases storage requirements since each tailored file, except for the leading and trailing noise, is basically a duplicate.
One conventional solution to the above problem is to record and store a “noise” recording for each speech recording, where the noise recording is of sufficient length for any speech recognition engine to detect an end of utterance. This solution is disfavored as it is expensive to produce and store a noise recording having a “safe” duration for each speech recording. Consequently, most prerecorded audio used for speech recognition engines includes a minimal amount of “noise.”
Another conventional solution is to alternately send two different audio feeds to a speech recognition engine. The first feed contains prerecorded speech utterances with minimal noise between utterances, and the second feed contains pure “noise.” Notably, the first audio feed can be formed using one or more audio files, each file containing at least one utterance.
A first speech utterance from the first feed is played to the speech recognition engine, then the first feed is paused and the second feed is started. The second feed, or noise feed, is played until either an end of utterance event or a time-out event occurs. Then the second feed is stopped and the first feed is played for the second speech utterance. The process repeats with the same “noise” feed being used for each utterance.
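The alternating procedure described above can be sketched as follows. This is a minimal sketch under stated assumptions: audio is modeled as a list of per-frame energy values, the time-out is expressed as a maximum number of noise frames, and `ToyEngine` is a hypothetical stand-in for a speech recognition engine that reports an end of utterance after a fixed run of nonspeech frames. None of these names come from an actual engine interface.

```python
from itertools import cycle, islice


class ToyEngine:
    """Hypothetical engine stand-in: fires after a fixed run of nonspeech frames."""

    def __init__(self, speech_threshold, required_noise_frames):
        self.speech_threshold = speech_threshold
        self.required_noise_frames = required_noise_frames
        self.in_speech = False
        self.noise_run = 0

    def feed(self, energy):
        """Consume one frame; return True when an end of utterance is detected."""
        if energy >= self.speech_threshold:
            self.in_speech = True
            self.noise_run = 0
            return False
        if not self.in_speech:
            return False
        self.noise_run += 1
        if self.noise_run >= self.required_noise_frames:
            self.in_speech = False
            self.noise_run = 0
            return True
        return False


def run_two_feed_test(utterances, noise_frames, engine, max_noise_frames):
    """Play each utterance, then the shared noise feed until an event or a time-out."""
    results = []
    for utterance in utterances:
        # First feed: play one speech utterance, then pause the feed.
        for frame in utterance:
            engine.feed(frame)
        # Second feed: loop the same noise recording, bounded by the time-out.
        outcome, played = "timeout", 0
        for frame in islice(cycle(noise_frames), max_noise_frames):
            played += 1
            if engine.feed(frame):
                outcome = "end_of_utterance"
                break
        results.append((outcome, played))
    return results
```

Because the same `noise_frames` sequence is reused after every utterance, the sketch also makes the limitation of this approach visible: every utterance is followed by an identical kind of “noise.”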
Tests/measurements/training based upon a single “noise” feed do not provide realistic results in all cases, as speech recognition engines used in real world environments must handle many different types of noise. Additionally, this solution can require all prerecorded speech utterances to be normalized to the same level as the noise recording. Normalizing the utterances can be expensive and can also introduce errors that decrease result reliability.
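As a rough illustration of the normalization step and of one way it can introduce errors, the following sketch scales an utterance to a target level. It assumes that “level” means root-mean-square (RMS) amplitude and that samples are floating-point values limited to the range -1.0 to 1.0; both are assumptions made for the example, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples (0.0 if empty)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_to_level(samples, target_rms, peak_limit=1.0):
    """Scale samples so their RMS matches target_rms, clipping at peak_limit."""
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # silent input cannot be scaled to a level
    gain = target_rms / current
    # Clipping keeps samples in range but distorts the waveform when gain is large.
    return [max(-peak_limit, min(peak_limit, s * gain)) for s in samples]
```

When the required gain pushes samples past the peak limit, the clipped output no longer matches the target level and the waveform is distorted, which is one example of the kind of error that can reduce result reliability.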