1. Technical Field
The invention is related to automatically determining when speech begins in a signal such as an audio signal, and in particular, to a system and method for accurately detecting speech onset in a signal by examining multiple signal frames in combination with signal time compression for delaying a speech onset decision without increasing average signal delay.
2. Related Art
The detection of the boundaries or endpoints of speech in a signal, such as an audio signal, is useful for a large number of conventional speech-related applications. Such applications include, for example, encoding and transmission of speech, speech recognition, and speech analysis. In most of these schemes, it is desirable to process speech in as close to real-time as possible, or to process as few non-speech components of the signal as possible, so as to minimize computational overhead. In fact, for most such conventional systems, both inaccurate speech endpoint detection and inclusion of non-speech components of the signal have an adverse effect on overall system performance.
There is a wide variety of schemes for detecting speech endpoints in a signal. For example, one commonly used scheme relies on short-time or spectral energy components of the signal to identify speech within that signal. Often, an adaptive threshold based on features of an energy profile of the signal is used to discriminate between speech and background noise in the signal. Unfortunately, such schemes tend to cut off the ends of words in both noisy and quiet environments. Other endpoint detection schemes include examining signal entropy and using neural networks to separate speech from background noise in the signal.
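By way of illustration only, the following minimal sketch shows one possible form of an energy-based scheme with an adaptive threshold. It is not taken from any particular conventional system. The function names `frame_energy` and `energy_vad`, the parameters `margin` and `alpha`, and the use of an exponential moving average as the noise floor are all assumptions made for this example.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Short-time energy of one frame: mean squared sample amplitude."""
    return sum(x * x for x in frame) / len(frame)


def energy_vad(frames: Sequence[Sequence[float]],
               margin: float = 3.0,
               alpha: float = 0.9) -> List[bool]:
    """Label each frame as speech (True) or non-speech (False).

    Assumptions for this sketch (hypothetical, not from the source text):
    - the first frame is background noise and seeds the noise floor;
    - the adaptive threshold is margin * noise_floor;
    - the noise floor is an exponential moving average, updated only
      on frames labeled non-speech.
    """
    flags: List[bool] = []
    noise_floor = None
    for frame in frames:
        energy = frame_energy(frame)
        if noise_floor is None:
            # Seed the noise floor; avoid a zero floor for digital silence.
            noise_floor = max(energy, 1e-12)
        is_speech = energy > margin * noise_floor
        if not is_speech:
            # Track slow changes in the background noise level.
            noise_floor = alpha * noise_floor + (1.0 - alpha) * energy
        flags.append(is_speech)
    return flags


if __name__ == "__main__":
    quiet = [0.01] * 4
    loud = [0.5] * 4
    print(energy_vad([quiet, quiet, quiet, loud, loud, quiet, quiet]))
```

Because the decision in this sketch depends only on the energy of the current frame, a low-energy word ending can fall below the threshold, which is consistent with the word cut-off behavior noted above.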
As noted above, the detection of speech endpoints in a signal is central to a number of applications. Clearly, identifying the endpoints of speech in the signal requires identifying both the onset and the termination of speech within that signal. Typically, analysis of several signal frames may be required to reliably detect speech onset and termination in the signal, even in a relatively noise-free signal.
Further, many conventional speech detection schemes continue to encode signal frames as speech for a few frames after relative silence is first detected in the signal. In this manner, the endpoint or termination of speech in the signal is usually captured by the speech detection scheme at the cost of simply encoding a few extra signal frames. Unfortunately, since it is unknown when speech will begin in a real-time signal, performing a similar operation for capturing speech onset typically presents a more complex problem.
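The practice of continuing to encode a few frames after silence is first detected can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `apply_hangover` and the parameter `hangover` are hypothetical, and the input is assumed to be a list of per-frame speech decisions produced by some separate detector.

```python
from typing import List, Sequence


def apply_hangover(flags: Sequence[bool], hangover: int = 3) -> List[bool]:
    """Extend each run of speech frames by up to `hangover` extra frames.

    `flags` holds per-frame speech decisions from any detector
    (an assumption of this sketch). The returned list marks which
    frames would be encoded as speech.
    """
    encoded: List[bool] = []
    remaining = 0  # extra frames still to encode after the last speech frame
    for is_speech in flags:
        if is_speech:
            remaining = hangover  # re-arm the hangover on every speech frame
            encoded.append(True)
        elif remaining > 0:
            remaining -= 1        # silence detected, but keep encoding
            encoded.append(True)
        else:
            encoded.append(False)
    return encoded


if __name__ == "__main__":
    decisions = [False, True, True, False, False, False, False]
    print(apply_hangover(decisions, hangover=2))
```

The extension in this sketch runs forward in time only. Applying the same idea at speech onset would require frames that precede the detection, which is why onset capture is the harder case in a real-time signal.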
In particular, some schemes address the onset detection problem by simply buffering a number of signal frames until speech onset is detected in the signal. At that point, these schemes encode the signal beginning with a number of the buffered frames so as to more reliably capture actual speech onset in the signal. Unfortunately, one problem with such schemes is that transmission or processing of the signal is typically delayed by the length of the signal buffer, thereby increasing overall signal delay or computational overhead. Attempts to reduce the average signal delay typically involve reducing buffer size in combination with better speech detection algorithms. However, the delay due to the use of a buffer still exists. Some schemes have attempted to address this problem by eliminating the buffer entirely, or by using a very small signal buffer. However, these schemes frequently chop off some small portion of the beginning of the speech in the signal, and audible artifacts are often produced in the decoded signal as a result.
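The buffering approach described above can be illustrated with the following minimal sketch. The function name `select_with_lookback` and the parameter `lookback` are hypothetical, and the per-frame speech decisions are assumed to come from a separate detector. The sketch only selects which frame indices would be encoded; it does not model the encoder or the transmission channel.

```python
from collections import deque
from typing import Deque, List, Sequence


def select_with_lookback(flags: Sequence[bool], lookback: int = 2) -> List[int]:
    """Return indices of frames to encode, including buffered pre-onset frames.

    Non-speech frames are held in a bounded buffer. When a speech frame
    arrives, the buffered frames are released ahead of it, so that up to
    `lookback` frames preceding the detected onset are also encoded.
    """
    pending: Deque[int] = deque(maxlen=lookback)  # most recent non-speech frames
    selected: List[int] = []
    for index, is_speech in enumerate(flags):
        if is_speech:
            selected.extend(pending)  # release frames just before the onset
            pending.clear()
            selected.append(index)
        else:
            pending.append(index)     # oldest entry is dropped when full
    return selected


if __name__ == "__main__":
    decisions = [False, False, False, True, True, False, False, False, False]
    print(select_with_lookback(decisions, lookback=2))
```

In this sketch, a larger `lookback` captures more of the signal preceding the detected onset, but in a real-time setting each frame must be held until its status is known, which corresponds to the buffer delay discussed above. Setting `lookback` to zero removes that delay at the risk of clipping the beginning of the speech.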
Therefore, what is needed is a system and method that provides for robust and accurate speech onset detection in a signal while minimizing average signal delay resulting from the use of a signal frame buffer.