Recently, speech recognition systems have become increasingly prevalent in the high-technology market. Due to advances both in computer technology and in speech recognition algorithms, these systems have become correspondingly more powerful.
Fundamental to all speech recognition systems is the manner in which the speech signal is represented. Speech signals are often represented according to their characteristics. When characterizing a speech signal, a short-term analysis approach is typically utilized, in which a window, or frame (i.e., a short time interval), is isolated for spectral analysis. By using this short-time analysis approach, speech can be analyzed on a time-varying basis.
One of the simplest representations of a signal that may be used to analyze it on a time-varying basis is its energy, or power. Power provides a good measure for separating voiced speech segments from unvoiced speech segments, since the energy of unvoiced segments is usually much smaller than that of voiced segments. For very high quality speech, power can also be used to separate unvoiced speech from silence.
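As an illustrative sketch only (the frame length, hop size, and synthetic signal below are assumptions, not taken from any particular system), the short-time power measure can be computed frame by frame:

```python
import numpy as np

def short_time_energy(signal, frame_len=256, hop=128):
    """Sum of squared samples in each analysis frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len].astype(float)
        energies.append(np.sum(frame ** 2))
    return np.array(energies)

# Synthetic example: a strong "voiced" segment followed by a weak,
# noise-like "unvoiced" segment.
rng = np.random.default_rng(0)
t = np.arange(4096) / 8000.0
voiced = np.sin(2 * np.pi * 120 * t)          # periodic, high energy
unvoiced = 0.05 * rng.standard_normal(4096)   # aperiodic, low energy
energy = short_time_energy(np.concatenate([voiced, unvoiced]))
```

Frames drawn from the voiced portion show energies orders of magnitude above those from the noise-like portion, which is precisely the separation the measure exploits.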
Another time domain analysis method is based on zero crossing measurements. For digitized speech signals, a zero crossing occurs between two consecutive samples if their signs differ. Zero crossings are often used as an estimate of the frequency content of a speech signal. However, the interpretation of zero crossings as applied to speech is much less precise due to the broad frequency spectrum of most sound signals. Zero crossings are also often used in deciding whether a particular segment of speech is voiced or unvoiced. If the zero crossing rate is high, the segment is likely unvoiced; if the zero crossing rate is low, the segment is most likely voiced.
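A minimal sketch of the zero crossing rate measurement follows; the sampling rate and the two example signals are illustrative assumptions:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

t = np.arange(2048) / 8000.0
low_freq = np.sin(2 * np.pi * 100 * t)    # voiced-like: few crossings
rng = np.random.default_rng(1)
broadband = rng.standard_normal(2048)     # unvoiced-like: many crossings

voiced_rate = zero_crossing_rate(low_freq)
unvoiced_rate = zero_crossing_rate(broadband)
```

The low-frequency periodic signal yields a rate near twice its fundamental frequency divided by the sampling rate, while the broadband noise yields a rate near one half, reflecting the voiced/unvoiced contrast described above.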
Although speech is normally analyzed as a time-varying process, speech may also be viewed on a short-time basis as the convolution of the excitation and vocal tract components associated with speech. A useful technique for separating these convolved components in speech analysis is called "cepstrum" analysis. In cepstrum, or cepstral, analysis, the spectral envelope associated with the speech signal is separated from the periodic component due to the voiced excitation by use of the Fourier transform of the logarithm of the spectrum. A related, well-known analysis technique is linear predictive coding (LPC). For more information, refer to Markel, J. D. and Gray, A. H., Jr., "Linear Prediction of Speech," Springer, Berlin Heidelberg New York, 1976.
Specifically, in cepstral analysis, the log-power spectrum is computed from the speech signal. The cepstrum is then computed by taking the inverse Fourier transform of the log-power spectrum. Next, pitch extraction is performed, wherein a peak is located within a pitch range, a voiced/unvoiced decision is made, and the pitch period is computed. Lastly, the spectral envelope is computed by windowing the cepstrum to remove the pitch effects and then taking the Fourier transform of the windowed cepstrum. In this manner, cepstral analysis yields both the spectral envelope and the pitch period.
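The steps above can be sketched as follows; the sampling rate, frame length, pitch search range, and synthetic harmonic signal are illustrative assumptions rather than parameters of any particular system:

```python
import numpy as np

fs = 8000
n = 1024
t = np.arange(n) / fs
# Crude stand-in for a voiced frame: a 125 Hz fundamental plus harmonics.
x = sum(np.sin(2 * np.pi * 125 * k * t) / k for k in range(1, 32))
x = x * np.hamming(n)

# Step 1: log-power spectrum of the windowed frame.
log_power = np.log(np.abs(np.fft.rfft(x)) ** 2 + 1e-12)

# Step 2: cepstrum = inverse Fourier transform of the log-power spectrum.
cepstrum = np.fft.irfft(log_power)

# Step 3: pitch extraction - locate the cepstral peak within a plausible
# pitch range (60-400 Hz, i.e. quefrencies of fs/400 to fs/60 samples).
lo, hi = fs // 400, fs // 60
peak = lo + int(np.argmax(cepstrum[lo:hi]))
pitch_hz = fs / peak   # lands close to the 125 Hz fundamental

# Step 4: spectral envelope - window (lifter) out the pitch-related
# high-quefrency part, then transform back to the frequency domain.
lifter = np.zeros(n)
lifter[:lo] = 1.0
lifter[-(lo - 1):] = 1.0   # keep the symmetric low-quefrency part
envelope = np.fft.rfft(cepstrum * lifter).real
```

A voiced/unvoiced decision would additionally compare the height of the cepstral peak against a threshold, a strong peak indicating periodic (voiced) excitation.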
A variety of types of speech recognition systems are in use today. One such type is commonly referred to as a continuous, or connected, speech recognition system. Continuous speech recognition systems are hierarchical in that entire phrases and sentences are recognized and grouped together to form larger units, as opposed to the recognition of single words.
In continuous speech, in order to recognize an utterance (i.e., a phrase or sentence), a determination must be made as to where each word begins and ends. Detection of the beginning and ending of an utterance is usually referred to as end point detection. When the signal-to-noise ratio is high, determining the end points is not difficult. However, most speech recognition is not performed in environments with high signal-to-noise ratios. As a result, weak fricatives and low-amplitude voiced sounds occurring at the end points of the utterance become difficult to detect, resulting in recognition errors. Most end point detection schemes of the prior art use some form of energy and zero crossing techniques. However, these prior art techniques are inadequate in dealing with noise, both transient and background.
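A naive endpoint detector of the energy/zero-crossing kind described above can be sketched as follows. The frame size and thresholds are illustrative assumptions, and the reliance on such fixed thresholds is exactly what makes these schemes fragile in noise:

```python
import numpy as np

def find_endpoints(signal, frame_len=200, energy_thresh=0.01, zcr_thresh=0.25):
    """Frame-based endpoint detector combining energy and zero crossings.
    Returns (start, end) sample indices of the detected utterance, or None.
    The thresholds are fixed and illustrative; this is why such detectors
    degrade when transient or background noise is present."""
    active = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len].astype(float)
        energy = np.mean(frame ** 2)
        signs = np.signbit(frame)
        zcr = np.mean(signs[1:] != signs[:-1])
        # Energetic frames are taken as voiced speech; high-ZCR frames with
        # some energy are taken as weak fricatives.
        if energy > energy_thresh or (zcr > zcr_thresh and energy > energy_thresh / 10):
            active.append(start)
    if not active:
        return None
    return active[0], active[-1] + frame_len

# Half a second of silence, a 0.25 s tone burst, then silence again.
fs = 8000
t = np.arange(fs // 4) / fs
burst = 0.5 * np.sin(2 * np.pi * 440 * t)
sig = np.concatenate([np.zeros(fs // 2), burst, np.zeros(fs // 2)])
start, end = find_endpoints(sig)
```

On this clean signal the detected span coincides with the burst; adding noise comparable to the thresholds would shift or destroy the detected end points, illustrating the inadequacy noted above.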
Once the beginning and ending points of the utterances have been identified, the sound must be recognized. Currently, large numbers of words must be matched to the utterance during the recognition process. In an effort to reduce the amount of processing required, vector quantization has been used.
Vector quantization (VQ) techniques have been used to encode and decode speech signals for the purpose of data bandwidth compression. More specifically, in speech recognition systems, vector quantization has been used for preprocessing of speech data as a means of obtaining compact descriptors, employing a relatively sparse set of codebook vectors to represent floating-point feature vectors of large dynamic range. For more information on vector quantization, see Gray, R. M., "Vector Quantization," IEEE ASSP Magazine, Vol. 1, No. 2, April 1984. Once the data has been quantized, a recognition algorithm is used to perform the matching.
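The basic VQ encoding step, replacing each feature vector with the index of its nearest codebook entry, can be sketched as follows; the toy codebook and feature vectors are chosen purely for illustration:

```python
import numpy as np

def quantize(vectors, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean
    distance) for each input vector: the basic VQ encoding step."""
    # dists[i, j] = squared distance from vectors[i] to codebook[j]
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

# Toy 4-entry codebook of 2-dimensional feature vectors.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.1, -0.1], [0.9, 0.2], [0.4, 0.9]])
codes = quantize(features, codebook)   # each frame becomes one small index
```

Each floating-point feature vector is thereby replaced by a single small integer, which is the compression that makes the subsequent matching step tractable.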
As will be shown, the present invention provides a method and apparatus for performing speech activity detection.