Modern human communication increasingly relies on the transmission of digital representations of acoustic speech over large distances. This digital representation contains only a fraction of the information about the human voice, and yet humans are perfectly capable of understanding a digital speech signal.
Some communication systems, such as automated telephone attendants and other interactive voice response systems (IVRs), rely on computers to understand a digital speech signal. Such systems recognize the sounds as well as the meaning inherent in human speech, thereby extracting the speech content of a digitized acoustic signal. In the medical and health care fields, correctly extracting speech content from a digitized acoustic signal can be a matter of life or death, making accurate signal analysis and interpretation particularly important.
One approach to analyzing a speech signal to extract speech content is based on modeling the acoustic properties of the vocal tract during speech production. Generally, during speech production, the configuration of the vocal tract determines an acoustic speech signal made up of a set of speech resonances. These speech resonances can be analyzed to extract speech content from the speech signal.
In order to determine accurately the acoustic properties of the vocal tract during speech production, both the frequency and the bandwidth of each speech resonance are required. Generally, the frequency corresponds to the size of the cavity within the vocal tract, and the bandwidth corresponds to the acoustic losses of the vocal tract. Together, these two parameters determine the formants of speech.
During speech production, speech resonance frequency and bandwidth may change quickly, on the order of a few milliseconds. In most cases, the speech content of a speech signal is a function of sequential speech resonances, so the changes in speech resonances must be captured and analyzed at least as quickly as they change. As such, accurate speech analysis requires simultaneous determination of both the frequency and bandwidth of each speech resonance on the same time scale as speech production, that is, on the order of a few milliseconds. However, the simultaneous determination of frequency and bandwidth of speech resonances on this time scale has proved difficult.
Some previous work in formant estimation has been concerned with finding only the frequency of speech resonances in speech signals. These frequency-oriented methods use the instantaneous frequency for high time-resolution frequency estimates. However, these methods for frequency estimation are limited in flexibility, and do not fully describe the speech resonances.
For example, Nelson, et al., have developed a number of methods, including those described in U.S. Pat. No. 6,577,968 for a “Method of estimating signal frequency,” issued Jun. 10, 2003, to Douglas J. Nelson; U.S. Pat. No. 7,457,756 for a “Method of generating time-frequency signal representation preserving phase information,” issued Nov. 25, 2008, to Douglas J. Nelson and David Charles Smith; and U.S. Pat. No. 7,492,814 for a “Method of removing noise and interference from signal using peak picking,” issued Feb. 17, 2009, to Douglas J. Nelson.
Generally, systems consistent with the Nelson methods (“Nelson-type systems”) use instantaneous frequency to enhance the calculation of a Short-Time Fourier Transform (STFT), a common transform in speech processing. In Nelson-type systems, the instantaneous frequency is calculated as the time-derivative of the phase of a complex signal. Nelson-type systems compute the instantaneous frequency from conjugate products of delayed whole spectra. Having computed the instantaneous frequency of each time-frequency element in the STFT, Nelson-type systems re-map the energy of each element to its instantaneous frequency. This re-mapping results in a concentrated STFT, in which energy previously distributed across multiple frequency bands clusters around the same instantaneous frequency.
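The conjugate-product calculation described above can be illustrated with a minimal sketch. The sketch is not taken from the Nelson patents; the function names, the Hann window, the one-sample delay, and the test tone are all assumptions chosen for illustration. It computes one STFT bin for two frames offset by one sample and reads the instantaneous frequency from the phase of their conjugate product.

```python
import cmath
import math


def stft_bin(x, start, n_win, k):
    """Bin k of a Hann-windowed DFT of the frame x[start:start + n_win]."""
    total = 0j
    for n in range(n_win):
        window = 0.5 - 0.5 * math.cos(2 * math.pi * n / n_win)  # Hann (assumed)
        total += window * x[start + n] * cmath.exp(-2j * math.pi * k * n / n_win)
    return total


def instantaneous_frequency_hz(x, start, n_win, k, fs):
    """Frequency seen in bin k, from the phase advance over a one-sample delay."""
    current = stft_bin(x, start, n_win, k)
    delayed = stft_bin(x, start - 1, n_win, k)
    # The phase of the conjugate product is the phase advance over one sample.
    phase_step = cmath.phase(current * delayed.conjugate())
    return phase_step * fs / (2 * math.pi)


# Illustrative input (assumed): a 1010 Hz tone sampled at 8 kHz.
fs = 8000.0
tone_hz = 1010.0
signal = [math.cos(2 * math.pi * tone_hz * n / fs) for n in range(512)]

# With a 256-sample window, bins 31..33 are centered at 968.75, 1000.0
# and 1031.25 Hz, yet each bin's estimate should land near the tone itself.
estimates = [instantaneous_frequency_hz(signal, 100, 256, k, fs) for k in (31, 32, 33)]
```

Because neighboring bins report nearly the same instantaneous frequency, re-mapping each bin's energy to its estimate concentrates that energy around the underlying component, which is the effect attributed to Nelson-type systems above.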
Auger & Flandrin also developed an approach, which is described in: F. Auger and P. Flandrin, “Improving the readability of time-frequency and time-scale representations by the reassignment method,” IEEE Transactions on Signal Processing 43, no. 5 (May 1995): 1068-1089 (“Auger/Flandrin”). Systems consistent with the Auger/Flandrin approach (“Auger/Flandrin-type systems”) offer an alternative to the concentrated STFT of Nelson-type systems. Generally, Auger/Flandrin-type systems compute several STFTs with different windowing functions. Auger/Flandrin-type systems use the derivative of the window function in the STFT to obtain the time-derivative of the phase, and the conjugate product is normalized by the energy. Auger/Flandrin-type systems yield a more exact solution for the instantaneous frequency than Nelson-type systems, because the phase derivative is obtained from the derivative of the window rather than estimated by a finite difference in the discrete implementation.
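The following minimal sketch illustrates the window-derivative calculation for a single STFT bin. It is an illustration under stated assumptions rather than the Auger/Flandrin implementation: the function name, the Hann window with its analytically differentiated counterpart, the frame-relative phase convention, and the test tone are hypothetical choices. Under that convention, the frequency correction is the imaginary part of the conjugate product of the two STFTs, normalized by the energy.

```python
import cmath
import math


def reassigned_frequency_hz(x, start, n_win, k, fs):
    """Reassigned frequency for bin k using a window and its time-derivative."""
    omega_k = 2 * math.pi * k / n_win  # bin center, radians per sample
    x_h = 0j   # STFT computed with the window h
    x_dh = 0j  # STFT computed with the derivative window dh/dn
    for n in range(n_win):
        angle = 2 * math.pi * n / n_win
        h = 0.5 - 0.5 * math.cos(angle)            # Hann window (assumed)
        dh = (math.pi / n_win) * math.sin(angle)   # its analytic derivative
        kernel = cmath.exp(-1j * omega_k * n)
        x_h += x[start + n] * h * kernel
        x_dh += x[start + n] * dh * kernel
    energy = abs(x_h) ** 2
    # Conjugate product normalized by the energy; no finite difference in time.
    omega_hat = omega_k - (x_dh * x_h.conjugate()).imag / energy
    return omega_hat * fs / (2 * math.pi)


# Illustrative input (assumed): a 1010 Hz tone sampled at 8 kHz.
fs = 8000.0
signal = [math.cos(2 * math.pi * 1010.0 * n / fs) for n in range(512)]

# Bin 32 of a 256-sample window is centered at 1000 Hz.
estimate = reassigned_frequency_hz(signal, 100, 256, 32, fs)
```

Only one frame is needed here, in contrast to a delayed-spectrum estimate, because the time-derivative is carried by the second window rather than by a difference between frames.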
However, as extensions of STFT approaches, both Nelson-type and Auger/Flandrin-type systems lack the necessary flexibility to model human speech effectively. For example, the transforms of both Nelson-type and Auger/Flandrin-type systems fix a single window length and frequency spacing for the entire STFT, which limits the ability to optimize the filter bank for speech signals. Moreover, while both types find the instantaneous frequencies of signal components, neither type finds the instantaneous bandwidths of the signal components. As such, both the Nelson-type and Auger/Flandrin-type approaches suffer from significant drawbacks that limit their usefulness in speech processing.
Gardner and Magnasco describe an alternate approach in: T. J. Gardner and M. O. Magnasco, “Instantaneous frequency decomposition: An application to spectrally sparse sounds with fast frequency modulations,” The Journal of the Acoustical Society of America 117, no. 5 (2005): 2896-2903 (“Gardner/Magnasco”). Systems consistent with the Gardner/Magnasco approach (“Gardner/Magnasco-type systems”) use a highly redundant complex filter bank, with the energy from each filter remapped to its instantaneous frequency, similar to the Nelson approach above. Gardner/Magnasco-type systems also use several other criteria to further enhance the frequency resolution of the representation.
Specifically, Gardner/Magnasco-type systems discard filters with a center frequency far from the estimated instantaneous frequency, which can reduce the frequency estimation error from filters not centered on the signal component frequency. Gardner/Magnasco-type systems also use an amplitude threshold to remove low-energy frequency estimates, and optimize the bandwidths of filters in a filter bank to maximize the consensus of the frequency estimates of adjacent filters. Gardner/Magnasco-type systems then use consensus as a measure of the quality of the analysis, where high consensus across filters indicates a good frequency estimate.
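The two pruning criteria described above, distance from the filter center and amplitude, can be sketched as follows. The function name, the threshold values, and the sample numbers are hypothetical and serve only to make the selection step concrete; they are not drawn from the Gardner and Magnasco paper, and the bandwidth optimization and consensus measure are not shown.

```python
def prune_estimates(centers_hz, inst_freqs_hz, amplitudes,
                    max_offset_hz=150.0, min_amplitude=0.05):
    """Keep (center, estimate, amplitude) triples that pass both criteria.

    max_offset_hz and min_amplitude are illustrative thresholds (assumed).
    """
    kept = []
    for center, estimate, amplitude in zip(centers_hz, inst_freqs_hz, amplitudes):
        if abs(estimate - center) > max_offset_hz:
            continue  # filter is not centered near the component it reports
        if amplitude < min_amplitude:
            continue  # low-energy estimate, treated as unreliable
        kept.append((center, estimate, amplitude))
    return kept


# Made-up per-filter outputs for a single time sample.
centers = [800.0, 900.0, 1000.0, 1100.0, 1200.0]
inst_freqs = [1008.0, 1009.0, 1010.0, 1011.0, 1500.0]
amplitudes = [0.01, 0.4, 1.0, 0.5, 0.2]

kept = prune_estimates(centers, inst_freqs, amplitudes)
kept_freqs = [estimate for _, estimate, _ in kept]
```

In this made-up example the filters at 800 Hz and 1200 Hz are discarded because their estimates lie more than 150 Hz from their centers, leaving three adjacent filters whose estimates agree closely.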
However, Gardner/Magnasco-type systems also suffer from significant drawbacks. First, Gardner/Magnasco-type systems do not account for instantaneous bandwidth calculation, thus missing an important part of the speech formant. Second, a consensus approach can lock in an error when a group of frequency estimates are briefly consistent with each other, but nevertheless provide inaccurate estimates of the true resonance frequency. For both of these reasons, Gardner/Magnasco-type systems offer limited usefulness in speech processing applications, particularly those applications that require higher accuracy over a short time scale.
While the above methods attempt to determine instantaneous frequency without also determining instantaneous bandwidth, Potamianos and Maragos developed a method for obtaining both the frequency and bandwidth of formants of a speech signal. The Potamianos/Maragos approach is described in: Alexandros Potamianos and Petros Maragos, “Speech formant frequency and bandwidth tracking using multiband energy demodulation,” The Journal of the Acoustical Society of America 99, no. 6 (1996): 3795-3806 (“Potamianos/Maragos”).
Systems consistent with the Potamianos/Maragos approach (“Potamianos/Maragos-type systems”) use a filter bank of real-valued Gabor filters, and calculate the instantaneous frequency at each time-sample using an energy separation algorithm to demodulate the signal into an instantaneous frequency and amplitude envelope. In Potamianos/Maragos-type systems, the instantaneous frequency is then time-averaged to give a short-time estimate of the frequency, with a time window of about 10 ms. In Potamianos/Maragos-type systems, the bandwidth estimate is simply the standard deviation of the instantaneous frequency over the time window.
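The demodulate-then-average procedure described above can be sketched as follows. The sketch assumes the DESA-2 variant of the energy separation algorithm, which the description above does not specify, and it assumes the input has already passed through one band-pass filter of the bank, so the Gabor filtering step is omitted. The function names and the test tone are hypothetical.

```python
import math
import statistics


def teager_energy(x, n):
    """Discrete Teager-Kaiser energy operator at sample n."""
    return x[n] * x[n] - x[n - 1] * x[n + 1]


def desa2_frequency_hz(x, n, fs):
    """Instantaneous frequency at sample n via DESA-2 (assumed variant)."""
    # y[i] = x[i + 1] - x[i - 1], evaluated at i = n - 1, n, n + 1.
    y_prev = x[n] - x[n - 2]
    y_here = x[n + 1] - x[n - 1]
    y_next = x[n + 2] - x[n]
    psi_y = y_here * y_here - y_prev * y_next
    psi_x = teager_energy(x, n)  # a practical version would guard psi_x == 0
    ratio = 1.0 - psi_y / (2.0 * psi_x)
    ratio = max(-1.0, min(1.0, ratio))  # keep acos in its domain
    return 0.5 * math.acos(ratio) * fs / (2 * math.pi)


# Illustrative band-passed input (assumed): a 1000 Hz tone sampled at 8 kHz.
fs = 8000.0
signal = [math.cos(2 * math.pi * 1000.0 * n / fs + 0.3) for n in range(100)]

# A 10 ms window is 80 samples at 8 kHz.
track = [desa2_frequency_hz(signal, n, fs) for n in range(2, 82)]
frequency_hz = statistics.mean(track)    # short-time frequency estimate
bandwidth_hz = statistics.pstdev(track)  # bandwidth as the standard deviation
```

For a steady tone the instantaneous frequency does not vary, so the standard deviation is close to zero. Both outputs depend on the full 10 ms window, since the mean and the standard deviation are defined only over a span of samples.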
Thus, while Potamianos/Maragos-type systems offer the flexibility of a filter bank (rather than a transform), Potamianos/Maragos-type systems only indirectly estimate the instantaneous bandwidth by using the standard deviation. That is, because the standard deviation requires a time average, the bandwidth estimate in Potamianos/Maragos-type systems is not instantaneous. Because the bandwidth estimate is not instantaneous, the frequency and bandwidth estimates must be averaged over longer times than are practical for real-time speech recognition. As such, the Potamianos/Maragos-type systems also fail to determine speech formants on the time scale preferred for real-time speech processing.