There has been a substantial amount of effort in developing toll-quality speech coders that operate below 4 kbps. Most of the coders in this bit-range are parametric in nature; One of the most prominent among these is the Multiband Excitation (MBE) Coder developed by Griffin and Lim. The MBE scheme is derived from mainstream sinusoidal coding (McAulay et al.), where voiced speech is reproduced as a weighted sum of sine waves at the harmonics of a pitch frequency and unvoiced speech bands are reproduced as bandlimited white noise with appropriate amplitudes. The encoding is performed by splitting the input speech into frequency bands centered around the harmonics, and recording the respective spectral amplitudes based on the outcome of corresponding voicing decisions (assuming the excitation is a sinusoid or narrowband noise for the voiced and unvoiced cases, respectively).
The MBE coding scheme has the potential to produce high quality (in terms of intelligibility and naturalness) output speech (Tian et al.) at very low bit rates. The parameters used in the MBE coding scheme are also resistant to moderate levels of noise (15 dB wideband white noise). There are, however, some undesirable characteristics of the scheme that severely hamper the deployment of MBE-based codecs for the purpose of coding speech produced in noisy ambient conditions (above 10 dB wideband noise) and/or speech received via transmission paths, such as a telephone channel.
Under transmission path conditions, and in particular, under telephone-channel-bandwidth (TCB) conditions, the baseband frequencies are grossly attenuated, as shown in FIG. 3. This frequently results in the loss of a pitch component and the components first one or two harmonics for low-pitched speakers, a phenomenon which greatly hampers pitch detection. Pitch detection also becomes increasingly faulty above 15 dB wideband (ambient) noise in the input speech signal. However, all parameter estimates in the MBE scheme are, in one way or the other, derived via a spectral matching mechanism which in turn crucially depends on the harmonic structure created using the pitch parameter. The pitch parameter, therefore is the pivotal element in the parameter estimation scheme and, consequently, errors in pitch detection frequently lead to the corruption of other parameters. As a result, the MBE codec decoded output is prone to several audible distortions such as voice-breaks, screeches, clicks, varying levels of hoarseness, and occasional synthetic tonality, for speech having transmission path characteristics, such as TCB and/or noisy input speech.
It has been confirmed, through repeated tests for speech decoded from TCB inputs, that voice-breaks observed are frequently associated with pitch region (period) halving, while hoarseness is associated with undervoicing. These problems are dominant for low-pitched speakers. Tonality, on the other hand, results from overvoicing.
One spectral amplitude quantization technique involves intermediate spectral smoothing (e.g. if LPC is used, as suggested by Kondoz, a screeching effect is produced for pitch doublings, although such occurrences are relatively infrequent).
The robustness problems discussed above have greatly limited the deployment of MBE coders in real-life situations, except for mobile communications, which have significantly lower quality demands. In a broad sense, these problems have deterred the achievement of toll-quality speech (implying indistinguishable from telephone speech quality) for MBE coders.
This is unfortunate since MBE coders, which have high compression ratios, may be used in a number of applications (primarily storage applications) that are strapped for memory resources. The MBE coders provide twice, and in some cases three times the speech storage capacity over conventional CELP coders.
CELP coders imply waveform coding (as opposed to spectral coding in MBE), and degrade miserably when operating at rates below 5 kbps. For clean 4 kHz bandwidth speech (i.e. sampled at 8 kHz, but not subject to the exact telephone-channel frequency response), MBE codecs deliver virtually the same output quality, at 2-3 kbps, as higher bit rate (5-6 kbps) CELP codecs. However, because of the earlier cited MBE coder problems, the latter continue to be preferred for use in voice communication and storage applications that assume noise and transmission path characteristics, such as telephone channel bandwidth conditions (CELP codecs degrade gracefully under either condition), under normal operating conditions.
Quality degradation in MBE codecs for noisy and transmission path speech, such as telephone-channel speech, has been persistent since its advent. A root cause analysis of the reasons for the distortions induced under the above-mentioned conditions were presented by Bhattacharya et al. in 1999, but researchers have been aware of the existence of the problems for a long time.
Researchers, thus far, have attempted to provide robustness to MBE coders by changing the basic MBE codec modules. They have essentially suggested alternative methods for the robust estimation of pitch and voicing parameters.
These alternate attempts to compensate for transmission path characteristics, such as telephone-channel characteristics, by inverse filtering and to compensate for noise in the input signal by spectral subtraction have not been popular mainly because of the associated implementation problems. In the former case, designing a stable inverse filter for the telephone channel becomes an insurmountable problem when conventional design methods are applied. This is because the telephone channel inverse characteristic involves a major gently sloping segment accompanied by sharp peaks at either end, and deviation from the expected curve becomes audible at virtually all frequencies. In the latter case, the noise compensation process breeds a tonal noise called musical noise, which appears at the decoded output as an unacceptable distortion.
Previous solutions to the projected problem have only been marginally effective because the basic speech signal is often highly corrupted and because the basic speech signal produces a spurious signal with parameter values lying within expected bounds. A common example, in this regard, is where a multiple of the pitch frequency becomes the dominant lowest harmonic and suppresses the actual fundamental frequency under telephone-channel bandwidth conditions. The amount of parametric corruption varies within wide limits (e.g. depending on the loudness and type of noise) further complicating the robust-estimation process.
In addition, one should note that there have not been any estimation processes that have been 100% reliable even under absolutely clean input speech conditions. The pitch estimation accuracy of the invention, when used with the MBE model, decreases gracefully from a 0.2% coarse error rate at 30 dB ambient (white) noise to a 5% coarse error rate at 10 dB ambient noise.
Publications relevant to processing signals representing speech include: McAulay et al., “Mid-Rate Coding based on a sinusoidal representation of speech”, Proc. ICASSP85, pp. 945-948, Tampa, Fla., Mar. 26-29, 1985 (discusses the sinusoidal transform speech coder); Griffin, “Multi-band Excitation Vocoder”, Ph.D. Thesis, M.I.T, 1987, (Discusses the Multi-Band Excitation (MBE) speech model and an 8000 kbps MBE speech coder); SM. Thesis, M.I.T, May 1988, (discusses a 4800 bps Multi-Band Excitation speech coder); McAulay et al., “Computationally efficient Sine-Wave Synthesis and its applications to Sinusoidal Transform coding”, Proc. ICASSP 88, New York, N.Y., pp. 370-373, April 1988, (discusses frequency domain voiced synthesis); D. W. Griffin, J. S. Lim, “Multi-band Excitation Vocoder,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1223-1235, August 1988; P. Bhattacharya, M. Singhal and Sangeetha, “An analysis of the weaknesses of the MBE coding scheme,” IEEE international conf. on personal wireless communications, 1999; Tian Wang, Kun Tang, Chonxgi Feng “A high quality MBE-LPC-FE Speech coder at 2.4 kbps and 1.2 kbps, Dept. of Electronic Engineering, Tsinghua University, Beijing, 100084, P. R. Chinna; Engin Erzin, Arun kumar and Allen Gersho “Natural quality variable-rate spectral speech coding below 3.0 kbps, Dept. of Electrical and Computer Eng., University of California, Santa Barbara, Calif., 93106 USA; INMARSAT M voice codec, Digital voice systems Inc. 1991, version 3.0 August 1991; A. M. Kondoz, Digital speech coding for low bit rate communication systems, John Wiley and Sons; Telecommunications Industry Association (TIA) “APCO project 25 Vocoder description” Version 1.3, Jul. 15, 1993, IS102BABA (discusses 7.2 kbps IMBE speech coder for APCO project 25 standard); Telephone transmission quality transmission standards, ITU Recommendation p. 48; U.S. Pat. No. 5,081,681 (discloses MBE random phase synthesis); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984, (discussing the speech coding in general); U.S. Pat. No. 4,885,790 (discloses sinusoidal processing method); Makhoul, “A mixed-source model for speech compression and synthesis”, IEEE (1978), pp. 163-166 ICASS P78; Griffin et al. “Signal estimation from modified short-time fourier transform”, IEEE transactions on Acoustics, speech and signal processing, vol. ASSP-32, No. 2, April 1984, pp. 236-243; Hardwick, “A 4.8 kbps multi-band excitation speech coder”, S. M. Thesis, M.I.T., May 1988; Almeida et al., “Harmonic coding: A low bit rate, good quality speech coding technique,” IEEE (CH 1746-7/82/000 1684) pp. 1664-1667 (1982); Digital voice systems, Inc. “The DVSI IMBE speech compression system,” advertising brochure (May 12, 1993); Hardwick et al., “The application of the IMBE speech coder to Mobile communications,” IEEE (1991), pp. 249-252 ICASSP 91 May 1991; Portnoff, “Short-time fourier analysis of samples speech”, IEEE transactions on accoustics, speech and signal processing, vol. ASSP-29, No-3, June 1981, pp. 324-333; Akaike H., “Power spectrum estimation through auto-regressive model fitting,” Ann. Inst. Statist. Math., Vol. 21, pp. 407-419, 1969; Anderson, T. W., “The statistical analysis of time series,” Wiley, 1971; Durbin, J., “The fitting of time-series models,” Rev. Inst. Int. Statist., Vol. 28, pp. 233-243, 1960; Makhoul J., “Linear Prediction: a tutorial review,” Proc. IEEE, Vol. 63, pp. 561-580, April 1975; Kay S. M., “Modern spectral estimation: theory and application,” Prentice Hall, 1988; Mohanty M., “Random signals estimation and identification,” Van Nostrand Reinhold, 1986. The content of the publications listed above are incorporated herein by reference.