Compressing speech to low bit rates while maintaining high quality is an important problem, whose solution has many applications, such as memory-constrained systems. One class of compression schemes (coders) used to solve this problem is multi-band excitation (MBE), a scheme derived from sinusoidal coding.
The MBE scheme uses a parametric model that segments speech into frames. For each segment of speech, excitation and system parameters are estimated. The excitation parameters include pitch frequency values, voiced/unvoiced decisions and, in the case of voiced frames, the amount of voicing. The system parameters include spectral magnitude and spectral amplitude values, which are encoded based on whether the excitation is sinusoidal or harmonic.
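The analysis structure described above can be illustrated with a short sketch. The field names, frame length, and hop size below are hypothetical choices for illustration (20 ms frames at 8 kHz), not values prescribed by the MBE scheme itself.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MBEFrameParams:
    """Illustrative per-frame parameter set; names are hypothetical."""
    pitch_hz: float                   # excitation: pitch frequency estimate
    band_voiced: List[bool]           # excitation: voiced/unvoiced decision per band
    spectral_amplitudes: List[float]  # system: one amplitude per harmonic/band

def segment_into_frames(samples, frame_len=160, hop=160):
    """Split a speech signal into fixed-length analysis frames
    (160 samples = 20 ms at 8 kHz; a common but illustrative choice)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```

Each frame would then be analyzed to fill in one `MBEFrameParams` record, which is what the encoder quantizes and transmits.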
Though coders based on this model have been successful in synthesizing intelligible speech at low bit rates, they have not been successful in synthesizing high-quality speech, mainly because of incorrect parameter estimation. As a result, these coders have not been widely used. Some of the problems encountered are described below.
In the MBE model, parameters have a strong dependence on pitch frequency because all other parameters are estimated assuming that the pitch frequency has been accurately computed.
Most sinusoidal coders, including MBE-based coders, depend on accurate reproduction of the harmonic structure of the spectra of voiced speech segments. Consequently, accurate estimation of the pitch frequency is critical, because the harmonics are multiples of the pitch frequency.
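A minimal sketch of pitch estimation conveys why errors propagate to every harmonic. The autocorrelation-peak approach below is a simplification for illustration; actual MBE analysis refines the estimate by fitting a harmonic spectrum to the measured spectrum. The sampling rate and pitch search range are assumed values.

```python
import numpy as np

def estimate_pitch_autocorr(frame, fs=8000, f_lo=60.0, f_hi=400.0):
    """Estimate pitch frequency by locating the autocorrelation peak
    within a plausible lag range. An illustrative sketch only."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # One-sided autocorrelation: ac[k] is the correlation at lag k.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_lo = int(fs / f_hi)                       # shortest lag (highest pitch)
    lag_hi = min(int(fs / f_lo), len(ac) - 1)     # longest lag (lowest pitch)
    best = lag_lo + int(np.argmax(ac[lag_lo:lag_hi + 1]))
    return fs / best
```

If this estimate is off (e.g. a pitch-doubling error), every harmonic position used in later analysis steps is wrong, which is exactly the sensitivity described above.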
Another important aspect of the MBE scheme is the classification of a segment as a voiced, unvoiced or silence segment. This classification is important because the three types of segments are represented differently, and their representations have differing impacts on the overall compression efficiency of the scheme. Previous schemes use inaccurate measures, such as the zero-crossing rate and auto-correlation, for these decisions.
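The kind of classifier the text criticizes can be sketched as follows. The thresholds and energy floor are illustrative assumptions, not values from any cited scheme; the point is that such fixed thresholds on the zero-crossing rate and autocorrelation peak are fragile in practice.

```python
import numpy as np

def classify_segment(frame, fs=8000, energy_floor=1e-4):
    """Rough voiced/unvoiced/silence decision using the zero-crossing rate
    and normalized autocorrelation measures. Thresholds are illustrative."""
    x = np.asarray(frame, dtype=float)
    if np.mean(x ** 2) < energy_floor:
        return "silence"
    # Zero-crossing rate: fraction of samples where the sign flips.
    zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2.0
    x0 = x - x.mean()
    ac = np.correlate(x0, x0, mode="full")[len(x0) - 1:]
    lag_lo, lag_hi = int(fs / 400), int(fs / 60)
    peak = np.max(ac[lag_lo:lag_hi + 1]) / (ac[0] + 1e-12)
    # Voiced speech: strong periodicity, low zero-crossing rate.
    return "voiced" if (peak > 0.3 and zcr < 0.3) else "unvoiced"
```

A misclassification here is costly because it selects the wrong representation for the entire segment.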
MBE-based coders also suffer from undesirable perceptual effects arising from saturation caused by unbalanced output waveforms. This imbalance is caused by the absence of phase information in the decoders in use.
Publications relevant to voice encoding include: McAulay et al., "Mid-Rate Coding Based on a Sinusoidal Representation of Speech," Proc. ICASSP 85, pp. 945-948, Tampa, Fla., Mar. 26-29, 1985 (discusses the sinusoidal transform speech coder); Griffin, "Multi-band Excitation Vocoder," Ph.D. Thesis, M.I.T., 1987 (discusses the Multi-Band Excitation (MBE) speech model and an 8000 bps MBE speech coder); Hardwick, S.M. Thesis, M.I.T., May 1988 (discusses a 4800 bps Multi-Band Excitation speech coder); McAulay et al., "Computationally Efficient Sine-Wave Synthesis and its Applications to Sinusoidal Transform Coding," Proc. ICASSP 88, New York, N.Y., pp. 370-373, April 1988 (discusses frequency domain voiced synthesis); D. W. Griffin and J. S. Lim, "Multi-band Excitation Vocoder," IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, pp. 1223-1235, August 1988; Tian Wang, Kun Tang and Chongxi Feng, "A High Quality MBE-LPC-FE Speech Coder at 2.4 kbps and 1.2 kbps," Dept. of Electronic Engineering, Tsinghua University, Beijing, 100084, P. R. China; Engin Erzin, Arun Kumar and Allen Gersho, "Natural Quality Variable-Rate Spectral Speech Coding Below 3.0 kbps," Dept. of Electrical & Computer Eng., University of California, Santa Barbara, Calif., 93106, USA; INMARSAT M Voice Codec, Digital Voice Systems Inc., version 3.0, August 1991; A. M. Kondoz, Digital Speech Coding for Low Bit Rate Communication Systems, John Wiley and Sons; Telecommunications Industry Association (TIA), "APCO Project 25 Vocoder Description," Version 1.3, Jul. 15, 1993, IS102BABA (discusses the 7.2 kbps IMBE speech coder for the APCO Project 25 standard); U.S. Pat. No. 5,081,681 (discloses MBE random phase synthesis); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984 (discusses speech coding in general); U.S. Pat. No. 4,885,790 (discloses a sinusoidal processing method); Makhoul, "A Mixed-Source Model for Speech Compression and Synthesis," Proc. ICASSP 78, pp. 163-166, 1978; Griffin et al.,
"Signal Estimation from Modified Short-Time Fourier Transform," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-32, No. 2, April 1984, pp. 236-243; Hardwick, "A 4.8 kbps Multi-Band Excitation Speech Coder," S.M. Thesis, M.I.T., May 1988; P. Bhattacharya, M. Singhal and Sangeetha, "An Analysis of the Weaknesses of the MBE Coding Scheme," IEEE International Conf. on Personal Wireless Communications, 1999; Almeida et al., "Harmonic Coding: A Low Bit Rate, Good Quality Speech Coding Technique," IEEE (CH 1746-7/82/000 1684), pp. 1664-1667, 1982; Digital Voice Systems, Inc., "The DVSI IMBE Speech Compression System," advertising brochure, May 12, 1993; Hardwick et al., "The Application of the IMBE Speech Coder to Mobile Communications," Proc. ICASSP 91, pp. 249-252, May 1991; Portnoff, "Short-Time Fourier Analysis of Sampled Speech," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-29, No. 3, June 1981, pp. 324-333; W. B. Kleijn and K. K. Paliwal, Speech Coding and Synthesis; Akaike, H., "Power Spectrum Estimation through Auto-Regressive Model Fitting," Ann. Inst. Statist. Math., vol. 21, pp. 407-419, 1969; Anderson, T. W., The Statistical Analysis of Time Series, Wiley, 1971; Durbin, J., "The Fitting of Time-Series Models," Rev. Inst. Int. Statist., vol. 28, pp. 233-243, 1960; Makhoul, J., "Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63, pp. 561-580, April 1975; Kay, S. M., Modern Spectral Estimation: Theory and Application, Prentice Hall, 1988; Mohanty, M., Random Signals Estimation and Identification, Van Nostrand Reinhold, 1986. The contents of these references are incorporated herein by reference.
Various methods have been described for pitch tracking, but each method has its respective limitations. In "Processing a speech signal with estimated pitch" (U.S. Pat. No. 5,226,108), Hardwick et al. describe a sub-multiple check method for pitch, a pitch tracking algorithm for estimating the correct pitch frequency, and a voiced/unvoiced decision for each band based on an energy threshold value.
In "Voiced/unvoiced estimation of an acoustic signal" (U.S. Pat. No. 5,216,747), Hardwick et al. describe a method for estimating voiced/unvoiced classifications for each band. The estimation, however, is based on a threshold value that depends upon the pitch and the center frequency of each band. Similarly, in the INMARSAT M voice codec (Digital Voice Systems Inc., version 3.0, August 1991), the voiced/unvoiced decision for each band depends upon threshold values, which in turn depend upon the energies of the current and previous frames. Occasionally, these parameters are not updated properly, which results in incorrect decisions for some bands and deteriorated output speech quality.
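The general shape of such threshold-based per-band decisions can be sketched as follows. The function, its arguments, and the threshold value are illustrative assumptions, not the specific procedure of either cited scheme: a band is declared voiced when the normalized error between the measured spectrum and a harmonic model spectrum falls below a threshold.

```python
import numpy as np

def band_voicing_by_threshold(spectrum, band_edges, model_spectrum, threshold=0.2):
    """Per-band voiced/unvoiced decision by thresholding the normalized
    fit error between measured and harmonic-model spectra (illustrative)."""
    decisions = []
    for lo, hi in band_edges:
        s = np.asarray(spectrum[lo:hi], dtype=float)
        m = np.asarray(model_spectrum[lo:hi], dtype=float)
        err = np.sum((s - m) ** 2) / (np.sum(s ** 2) + 1e-12)
        decisions.append(bool(err < threshold))
    return decisions
```

Because the threshold is fixed, a band whose error drifts with frame energy or pitch can flip between voiced and unvoiced from frame to frame, producing the quality degradation described above.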
In "Synthesis of MBE based coded speech using regenerated phase information" (U.S. Pat. No. 5,701,390), Griffin et al. describe a method for generating a voiced-component phase in speech synthesis. The phase is estimated from the spectral envelope of the voiced component (e.g., from the shape of the spectral envelope in the vicinity of the voiced component). The decoder reconstructs the spectral envelope and voicing information for each of a plurality of frames. The voicing information is used to determine whether the frequency bands of a particular spectrum are voiced or unvoiced. Speech components for voiced frequency bands are synthesized using the regenerated spectral phase information; components for unvoiced frequency bands are generated using other techniques.
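The decoder-side structure described above, voiced bands synthesized from phased harmonics and unvoiced bands from noise-like components, can be illustrated with a toy sketch. This is not the patented method; it is a simplified stand-in in which unvoiced harmonics are approximated by random-phase sinusoids, and all parameter names are hypothetical.

```python
import numpy as np

def synthesize_frame(pitch_hz, amps, band_voiced, phases, fs=8000, n=160):
    """Toy MBE-style frame synthesis: voiced harmonics use supplied phases;
    unvoiced harmonics get random phases as a crude noise substitute."""
    t = np.arange(n) / fs
    out = np.zeros(n)
    rng = np.random.default_rng(0)
    for k, (a, voiced, ph) in enumerate(zip(amps, band_voiced, phases), start=1):
        f = k * pitch_hz                    # k-th harmonic of the pitch
        if f >= fs / 2:                     # stop at the Nyquist frequency
            break
        if voiced:
            out += a * np.cos(2 * np.pi * f * t + ph)
        else:
            # Unvoiced band: randomize the phase to decorrelate the component.
            out += a * np.cos(2 * np.pi * f * t + rng.uniform(0, 2 * np.pi))
    return out
```

The sketch also shows why phase handling matters: with no control over the phases, the harmonic sum can produce an unbalanced waveform of the kind discussed earlier.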
The methods discussed above do not provide solutions to the problems described earlier. The invention presents solutions to these problems and provides significant improvements to the quality of MBE-based speech compression algorithms. For example, the invention presents a novel method for reducing the complexity of unvoiced synthesis at the decoder. It also describes a scheme for making the voiced/unvoiced decision for each band and computing a single Voicing Parameter, which is used to identify the transition point from the voiced to the unvoiced region of the spectrum. A compact spectral amplitude representation is also described.