The invention is directed to encoding and decoding speech.
Speech encoding and decoding have a large number of applications and have been studied extensively. In general, one type of speech coding, referred to as speech compression, seeks to reduce the data rate needed to represent a speech signal without substantially reducing the quality or intelligibility of the speech. Speech compression techniques may be implemented by a speech coder.
A speech coder is generally viewed as including an encoder and a decoder. The encoder produces a compressed stream of bits from a digital representation of speech, such as may be generated by converting an analog signal produced by a microphone using an analog-to-digital converter. The decoder converts the compressed bit stream into a digital representation of speech that is suitable for playback through a digital-to-analog converter and a speaker. In many applications, the encoder and decoder are physically separated, and the bit stream is transmitted between them using a communication channel.
A key parameter of a speech coder is the amount of compression the coder achieves, which is measured by the bit rate of the stream of bits produced by the encoder. The bit rate of the encoder is generally a function of the desired fidelity (i.e., speech quality) and the type of speech coder employed. Different types of speech coders have been designed to operate at high rates (greater than 8 kbs), mid-rates (3-8 kbs) and low rates (less than 3 kbs). Recently, mid-rate and low-rate speech coders have received attention with respect to a wide range of mobile communication applications (e.g., cellular telephony, satellite telephony, land mobile radio, and in-flight telephony). These applications typically require high quality speech and robustness to artifacts caused by acoustic noise and channel noise (e.g., bit errors).
Vocoders are a class of speech coders that have been shown to be highly applicable to mobile communications. A vocoder models speech as the response of a system to excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders, homomorphic vocoders, channel vocoders, sinusoidal transform coders ("STC"), multiband excitation ("MBE") vocoders, and improved multiband excitation ("IMBE.TM.") vocoders. In these vocoders, speech is divided into short segments (typically 10-40 ms) with each segment being characterized by a set of model parameters. These parameters typically represent a few basic elements of each speech segment, such as the segment's pitch, voicing state, and spectral envelope. A vocoder may use one of a number of known representations for each of these parameters. For example the pitch may be represented as a pitch period, a fundamental frequency, or a long-term prediction delay. Similarly the voicing state may be represented by one or more voiced/unvoiced decisions, by a voicing probability measure, or by a ratio of periodic to stochastic energy. The spectral envelope is often represented by an all-pole filter response, but also may be represented by a set of spectral magnitudes or other spectral measurements.
Since they permit a speech segment to be represented using only a small number of parameters, model-based speech coders, such as vocoders, typically are able to operate at medium to low data rates. However, the quality of a model-based system is dependent on the accuracy of the underlying model. Accordingly, a high fidelity model must be used if these speech coders are to achieve high speech quality.
One speech model which has been shown to provide high quality speech and to work well at medium to low bit rates is the Multi-Band Excitation (MBE) speech model developed by Griffin and Lim. This model uses a flexible voicing structure that allows it to produce more natural sounding speech, and which makes it more robust to the presence of acoustic background noise. These properties have caused the MBE speech model to be employed in a number of commercial mobile communication applications.
The MBE speech model represents segments of speech using a fundamental frequency, a set of binary voiced/unvoiced (V/UV) metrics, and a set of spectral magnitudes. A primary advantage of the MBE model over more traditional models is in the voicing representation. The MBE model generalizes the traditional single V/UV decision per segment into a set of decisions, each representing the voicing state within a particular frequency band. This added flexibility in the voicing model allows the MBE model to better accommodate mixed voicing sounds, such as some voiced fricatives. In addition this added flexibility allows a more accurate representation of speech that has been corrupted by acoustic background noise. Extensive testing has shown that this generalization results in improved voice quality and intelligibility.
The encoder of an MBE-based speech coder estimates the set of model parameters for each speech segment. The MBE model parameters include a fundamental frequency (the reciprocal of the pitch period); a set of V/UV metrics or decisions that characterize the voicing state; and a set of spectral magnitudes that characterize the spectral envelope. After estimating the MBE model parameters for each segment, the encoder quantizes the parameters to produce a frame of bits. The encoder optionally may protect these bits with error correction/detection codes before interleaving and transmitting the resulting bit stream to a corresponding decoder.
The decoder converts the received bit stream back into individual frames. As part of this conversion, the decoder may perform deinterleaving and error control decoding to correct or detect bit errors. The decoder then uses the frames of bits to reconstruct the MBE model parameters, which the decoder uses to synthesize a speech signal that perceptually resembles the original speech to a high degree. The decoder may synthesize separate voiced and unvoiced components, and then may add the voiced and unvoiced components to produce the final speech signal.
In MBE-based systems, the encoder uses a spectral magnitude to represent the spectral envelope at each harmonic of the estimated fundamental frequency. Typically each harmonic is labeled as being either voiced or unvoiced, depending upon whether the frequency band containing the corresponding harmonic has been declared voiced or unvoiced. The encoder then estimates a spectral magnitude for each harmonic frequency. When a harmonic frequency has been labeled as being voiced, the encoder may use a magnitude estimator that differs from the magnitude estimator used when a harmonic frequency has been labeled as being unvoiced. At the decoder, the voiced and unvoiced harmonics are identified, and separate voiced and unvoiced components are synthesized using different procedures. The unvoiced component may be synthesized using a weighted overlap-add method to filter a white noise signal. The filter is set to zero all frequency regions declared voiced while otherwise matching the spectral magnitudes labeled unvoiced. The voiced component is synthesized using a tuned oscillator bank, with one oscillator assigned to each harmonic that has been labeled as being voiced. The instantaneous amplitude, frequency and phase are interpolated to match the corresponding parameters at neighboring segments.
MBE-based speech coders include the IMBE.TM. speech coder and the AMBE.RTM. speech coder. The AMBE.RTM. speech coder was developed as an improvement on earlier MBE-based techniques. It includes a more robust method of estimating the excitation parameters (fundamental frequency and V/UV decisions) which is better able to track the variations and noise found in actual speech. The AMBE.RTM. speech coder uses a filterbank that typically includes sixteen channels and a non-linearity to produce a set of channel outputs from which the excitation parameters can be reliably estimated. The channel outputs are combined and processed to estimate the fundamental frequency and then the channels within each of several (e.g., eight) voicing bands are processed to estimate a V/UV decision (or other voicing metric) for each voicing band.
The AMBE.RTM. speech coder also may estimate the spectral magnitudes independently of the voicing decisions. To do this, the speech coder computes a fast Fourier transform ("FFT") for each windowed subframe of speech and then averages the energy over frequency regions that are multiples of the estimated fundamental frequency. This approach may further include compensation to remove from the estimated spectral magnitudes artifacts introduced by the FFT sampling grid.
The AMBE.RTM. speech coder also may include a phase synthesis component that regenerates the phase information used in the synthesis of voiced speech without explicitly transmitting the phase information from the encoder to the decoder. Random phase synthesis based upon the V/UV decisions may be applied, as in the case of the IMBE.TM. speech coder. Alternatively, the decoder may apply a smoothing kernel to the reconstructed spectral magnitudes to produce phase information that may be perceptually closer to that of the original speech than is the randomly-produced phase information.
The techniques noted above are described, for example, in Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972, pages 378-386 (describing a frequency-based speech analysis-synthesis system); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984 (describing speech coding in general); U.S. Pat. No. 4,885,790 (describing a sinusoidal processing method); U.S. Pat. No. 5,054,072 (describing a sinusoidal coding method); Almeida et al., "Nonstationary Modeling of Voiced Speech", IEEE TASSP, Vol. ASSP-31, No. 3, June 1983, pages 664-677 (describing harmonic modeling and an associated coder); Almeida et al., "Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme", IEEE Proc. ICASSP 84, pages 27.5.1-27.5.4 (describing a polynomial voiced synthesis method); Quatieri et al., "Speech Transformations Based on a Sinusoidal Representation", IEEE TASSP, Vol, ASSP34, No. 6, December. 1986, pages 1449-1986 (describing an analysis-synthesis technique based on a sinusoidal representation); McAulay et al., "Mid-Rate Coding Based on a Sinusoidal Representation of Speech", Proc. ICASSP 85, pages 945-948, Tampa, Fla., March 26-29, 1985 (describing a sinusoidal transform speech coder); Griffin, "Multiband Excitation Vocoder", Ph.D. Thesis, M.I.T, 1987 (describing the Multi-Band Excitation (MBE) speech model and an 8000 bps MBE speech coder); Hardwick, "A 4.8 kbps Multi-Band Excitation Speech Coder", SM. Thesis, M.I.T, May 1988 (describing a 4800 bps Multi-Band Excitation speech coder); Telecommunications Industry Association (TIA), "APCO Project 25 Vocoder Description", Version 1.3, Jul. 15, 1993, IS102BABA (describing a 7.2 kbps IMBE.TM. speech coder for APCO Project 25 standard); U.S. Pat. No. 5,081,681 (describing IMBEM.TM. random phase synthesis); U.S. Pat. No. 5,247,579 (describing a channel error mitigation method and formant enhancement method for MBE-based speech coders); U.S. Pat. No. 5,226,084 (describing quantization and error mitigation methods for MBE-based speech coders); U.S. Pat. No. 5,517,511 (describing bit prioritization and FEC error control methods for MBE-based speech coders).