Speech encoding and decoding have a large number of applications and have been studied extensively. In general, speech coding, which is also known as speech compression, seeks to reduce the data rate needed to represent a speech signal without substantially reducing the quality or intelligibility of the speech. Speech compression techniques may be implemented by a speech coder, which also may be referred to as a voice coder or vocoder.
A speech coder is generally viewed as including an encoder and a decoder. The encoder produces a compressed stream of bits from a digital representation of speech, such as may be generated at the output of an analog-to-digital converter having as an input an analog signal produced by a microphone. The decoder converts the compressed bit stream into a digital representation of speech that is suitable for playback through a digital-to-analog converter and a speaker. In many applications, the encoder and the decoder are physically separated, and the bit stream is transmitted between them using a communication channel.
A key parameter of a speech coder is the amount of compression the coder achieves, which is measured by the bit rate of the stream of bits produced by the encoder. The bit rate of the encoder is generally a function of the desired fidelity (i.e., speech quality) and the type of speech coder employed. Different types of speech coders have been designed to operate at different bit rates. Recently, low-to-medium rate speech coders operating below 10 kbps have received attention with respect to a wide range of mobile communication applications (e.g., cellular telephony, satellite telephony, land mobile radio, and in-flight telephony). These applications typically require high quality speech and robustness to artifacts caused by acoustic noise and channel noise (e.g., bit errors).
Speech is generally considered to be a non-stationary signal having signal properties that change over time. This change in signal properties is generally linked to changes made in the properties of a person's vocal tract to produce different sounds. A sound is typically sustained for some short period, typically 10-100 ms, and then the vocal tract is changed again to produce the next sound. The transition between sounds may be slow and continuous, or the transition may be rapid as in the case of a speech “onset.” This change in signal properties increases the difficulty of encoding speech at lower bit rates since some sounds are inherently more difficult to encode than others and the speech coder must be able to encode all sounds with reasonable fidelity while preserving the ability to adapt to a transition in characteristics of the speech signal. One way to improve the performance of a low-to-medium bit rate speech coder is to allow the bit rate to vary. In variable-bit-rate speech coders, the bit rate for each segment of speech is not fixed, and, instead, is allowed to vary between two or more options depending on various factors, such as user input, system loading, terminal design or signal characteristics.
There have been several main approaches for coding speech at low-to-medium data rates. For example, an approach based around linear predictive coding (LPC) attempts to predict each new frame of speech from previous samples using short and long term predictors. The prediction error is typically quantized using one of several approaches of which CELP and/or multi-pulse are two examples. An advantage of the LPC method is that it has good time resolution, which is helpful for the coding of unvoiced sounds. In particular, plosives and transients benefit from this in that they are not overly smeared in time. However, linear prediction may have difficulty for voiced sounds in that the coded speech tends to sound rough or hoarse due to insufficient periodicity in the coded signal. This problem may be more significant at lower data rates that typically require a longer frame size and for which the long-term predictor is less effective at restoring periodicity.
Another leading approach for low-to-medium rate speech coding is a model-based speech coder or vocoder. A vocoder models speech as the response of a system to excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders (e.g., MELP), homomorphic vocoders, channel vocoders, sinusoidal transform coders (“STC”), harmonic vocoders and multiband excitation (“MBE”) vocoders. In these vocoders, speech is divided into short segments (typically 10-40 ms), with each segment being characterized by a set of model parameters. These parameters typically represent a few basic elements of each speech segment, such as the pitch, voicing state, and spectral envelope of the segment. A vocoder may use one of a number of known representations for each of these parameters. For example, the pitch may be represented as a pitch period, a fundamental frequency or pitch frequency (which is the inverse of the pitch period), or as a long-term prediction delay. Similarly, the voicing state may be represented by one or more voicing metrics, by a voicing probability measure, or by a set of voicing decisions. The spectral envelope is often represented by an all-pole filter response, but also may be represented by a set of spectral magnitudes or other spectral measurements. Since model-based speech coders permit a speech segment to be represented using only a small number of parameters, model-based speech coders, such as vocoders, typically are able to operate at medium to low data rates. However, the quality of a model-based system is dependent on the accuracy of the underlying model. Accordingly, a high fidelity model must be used if these speech coders are to achieve high speech quality.
The MBE vocoder is a harmonic vocoder based on the MBE speech model that has been shown to work well in many applications. The MBE vocoder combines a harmonic representation for voiced speech with a flexible, frequency-dependent voicing structure based on the MBE speech model. This allows the MBE vocoder to produce natural sounding unvoiced speech and makes the MBE vocoder more robust to the presence of acoustic background noise. These properties allow the MBE vocoder to produce higher quality speech at low to medium data rates and have led to use of the MBE vocoder in a number of commercial mobile communication applications.
The MBE speech model represents segments of speech using a fundamental frequency corresponding to the pitch, a set of voicing metrics or decisions, and a set of spectral magnitudes corresponding to the frequency response of the vocal tract. The MBE model generalizes the traditional single V/UV decision per segment into a set of decisions, each representing the voicing state within a particular frequency band or region. Each frame is thereby divided into at least voiced and unvoiced frequency regions. This added flexibility in the voicing model allows the MBE model to better accommodate mixed voicing sounds, such as some voiced fricatives, allows a more accurate representation of speech that has been corrupted by acoustic background noise, and reduces the sensitivity to an error in any one decision. Extensive testing has shown that this generalization results in improved voice quality and intelligibility.
MBE-based vocoders include the IMBE™ speech coder and the AMBE® speech coder. The IMBE™ speech coder has been used in a number of wireless communications systems including APCO Project 25. The AMBE® speech coder is an improved system which includes a more robust method of estimating the excitation parameters (fundamental frequency and voicing decisions), and which is better able to track the variations and noise found in actual speech. Typically, the AMBE® speech coder uses a filter bank that often includes sixteen channels and a non-linearity to produce a set of channel outputs from which the excitation parameters can be reliably estimated. The channel outputs are combined and processed to estimate the fundamental frequency. Thereafter, the channels within each of several (e.g., eight) voicing bands are processed to estimate a voicing decision (or other voicing metrics) for each voicing band. In the AMBE+2™ vocoder, a three-state voicing model (voiced, unvoiced, pulsed) is applied to better represent plosive and other transient speech sounds. Various methods for quantizing the MBE model parameters have been applied in different systems. Typically the AMBE® vocoder and AMBE+2™ vocoder employ more advanced quantization methods, such as vector quantization, that produce higher quality speech at lower bit rates.
The encoder of an MBE-based speech coder estimates the set of model parameters for each speech segment. The MBE model parameters include a fundamental frequency (the reciprocal of the pitch period); a set of V/UV metrics or decisions that characterize the voicing state; and a set of spectral magnitudes that characterize the spectral envelope. After estimating the MBE model parameters for each segment, the encoder quantizes the parameters to produce a frame of bits. The encoder optionally may protect these bits with error correction/detection codes before interleaving and transmitting the resulting bit stream to a corresponding decoder.
The decoder in an MBE-based vocoder reconstructs the MBE model parameters (fundamental frequency, voicing information and spectral magnitudes) for each segment of speech from the received bit stream. As part of this reconstruction, the decoder may perform deinterleaving and error control decoding to correct and/or detect bit errors. In addition, phase regeneration is typically performed by the decoder to compute synthetic phase information. In one method, which is specified in the APCO Project 25 Vocoder Description and described in U.S. Pat. Nos. 5,081,681 and 5,664,051, random phase regeneration is used, with the amount of randomness depending on the voicing decisions. In another method, phase regeneration is performed by applying a smoothing kernel to the reconstructed spectral magnitudes as is described in U.S. Pat. No. 5,701,390.
The decoder uses the reconstructed MBE model parameters to synthesize a speech signal that perceptually resembles the original speech to a high degree. Normally separate signal components, corresponding to voiced, unvoiced, and optionally pulsed speech, are synthesized for each segment, and the resulting components are then added together to form the synthetic speech signal. This process is repeated for each segment of speech to reproduce the complete speech signal for output through a D-to-A converter and a loudspeaker. The unvoiced signal component may be synthesized using a windowed overlap-add method to filter a white noise signal. The time-varying spectral envelope of the filter is determined from the sequence of reconstructed spectral magnitudes in frequency regions designated as unvoiced, with other frequency regions being set to zero.
The decoder may synthesize the voiced signal component using one of several methods. In one method, specified in the APCO Project 25 Vocoder Description, a bank of harmonic oscillators is used, with one oscillator assigned to each harmonic of the fundamental frequency, and the contributions from all of the oscillators are summed to form the voiced signal component. In another method, the voiced signal component is synthesized by convolving a voiced impulse response with an impulse sequence and then combining the contribution from neighboring segments with windowed overlap add. This second method may be faster to compute, since it does not require any matching of components between segments, and it may be applied to the optional pulsed signal component.
One particular example of an MBE based vocoder is the 7200 bps IMBE™ vocoder selected as a standard for the APCO Project 25 mobile radio communication system. This vocoder, described in the APCO Project 25 Vocoder Description, uses 144 bits to represent each 20 ms frame. These bits are divided into 56 redundant FEC bits (applied by a combination of Golay and Hamming coding), 1 synchronization bit and 87 MBE parameter bits. The 87 MBE parameter bits consist of 8 bits to quantize the fundamental frequency, 3-12 bits to quantize the binary voiced/unvoiced decisions, and 67-76 bits to quantize the spectral magnitudes. The resulting 144 bit frame is transmitted from the encoder to the decoder. The decoder performs error correction before reconstructing the MBE model parameters from the error decoded bits. The decoder then uses the reconstructed model parameters to synthesize voiced and unvoiced signal components which are added together to form the decoded speech signal.