Speech encoding and decoding have a large number of applications and have been studied extensively. In general, speech coding, which is also known as speech compression, seeks to reduce the data rate needed to represent a speech signal without substantially reducing the quality or intelligibility of the speech. Speech compression techniques may be implemented by a speech coder, which also may be referred to as a voice coder or vocoder.
A speech coder is generally viewed as including an encoder and a decoder. The encoder produces a compressed stream of bits from a digital representation of speech, such as may be generated at the output of an analog-to-digital converter having as an input an analog signal produced by a microphone. The decoder converts the compressed bit stream into a digital representation of speech that is suitable for playback through a digital-to-analog converter and a speaker. In many applications, the encoder and the decoder are physically separated, and the bit stream is transmitted between them using a communication channel.
A key parameter of a speech coder is the amount of compression the coder achieves, which is measured by the bit rate of the stream of bits produced by the encoder. The bit rate of the encoder is generally a function of the desired fidelity (i.e., speech quality) and the type of speech coder employed. Different types of speech coders have been designed to operate at different bit rates. Recently, low to medium rate speech coders operating below 10 kbps have received attention with respect to a wide range of mobile communication applications (e.g., cellular telephony, satellite telephony, land mobile radio, and in-flight telephony). These applications typically require high quality speech and robustness to artifacts caused by acoustic noise and channel noise (e.g., bit errors).
Speech is generally considered to be a non-stationary signal having signal properties that change over time. This change in signal properties is generally linked to changes made in the properties of a person's vocal tract to produce different sounds. A sound is typically sustained for some short period, typically 10-100 ms, and then the vocal tract is changed again to produce the next sound. The transition between sounds may be slow and continuous or it may be rapid as in the case of a speech “onset.” This change in signal properties increases the difficulty of encoding speech at lower bit rates since some sounds are inherently more difficult to encode than others and the speech coder must be able to encode all sounds with reasonable fidelity while preserving the ability to adapt to a transition in the characteristics of the speech signals. Performance of a low to medium bit rate speech coder can be improved by allowing the bit rate to vary. In variable-bit-rate speech coders, the bit rate for each segment of speech is allowed to vary between two or more options depending on various factors, such as user input, system loading, terminal design or signal characteristics.
There have been several main approaches for coding speech at low to medium data rates. For example, an approach based around linear predictive coding (LPC) attempts to predict each new frame of speech from previous samples using short and long term predictors. The prediction error is typically quantized using one of several approaches of which CELP and/or multi-pulse are two examples. The advantage of the linear prediction method is that it has good time resolution, which is helpful for the coding of unvoiced sounds. In particular, plosives and transients benefit from this in that they are not overly smeared in time. However, linear prediction typically has difficulty for voiced sounds in that the coded speech tends to sound rough or hoarse due to insufficient periodicity in the coded signal. This problem may be more significant at lower data rates that typically require a longer frame size and for which the long-term predictor is less effective at restoring periodicity.
Another leading approach for low to medium rate speech coding is a model-based speech coder or vocoder. A vocoder models speech as the response of a system to excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders such as MELP, homomorphic vocoders, channel vocoders, sinusoidal transform coders (“STC”), harmonic vocoders and multiband excitation (“MBE”) vocoders. In these vocoders, speech is divided into short segments (typically 10-40 ms), with each segment being characterized by a set of model parameters. These parameters typically represent a few basic elements of each speech segment, such as the segment's pitch, voicing state, and spectral envelope. A vocoder may use one of a number of known representations for each of these parameters. For example, the pitch may be represented as a pitch period, a fundamental frequency or pitch frequency (which is the inverse of the pitch period), or a long-term prediction delay. Similarly, the voicing state may be represented by one or more voicing metrics, by a voicing probability measure, or by a set of voicing decisions. The spectral envelope is often represented by an all-pole filter response, but also may be represented by a set of spectral magnitudes or other spectral measurements. Since they permit a speech segment to be represented using only a small number of parameters, model-based speech coders, such as vocoders, typically are able to operate at medium to low data rates. However, the quality of a model-based system is dependent on the accuracy of the underlying model. Accordingly, a high fidelity model must be used if these speech coders are to achieve high speech quality.
The MBE vocoder is a harmonic vocoder based on the MBE speech model that has been shown to work well in many applications. The MBE vocoder combines a harmonic representation for voiced speech with a flexible, frequency-dependent voicing structure based on the MBE speech model. This allows the MBE vocoder to produce natural sounding unvoiced speech and makes the MBE vocoder more robust to the presence of acoustic background noise. These properties allow the MBE vocoder to produce higher quality speech at low to medium data rates and have led to its use in a number of commercial mobile communication applications.
The MBE speech model represents segments of speech using a fundamental frequency corresponding to the pitch, a set of voicing metrics or decisions, and a set of spectral magnitudes corresponding to the frequency response of the vocal tract. The MBE model generalizes the traditional single V/UV decision per segment into a set of decisions that each represent the voicing state within a particular frequency band or region. Each frame is thereby divided into at least voiced and unvoiced frequency regions. This added flexibility in the voicing model allows the MBE model to better accommodate mixed voicing sounds, such as some voiced fricatives, allows a more accurate representation of speech that has been corrupted by acoustic background noise, and reduces the sensitivity to an error in any one decision. Extensive testing has shown that this generalization results in improved voice quality and intelligibility.
MBE-based vocoders include the IMBE™ speech coder which has been used in a number of wireless communications systems including the APCO Project 25 (“P25”) mobile radio standard. This P25 vocoder standard consists of a 7200 bps IMBE™ vocoder that combines 4400 bps of compressed voice data with 2800 bps of Forward Error Control (FEC) data. It is documented in Telecommunications Industry Association (TIA) document TIA-102BABA, entitled “APCO Project 25 Vocoder Description,” which is incorporated by reference.
The encoder of a MBE-based speech coder estimates a set of model parameters for each speech segment or frame. The MBE model parameters include a fundamental frequency (the reciprocal of the pitch period); a set of V/UV metrics or decisions that characterize the voicing state; and a set of spectral magnitudes that characterize the spectral envelope. After estimating the MBE model parameters for each segment, the encoder quantizes the parameters to produce a frame of bits. The encoder optionally may protect these bits with error correction/detection codes (FEC) before interleaving and transmitting the resulting bit stream to a corresponding decoder.
The decoder in a MBE-based vocoder reconstructs the MBE model parameters (fundamental frequency, voicing information and spectral magnitudes) for each segment of speech from the received bit stream. As part of this reconstruction, the decoder may perform deinterleaving and error control decoding to correct and/or detect bit errors. In addition, the decoder typically performs phase regeneration to compute synthetic phase information. For example, in a method specified in the APCO Project 25 Vocoder Description and described in U.S. Pat. Nos. 5,081,681 and 5,664,051, random phase regeneration is used, with the amount of randomness depending on the voicing decisions.
The decoder uses the reconstructed MBE model parameters to synthesize a speech signal that perceptually resembles the original speech to a high degree. Normally, separate signal components, corresponding to voiced, unvoiced, and optionally pulsed speech, are synthesized for each segment, and the resulting components are then added together to form the synthetic speech signal. This process is repeated for each segment of speech to reproduce the complete speech signal, which can then be output through a D-to-A converter and a loudspeaker. The unvoiced signal component may be synthesized using a windowed overlap-add method to filter a white noise signal. The time-varying spectral envelope of the filter is determined from the sequence of reconstructed spectral magnitudes in frequency regions designated as unvoiced, with other frequency regions being set to zero.
The decoder may synthesize the voiced signal component using one of several methods. In one method, specified in the APCO Project 25 Vocoder Description, a bank of harmonic oscillators is used, with one oscillator assigned to each harmonic of the fundamental frequency, and the contributions from all of the oscillators is summed to form the voiced signal component.
The 7200 bps IMBE™ vocoder, standardized for the APCO Project 25 mobile radio communication system, uses 144 bits to represent each 20 ms frame. These bits are divided into 56 redundant FEC bits (applied as a combination of Golay and Hamming codes), 1 synchronization bit and 87 MBE parameter bits. The 87 MBE parameter bits consist of 8 bits to quantize the fundamental frequency, 3-12 bits to quantize the binary voiced/unvoiced decisions, and 67-76 bits to quantize the spectral magnitudes. The resulting 144 bit frame is transmitted from the encoder to the decoder. The decoder performs error correction decoding before reconstructing the MBE model parameters from the error-decoded bits. The decoder then uses the reconstructed model parameters to synthesize voiced and unvoiced signal components which are added together to form the decoded speech signal.