Speech encoding and decoding have a large number of applications and have been studied extensively. In general, speech coding, which is also known as speech compression, seeks to reduce the data rate needed to represent a speech signal without substantially reducing the quality or intelligibility of the speech. Speech compression techniques may be implemented by a speech coder, which also may be referred to as a voice coder or vocoder.
A speech coder is generally viewed as including an encoder and a decoder. The encoder produces a compressed stream of bits from a digital representation of speech, such as may be generated at the output of an analog-to-digital converter having as an input an analog signal produced by a microphone. The decoder converts the compressed bit stream into a digital representation of speech that is suitable for playback through a digital-to-analog converter and a speaker. In many applications, the encoder and the decoder are physically separated, and the bit stream is transmitted between them using a communication channel.
A key parameter of a speech coder is the amount of compression the coder achieves, which is measured by the bit rate of the stream of bits produced by the encoder. The bit rate of the encoder is generally a function of the desired fidelity (i.e., speech quality) and the type of speech coder employed. Different types of speech coders have been designed to operate at different bit rates. Recently, low to medium rate speech coders operating below 10 kbps have received attention with respect to a wide range of mobile communication applications (e.g., cellular telephony, satellite telephony, land mobile radio, and in-flight telephony). These applications typically require high quality speech and robustness to artifacts caused by acoustic noise and channel noise (e.g., bit errors).
Speech is generally considered to be a non-stationary signal having signal properties that change over time. This change in signal properties is generally linked to changes made in the properties of a person's vocal tract to produce different sounds. A sound is typically sustained for a short period of 10–100 ms, and then the vocal tract is changed again to produce the next sound. The transition between sounds may be slow and continuous, or it may be rapid, as in the case of a speech “onset.” This change in signal properties increases the difficulty of encoding speech at lower bit rates, since some sounds are inherently more difficult to encode than others. The speech coder must be able to encode all sounds with reasonable fidelity while preserving the ability to adapt to a transition in the characteristics of the speech signal. One way to improve the performance of a low to medium bit rate speech coder is to allow the bit rate to vary. In variable-bit-rate speech coders, the bit rate for each segment of speech is allowed to vary between two or more options depending on various factors, such as user input, system loading, terminal design or signal characteristics.
There have been several main approaches for coding speech at low to medium data rates. For example, an approach based around linear predictive coding (LPC) attempts to predict each new frame of speech from previous samples using short and long term predictors. The prediction error is typically quantized using one of several approaches, of which CELP and multi-pulse are two examples. The advantage of the linear prediction method is that it has good time resolution, which is helpful for the coding of unvoiced sounds. In particular, plosives and transients benefit from this in that they are not overly smeared in time. However, linear prediction typically has difficulty with voiced sounds, in that the coded speech tends to sound rough or hoarse due to insufficient periodicity in the coded signal. This problem may be more significant at lower data rates, which typically require a longer frame size and for which the long-term predictor is less effective at restoring periodicity.
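As an illustration of the short-term prediction step only, the following is a minimal sketch that assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way to obtain predictor coefficients and is not necessarily the method used by any coder discussed here. The function names, the predictor order, and the test signal are hypothetical, and the long-term predictor and the quantization of the prediction error are omitted.

```python
def autocorrelation(x, max_lag):
    """Return autocorrelation values r[0..max_lag] of the sample list x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[0..order-1].

    The prediction is x_hat[n] = sum(a[k] * x[n - 1 - k]).
    Returns (a, err), where err is the remaining prediction error energy.
    """
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        if err <= 0.0:
            break  # degenerate input (e.g. silence); stop early
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err  # reflection coefficient for this order
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def short_term_residual(x, a):
    """Return the prediction error e[n] = x[n] - x_hat[n]."""
    residual = []
    for n, sample in enumerate(x):
        predicted = sum(a[k] * x[n - 1 - k]
                        for k in range(len(a)) if n - 1 - k >= 0)
        residual.append(sample - predicted)
    return residual


# Hypothetical test signal: a decaying exponential, which a first-order
# predictor with coefficient 0.9 predicts almost perfectly.
signal = [0.9 ** n for n in range(200)]
coeffs, error_energy = levinson_durbin(autocorrelation(signal, 2), 2)
residual = short_term_residual(signal, coeffs)
```

In this sketch the residual is what a CELP or multi-pulse stage would go on to quantize; for the test signal it is close to zero after the first sample.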
Another leading approach for low to medium rate speech coding is a model-based speech coder or vocoder. A vocoder models speech as the response of a system to excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders such as MELP, homomorphic vocoders, channel vocoders, sinusoidal transform coders (“STC”), harmonic vocoders and multiband excitation (“MBE”) vocoders. In these vocoders, speech is divided into short segments (typically 10–40 ms), with each segment being characterized by a set of model parameters. These parameters typically represent a few basic elements of each speech segment, such as the segment's pitch, voicing state, and spectral envelope. A vocoder may use one of a number of known representations for each of these parameters. For example, the pitch may be represented as a pitch period, a fundamental frequency or pitch frequency (which is the inverse of the pitch period), or as a long-term prediction delay. Similarly, the voicing state may be represented by one or more voicing metrics, by a voicing probability measure, or by a set of voicing decisions. The spectral envelope is often represented by an all-pole filter response, but also may be represented by a set of spectral magnitudes or other spectral measurements. Since they permit a speech segment to be represented using only a small number of parameters, model-based speech coders, such as vocoders, typically are able to operate at medium to low data rates. However, the quality of a model-based system is dependent on the accuracy of the underlying model. Accordingly, a high fidelity model must be used if these speech coders are to achieve high speech quality.
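The relationship between the pitch representations mentioned above can be sketched as follows. The 8000 Hz sampling rate is an assumption (the text does not state one), and the function names are hypothetical. A pitch period expressed in samples is also the kind of value that a long-term prediction delay would take.

```python
SAMPLE_RATE_HZ = 8000.0  # assumption: narrowband telephony sampling rate


def pitch_period_samples(f0_hz, sample_rate_hz=SAMPLE_RATE_HZ):
    """Pitch period in samples; the inverse of the fundamental frequency."""
    return sample_rate_hz / f0_hz


def fundamental_frequency_hz(period_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Fundamental (pitch) frequency in Hz for a pitch period in samples."""
    return sample_rate_hz / period_samples


def samples_per_segment(segment_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in one analysis segment of the given duration."""
    return int(round(sample_rate_hz * segment_ms / 1000.0))
```

For example, under the assumed sampling rate a 100 Hz fundamental corresponds to a pitch period of 80 samples, and a 20 ms segment holds 160 samples.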
The MBE vocoder is a harmonic vocoder based on the MBE speech model that has been shown to work well in many applications. The MBE vocoder combines a harmonic representation for voiced speech with a flexible, frequency-dependent voicing structure based on the MBE speech model. This allows the MBE vocoder to produce natural sounding unvoiced speech and makes the MBE vocoder more robust to the presence of acoustic background noise. These properties allow the MBE vocoder to produce higher quality speech at low to medium data rates and have led to its use in a number of commercial mobile communication applications.
The MBE speech model represents segments of speech using a fundamental frequency corresponding to the pitch, a set of voicing metrics or decisions, and a set of spectral magnitudes corresponding to the frequency response of the vocal tract. The MBE model generalizes the traditional single V/UV decision per segment into a set of decisions, each representing the voicing state within a particular frequency band or region. Each frame is thereby divided into at least voiced and unvoiced frequency regions. This added flexibility in the voicing model allows the MBE model to better accommodate mixed voicing sounds, such as some voiced fricatives, allows a more accurate representation of speech that has been corrupted by acoustic background noise, and reduces the sensitivity to an error in any one decision. Extensive testing has shown that this generalization results in improved voice quality and intelligibility.
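A minimal sketch of how the MBE parameters for one segment might be held in memory, and of how per-band voicing decisions extend to individual harmonics, is given below. The class and function names, the 4000 Hz bandwidth, and the equal-width band layout are all assumptions made for illustration; actual MBE coders define their own mapping of harmonics to voicing bands.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MbeFrame:
    """Hypothetical container for the MBE parameters of one segment."""
    f0_hz: float              # fundamental frequency
    band_voiced: List[bool]   # one voiced/unvoiced decision per band
    magnitudes: List[float]   # one spectral magnitude per harmonic


def harmonic_voicing(frame, bandwidth_hz=4000.0):
    """Return the voicing decision that applies to each harmonic.

    Assumes equal-width bands spanning 0..bandwidth_hz (illustrative only).
    """
    n_bands = len(frame.band_voiced)
    band_width_hz = bandwidth_hz / n_bands
    decisions = []
    for harmonic in range(1, len(frame.magnitudes) + 1):
        band = int(harmonic * frame.f0_hz / band_width_hz)
        band = min(band, n_bands - 1)  # clamp harmonics at the band edge
        decisions.append(frame.band_voiced[band])
    return decisions


# A mixed-voicing frame: voiced below 2000 Hz, unvoiced above.
frame = MbeFrame(f0_hz=200.0,
                 band_voiced=[True, True, False, False],
                 magnitudes=[1.0] * 19)
per_harmonic = harmonic_voicing(frame)
```

A traditional single V/UV decision corresponds to the special case in which `band_voiced` has exactly one entry.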
MBE-based vocoders include the IMBE™ speech coder and the AMBE® speech coder. The IMBE™ speech coder has been used in a number of wireless communications systems including the APCO Project 25 mobile radio standard. The AMBE® speech coder is an improved system which includes a more robust method of estimating the excitation parameters (fundamental frequency and voicing decisions), and which is better able to track the variations and noise found in actual speech. The AMBE® speech coder uses a filter bank that typically includes sixteen channels and a non-linearity to produce a set of channel outputs from which the excitation parameters can be reliably estimated. The channel outputs are combined and processed to estimate the fundamental frequency. Thereafter, the channels within each of several (e.g., eight) voicing bands are processed to estimate a binary voicing decision for each voicing band. In the AMBE+2™ vocoder, a three-state voicing model (voiced, unvoiced, pulsed) is applied to better represent plosive and other transient speech sounds. Various methods for quantizing the MBE model parameters have been applied in different systems. Typically, the AMBE® vocoder and AMBE+2™ vocoder employ more advanced quantization methods, such as vector quantization, that produce higher quality speech at lower bit rates.
The encoder of an MBE-based speech coder estimates the set of model parameters for each speech segment. The MBE model parameters include a fundamental frequency (the reciprocal of the pitch period); a set of V/UV metrics or decisions that characterize the voicing state; and a set of spectral magnitudes that characterize the spectral envelope. After estimating the MBE model parameters for each segment, the encoder quantizes the parameters to produce a frame of bits. The encoder optionally may protect these bits with error correction/detection codes before interleaving and transmitting the resulting bit stream to a corresponding decoder.
The decoder in an MBE-based vocoder reconstructs the MBE model parameters (fundamental frequency, voicing information and spectral magnitudes) for each segment of speech from the received bit stream. As part of this reconstruction, the decoder may perform deinterleaving and error control decoding to correct and/or detect bit errors. In addition, the decoder typically performs phase regeneration to compute synthetic phase information. For example, in a method specified in the APCO Project 25 Vocoder Description and described in U.S. Pat. Nos. 5,081,681 and 5,664,051, random phase regeneration is used, with the amount of randomness depending on the voicing decisions. In another method, phase regeneration is performed by applying a smoothing kernel to the reconstructed spectral magnitudes as described in U.S. Pat. No. 5,701,390.
The decoder uses the reconstructed MBE model parameters to synthesize a speech signal that perceptually resembles the original speech to a high degree. Normally, separate signal components, corresponding to voiced, unvoiced, and optionally pulsed speech, are synthesized for each segment, and the resulting components are then added together to form the synthetic speech signal. This process is repeated for each segment of speech to reproduce the complete speech signal, which can then be output through a D-to-A converter and a loudspeaker. The unvoiced signal component may be synthesized using a windowed overlap-add method to filter a white noise signal. The time-varying spectral envelope of the filter is determined from the sequence of reconstructed spectral magnitudes in frequency regions designated as unvoiced, with other frequency regions being set to zero.
The decoder may synthesize the voiced signal component using one of several methods. In one method, specified in the APCO Project 25 Vocoder Description (EIA/TIA standard document IS102BABA, herein incorporated by reference), a bank of harmonic oscillators is used, with one oscillator assigned to each harmonic of the fundamental frequency, and the contributions from all of the oscillators are summed to form the voiced signal component. In another method, as described in co-pending U.S. patent application Ser. No. 10/046,666, filed Jan. 16, 2002, which is incorporated by reference, the voiced signal component is synthesized by convolving a voiced impulse response with an impulse sequence and then combining the contribution from neighboring segments with windowed overlap-add. This second method has the advantage of being faster to compute since it does not require any matching of components between segments, and it has the further advantage that it can be applied to the optional pulsed signal component.
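The first method, a bank of harmonic oscillators, can be illustrated with the simplified sketch below. It holds the fundamental frequency and magnitudes constant over the segment and starts every oscillator at zero phase, whereas a practical decoder would interpolate amplitudes and phases between neighboring segments. The function name and the 8000 Hz sampling rate are assumptions, and this is not the procedure specified in the APCO Project 25 Vocoder Description.

```python
import math


def synthesize_voiced(f0_hz, magnitudes, n_samples, sample_rate_hz=8000.0):
    """Sum one cosine oscillator per harmonic of f0_hz (illustrative only)."""
    out = [0.0] * n_samples
    for harmonic, amplitude in enumerate(magnitudes, start=1):
        freq_hz = harmonic * f0_hz
        if freq_hz >= sample_rate_hz / 2.0:
            break  # harmonics at or above Nyquist cannot be represented
        omega = 2.0 * math.pi * freq_hz / sample_rate_hz
        for n in range(n_samples):
            out[n] += amplitude * math.cos(omega * n)
    return out


# Two harmonics of a 100 Hz fundamental over one 20 ms segment (160 samples).
voiced = synthesize_voiced(100.0, [1.0, 0.5], 160)
```

Because every oscillator is a harmonic of the same fundamental, the output repeats once per pitch period, which is 80 samples in this example.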
One particular example of an MBE-based vocoder is the 7200 bps IMBE™ vocoder selected as a standard for the APCO Project 25 mobile radio communication system. This vocoder, described in the APCO Project 25 Vocoder Description, uses 144 bits to represent each 20 ms frame. These bits are divided into 56 redundant FEC bits (applied as a combination of Golay and Hamming codes), 1 synchronization bit and 87 MBE parameter bits. The 87 MBE parameter bits consist of 8 bits to quantize the fundamental frequency, 3–12 bits to quantize the binary voiced/unvoiced decisions, and 67–76 bits to quantize the spectral magnitudes. The resulting 144 bit frame is transmitted from the encoder to the decoder. The decoder performs error correction decoding before reconstructing the MBE model parameters from the error-decoded bits. The decoder then uses the reconstructed model parameters to synthesize voiced and unvoiced signal components, which are added together to form the decoded speech signal.
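The bit budget described above can be checked with a short calculation. The figures are taken from the preceding paragraph; the constant and function names are hypothetical.

```python
FRAME_MS = 20           # frame duration in milliseconds
FRAME_BITS = 144        # total bits per frame
FEC_BITS = 56           # redundant FEC bits (Golay and Hamming codes)
SYNC_BITS = 1           # synchronization bit
F0_BITS = 8             # bits for the fundamental frequency

# Bits left for the MBE model parameters after FEC and synchronization.
PARAM_BITS = FRAME_BITS - FEC_BITS - SYNC_BITS

# Total bit rate and the share of it spent on FEC, in bits per second.
BIT_RATE_BPS = FRAME_BITS * 1000 // FRAME_MS
FEC_RATE_BPS = FEC_BITS * 1000 // FRAME_MS


def magnitude_bits(voicing_bits):
    """Bits left for the spectral magnitudes given the voicing bit count."""
    if not 3 <= voicing_bits <= 12:
        raise ValueError("voicing decisions use 3 to 12 bits")
    return PARAM_BITS - F0_BITS - voicing_bits
```

The 144 bits per 20 ms frame give the 7200 bps total rate, and the two ends of the voicing range (3 and 12 bits) leave 76 and 67 bits for the spectral magnitudes, so the three fields always total 87 bits.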
Subsequent to the development of the APCO Project 25 communication system, several advances in vocoder technology have been developed. These advanced methods allow new MBE-based vocoders to achieve higher voice quality at lower bit rates. For example, a state of the art MBE vocoder operating at 3600 bps can provide better performance than the standard 7200 bps APCO Project 25 vocoder even though it operates at half the data rate. The much lower data rate for the half-rate vocoder can provide much better communications efficiency (i.e., the amount of RF spectrum required for transmission) compared to the standard full-rate vocoder. However, use of a half-rate vocoder (or any other vocoder which is not bit stream compatible with the standard vocoder) in second generation radio devices creates interoperability issues if they have to communicate with existing radios that use the standard full-rate vocoder. In order to provide interoperability between two radios using different vocoders, the system infrastructure (i.e., the base station or repeater) must convert or transcode between the two different vocoders. The traditional method of performing this conversion is to receive the encoded bit stream from the first radio, decode the bit stream back into a speech signal using the appropriate decoder, re-encode this speech signal back to a bit stream using the second encoder and then transmit the re-encoded bit stream to the second radio. This process is commonly referred to as tandem transcoding or tandeming, because the net effect is that both vocoders are applied back-to-back (i.e., in tandem).
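The structure of tandem transcoding can be sketched as a simple composition of a decoder and an encoder. The stand-in codecs below are hypothetical sample-wise quantizers used only so that the sketch runs; they are not vocoders and do not represent the full-rate or half-rate coders discussed above.

```python
def tandem_transcode(bits_in, decode_first, encode_second):
    """Decode with the first coder, then re-encode with the second coder."""
    speech = decode_first(bits_in)   # back to a speech signal
    return encode_second(speech)     # re-encoded for the second radio


def toy_decode_8bit(codes):
    """Stand-in decoder (assumption): map codes 0..255 to samples in 0..1."""
    return [code / 255.0 for code in codes]


def toy_encode_4bit(samples):
    """Stand-in encoder (assumption): map samples in 0..1 to codes 0..15."""
    return [round(sample * 15) for sample in samples]


transcoded = tandem_transcode([0, 255], toy_decode_8bit, toy_encode_4bit)
```

The intermediate `speech` value is the point at which both coders are applied back-to-back: the second encoder sees only the output of the first decoder, not the original speech.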
An alternative digital-to-digital conversion method is presented in the context of a multi-speaker conferencing system in U.S. Pat. Nos. 5,383,184, 5,272,698, 5,457,685 and 5,317,567. This system includes a conferencing bridge that may interface vocoders operating at different bit rates without tandeming. In this application, the conferencing bridge measures the bit rate associated with each of several users, combines and converts all the bit streams, and sends the results back to each user at their particular bit rate. The bit rate conversion process in the conferencing bridge operates by reencoding the cepstral coefficients that represent the spectral envelope for each frame.