Speech encoding and decoding have a large number of applications and have been studied extensively. In general, speech coding, which is also known as speech compression, seeks to reduce the data rate needed to represent a speech signal without substantially reducing the quality or intelligibility of the speech. Speech compression techniques may be implemented by a speech coder, which also may be referred to as a voice coder or vocoder.
A speech coder is generally viewed as including an encoder and a decoder. The encoder produces a compressed stream of bits from a digital representation of speech, such as may be generated at the output of an analog-to-digital converter having as an input an analog signal produced by a microphone. The decoder converts the compressed bit stream into a digital representation of speech that is suitable for playback through a digital-to-analog converter and a speaker. In many applications, the encoder and decoder are physically separated, and the bit stream is transmitted between them using a communication channel.
A key parameter of a speech coder is the amount of compression the coder achieves, which is measured by the bit rate of the stream of bits produced by the encoder. The bit rate of the encoder is generally a function of the desired fidelity (i.e., speech quality) and the type of speech coder employed. Different types of speech coders have been designed to operate at different bit rates. Recently, low-to-medium rate speech coders operating below 10 kbps have received attention with respect to a wide range of mobile communication applications (e.g., cellular telephony, satellite telephony, land mobile radio, and in-flight telephony). These applications typically require high quality speech and robustness to artifacts caused by acoustic noise and channel noise (e.g., bit errors).
Speech is generally considered to be a non-stationary signal having signal properties that change over time. This change in signal properties is generally linked to changes made in the properties of a person's vocal tract to produce different sounds. A sound is typically sustained for a short period, typically 10-100 ms, and then the vocal tract is changed again to produce the next sound. The transition between sounds may be slow and continuous or it may be rapid as in the case of a speech “onset”. This change in signal properties increases the difficulty of encoding speech at lower bit rates since some sounds are inherently more difficult to encode than others and the speech coder must be able to encode all sounds with reasonable fidelity while preserving the ability to adapt to a transition in the speech signal's characteristics. One way to improve the performance of a low-to-medium bit rate speech coder is to allow the bit rate to vary. In variable-bit-rate speech coders, the bit rate for each segment of speech is not fixed, but is allowed to vary between two or more options depending on the signal characteristics. This type of adaptation can be applied to many different types of speech coders (or coders for other non-stationary signals, such as audio coders and video coders) with favorable results. Typically, the limitation in a communication system is that the system must be able to handle the different bit rates without interrupting the communications or degrading system performance.
There have been several main approaches for coding speech at low-to-medium data rates. For example, an approach based around linear predictive coding (LPC) attempts to predict each new frame of speech from previous samples using short-term and long-term predictors. The prediction error is typically quantized using one of several approaches, of which CELP and multi-pulse are two examples. The advantage of the linear prediction method is that it has good time resolution, which is helpful for the coding of unvoiced sounds. In particular, plosives and transients benefit from this in that they are not overly smeared in time. However, linear prediction typically has difficulty with voiced sounds in that the coded speech tends to sound rough or hoarse due to insufficient periodicity in the coded signal. This problem may be more significant at lower data rates, which typically require a longer frame size and for which the long-term predictor is less effective at restoring periodicity.
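The short-term prediction step described above can be sketched as follows. This is a minimal, generic illustration using the autocorrelation method and the Levinson-Durbin recursion; the function names and the choice of predictor order are illustrative assumptions, not details of any particular coder.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Short-term predictor coefficients via the autocorrelation method
    and the Levinson-Durbin recursion. Returns the polynomial
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order and the residual energy."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:n]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err          # reflection coefficient
        a[1:i] = a[1:i] + k * a[1:i][::-1]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def prediction_error(frame, a):
    """Prediction residual e[n] = s[n] + sum_j a[j] * s[n-j]."""
    return np.convolve(frame, a)[:len(frame)]
```

For a strongly predictable input such as a sustained voiced-like tone, the residual energy is far below the signal energy; it is this residual that a CELP or multi-pulse stage would then quantize.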
Another leading approach for low-to-medium rate speech coding is a model-based speech coder or vocoder. A vocoder models speech as the response of a system to excitation over short time intervals. Examples of vocoder systems include linear prediction vocoders such as MELP, homomorphic vocoders, channel vocoders, sinusoidal transform coders (“STC”), harmonic vocoders and multiband excitation (“MBE”) vocoders. In these vocoders, speech is divided into short segments (typically 10-40 ms), with each segment being characterized by a set of model parameters. These parameters typically represent a few basic elements of each speech segment, such as the segment's pitch, voicing state, and spectral envelope. A vocoder may use one of a number of known representations for each of these parameters. For example, the pitch may be represented as a pitch period, a fundamental frequency or pitch frequency (which is the inverse of the pitch period), or as a long-term prediction delay. Similarly, the voicing state may be represented by one or more voicing metrics, by a voicing probability measure, or by a set of voicing decisions. The spectral envelope is often represented by an all-pole filter response, but also may be represented by a set of spectral magnitudes or other spectral measurements. Since they permit a speech segment to be represented using only a small number of parameters, model-based speech coders, such as vocoders, typically are able to operate at medium to low data rates. However, the quality of a model-based system is dependent on the accuracy of the underlying model. Accordingly, a high fidelity model must be used if these speech coders are to achieve high speech quality.
One vocoder which has been shown to work well for many types of speech is the MBE vocoder which is basically a harmonic vocoder modified to use the Multi-Band Excitation (MBE) model. The MBE vocoder combines a harmonic representation for voiced speech with a flexible, frequency-dependent voicing structure that allows it to produce natural sounding unvoiced speech, and which makes it more robust to the presence of acoustic background noise. These properties allow the MBE model to produce higher quality speech at low to medium data rates and have led to its use in a number of commercial mobile communication applications.
The MBE speech model represents segments of speech using a fundamental frequency corresponding to the pitch, a set of voicing metrics or decisions, and a set of spectral magnitudes corresponding to the frequency response of the vocal tract. The MBE model generalizes the traditional single V/UV decision per segment into a set of decisions, each representing the voicing state within a particular frequency band or region. Each frame is thereby divided into voiced and unvoiced frequency regions. This added flexibility in the voicing model allows the MBE model to better accommodate mixed voicing sounds, such as some voiced fricatives, allows a more accurate representation of speech that has been corrupted by acoustic background noise, and reduces the sensitivity to an error in any one decision. Extensive testing has shown that this generalization results in improved voice quality and intelligibility.
The encoder of an MBE-based speech coder estimates the set of model parameters for each speech segment. The MBE model parameters include a fundamental frequency (the reciprocal of the pitch period); a set of V/UV metrics or decisions that characterize the voicing state; and a set of spectral magnitudes that characterize the spectral envelope. After estimating the MBE model parameters for each segment, the encoder quantizes the parameters to produce a frame of bits. The encoder optionally may protect these bits with error correction/detection codes before interleaving and transmitting the resulting bit stream to a corresponding decoder.
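For concreteness, the per-frame parameter set just described might be organized as follows. The structure, the names, and the assumed 4 kHz speech bandwidth are illustrative only and are not details of any particular MBE coder.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MbeFrameParameters:
    fundamental_hz: float    # reciprocal of the pitch period
    voicing: np.ndarray      # one V/UV decision (or metric) per frequency band
    magnitudes: np.ndarray   # one spectral magnitude per harmonic

def harmonic_count(fundamental_hz, bandwidth_hz=4000.0):
    """Number of harmonics of the fundamental within the band; the
    magnitude set has one entry per harmonic."""
    return int(bandwidth_hz // fundamental_hz)
```

Note that the number of magnitudes varies with the fundamental: a low-pitched frame carries many more harmonics than a high-pitched one, which is one reason quantization of the magnitude set is non-trivial.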
The decoder converts the received bit stream back into individual frames. As part of this conversion, the decoder may perform deinterleaving and error control decoding to correct or detect bit errors. The decoder then uses the frames of bits to reconstruct the MBE model parameters, which the decoder uses to synthesize a speech signal that perceptually resembles the original speech to a high degree.
MBE-based vocoders include the IMBE™ speech coder and the AMBE® speech coder. The AMBE® speech coder was developed as an improvement on earlier MBE-based techniques and includes a more robust method of estimating the excitation parameters (fundamental frequency and voicing decisions). The method is better able to track the variations and noise found in actual speech. The AMBE® speech coder uses a filter bank that typically includes sixteen channels and a non-linearity to produce a set of channel outputs from which the excitation parameters can be reliably estimated. The channel outputs are combined and processed to estimate the fundamental frequency. Thereafter, the channels within each of several (e.g., eight) voicing bands are processed to estimate a voicing decision (or other voicing metrics) for each voicing band.
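The role of the non-linearity can be illustrated with a greatly simplified, single-channel sketch. The actual AMBE® procedure operates on a multi-channel filter bank and combines channel outputs; here a simple magnitude non-linearity followed by a single autocorrelation stands in for that processing, and all names and lag limits are illustrative.

```python
import numpy as np

def estimate_pitch_period(x, min_lag=20, max_lag=160):
    """Estimate the pitch period (in samples) from the autocorrelation of
    a non-linearly processed signal. The |x| non-linearity regenerates
    energy at the fundamental even when the fundamental is weak or
    missing in the input."""
    y = np.abs(x)            # simple magnitude non-linearity
    y = y - np.mean(y)       # remove the DC term introduced by |x|
    r = np.correlate(y, y, mode="full")[len(y) - 1:]
    return min_lag + int(np.argmax(r[min_lag:max_lag + 1]))
```

As a usage example, a signal containing only the second and third harmonics of an 80-sample period still yields a period estimate near 80, because the non-linearity reintroduces the fundamental.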
Most MBE-based speech coders employ a two-state voicing model (voiced and unvoiced) in which each frequency region is determined to be either voiced or unvoiced. This system uses a set of binary voiced/unvoiced decisions to represent the voicing state of all the frequency regions in a frame of speech. In MBE-based systems, the encoder uses a spectral magnitude to represent the spectral envelope at each harmonic of the estimated fundamental frequency, estimating one spectral magnitude for each harmonic frequency. Each harmonic is designated as being either voiced or unvoiced, depending upon the voicing state of the frequency band containing the harmonic. Typically, the spectral magnitudes are estimated independently of the voicing decisions. To do this, the speech encoder computes a fast Fourier transform (“FFT”) for each windowed subframe of speech and averages the energy over frequency regions that are multiples of the estimated fundamental frequency. This approach preferably includes compensation to remove from the estimated spectral magnitudes artifacts introduced by the FFT sampling grid.
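The magnitude-estimation step described above can be sketched as follows. This version simply averages FFT energy over a band of half a harmonic spacing on each side of every harmonic, and it omits the compensation for the FFT sampling grid; the sampling rate, FFT size, and function names are illustrative assumptions.

```python
import numpy as np

def spectral_magnitudes(frame, f0_hz, fs_hz=8000.0, n_fft=1024):
    """Estimate one magnitude per harmonic of f0 by averaging the energy
    of a windowed FFT over each harmonic's frequency region."""
    w = np.hanning(len(frame))
    spec = np.abs(np.fft.rfft(frame * w, n_fft)) ** 2
    bin_hz = fs_hz / n_fft
    mags = []
    k = 1
    while (k + 0.5) * f0_hz < fs_hz / 2:
        # Region spans half a harmonic spacing on either side of k * f0.
        lo = int(round((k - 0.5) * f0_hz / bin_hz))
        hi = int(round((k + 0.5) * f0_hz / bin_hz))
        mags.append(np.sqrt(np.mean(spec[lo:hi + 1])))
        k += 1
    return np.array(mags)
```

For example, a pure tone at the fifth harmonic of a 100 Hz fundamental produces its largest magnitude in the fifth entry of the returned set.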
At the decoder, the received voicing decisions are used to identify the voicing state of each harmonic of the received fundamental frequency. The decoder then synthesizes separate voiced and unvoiced signal components using different procedures. The unvoiced signal component is preferably synthesized using a windowed overlap-add method to filter a white noise signal. The spectral envelope of the filter is determined from the received spectral magnitudes in frequency regions designated as unvoiced, and is set to zero in frequency regions designated as voiced.
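A minimal sketch of this windowed overlap-add noise synthesis follows. It shapes white noise with an envelope built from the received magnitudes, zeroing the regions designated as voiced; the frame advance, FFT length, and all names are illustrative assumptions rather than parameters of any specific decoder.

```python
import numpy as np

def synthesize_unvoiced(mags, voiced, f0_hz, n_frames, hop=160,
                        fs_hz=8000.0, n_fft=512, seed=0):
    """Filter white noise through the unvoiced part of the spectral
    envelope using windowed overlap-add. 'voiced' marks harmonics whose
    regions are set to zero; only unvoiced regions contribute noise."""
    rng = np.random.default_rng(seed)
    bin_hz = fs_hz / n_fft
    # Build the filter's magnitude response on the FFT grid.
    env = np.zeros(n_fft // 2 + 1)
    for k, (m, v) in enumerate(zip(mags, voiced), start=1):
        lo = int(round((k - 0.5) * f0_hz / bin_hz))
        hi = min(int(round((k + 0.5) * f0_hz / bin_hz)), n_fft // 2)
        env[lo:hi + 1] = 0.0 if v else m
    out = np.zeros(hop * n_frames + n_fft)
    win = np.hanning(n_fft)
    for i in range(n_frames):
        shaped = np.fft.irfft(np.fft.rfft(rng.standard_normal(n_fft)) * env, n_fft)
        out[i * hop:i * hop + n_fft] += win * shaped   # overlap-add
    return out
```

When every region is designated voiced, the envelope is zero everywhere and the synthesized unvoiced component vanishes, as the text describes.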
Early MBE-based systems estimated phase information at the encoder, quantized this phase information, and included the phase bits in the data received by the decoder. However, one significant improvement incorporated into later MBE-based systems is a phase synthesis method that allows the decoder to regenerate the phase information used in the synthesis of voiced signal components without explicitly requiring any phase information to be transmitted by the encoder. Such phase regeneration methods allow more bits to be allocated to other parameters, allow the bit rate to be reduced, and/or enable shorter frame sizes to thereby increase time resolution. Lower rate MBE vocoders typically use regenerated phase information. One type of phase regeneration is discussed in U.S. Pat. Nos. 5,081,681 and 5,664,051, both of which are incorporated by reference. In this approach, random phase synthesis is used with the amount of randomness depending on the voicing decisions. Alternatively, phase regeneration using minimum phase or using a smoothing kernel applied to the reconstructed spectral magnitudes can be employed. Such phase regeneration is described in U.S. Pat. No. 5,701,390, which is incorporated by reference.
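The idea of regenerating phase from the magnitudes alone can be illustrated with the standard minimum-phase construction via the real cepstrum. This is the generic textbook construction, not the specific method of the cited patents, and it assumes log-magnitudes sampled on a full symmetric FFT grid.

```python
import numpy as np

def minimum_phase(log_mag):
    """Given log-magnitudes on a full FFT grid of length N (even
    symmetry, as for a real filter), return the minimum-phase response
    implied by those magnitudes, computed via the real cepstrum."""
    n = len(log_mag)
    cep = np.fft.ifft(log_mag).real     # real cepstrum of the log-magnitudes
    # Fold the anti-causal part onto the causal part (homomorphic lifter).
    lifter = np.zeros(n)
    lifter[0] = 1.0
    lifter[1:n // 2] = 2.0
    if n % 2 == 0:
        lifter[n // 2] = 1.0
    log_h = np.fft.fft(cep * lifter)    # log of the minimum-phase response
    return np.imag(log_h)               # the imaginary part is the phase
```

As a check, applying this to the magnitude response of a known minimum-phase one-pole filter recovers that filter's true phase response.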
The decoder may synthesize the voiced signal component using one of several methods. For example, a short-time Fourier synthesis method constructs a harmonic spectrum corresponding to a fundamental frequency and the spectral parameters for a particular frame. This spectrum is then converted into a time sequence, either directly or using an inverse FFT, and then combined with similarly-constructed time sequences from neighboring frames using windowed overlap-add. While this approach is relatively straightforward, it sounds distorted for longer (e.g., 20 ms) frame sizes. The source of this distortion is the interference caused by the changing fundamental frequency between neighboring frames. As the fundamental frequency changes, the pitch period alignment changes between the previous and next frames. This causes interference when these misaligned time sequences are combined using overlap-add. For longer frame sizes, this interference causes the synthesized speech to sound rough and distorted.
Another voiced speech synthesizer uses a set of harmonic oscillators, assigns one oscillator to each harmonic of the fundamental frequency, and sums the contributions from all of the oscillators to form the voiced signal component. The instantaneous amplitude and phase of each oscillator are allowed to change according to a low-order polynomial (typically first order for the amplitude and third order for the phase). The polynomial coefficients are computed such that the amplitude, phase and frequency equal the received values for the two frames at the boundaries of the synthesis interval, and the polynomial effectively interpolates these values between the frame boundaries. Each harmonic oscillator matches a single harmonic component between the next and previous frames. The synthesizer uses frequency ordered matching, in which the first oscillator matches the first harmonic between the previous and current frames, the second oscillator matches the second harmonic between the previous and current frames, and so on. Frequency ordered matching eliminates the interference and resulting distortion as the fundamental frequency slowly changes between frames (even for long frame sizes >20 ms). In a related voiced synthesis method, frequency ordered matching of harmonic components is used in the context of the MBE speech model.
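A single oscillator of this kind can be sketched as follows, using linear amplitude interpolation and the classic maximally-smooth cubic phase polynomial that meets the amplitude, phase, and frequency at both frame boundaries. The voiced component would be the sum of such segments over all harmonics; the function name and argument conventions are illustrative.

```python
import numpy as np

def oscillator_segment(a1, a2, phi1, phi2, w1, w2, T):
    """One harmonic oscillator over a synthesis interval of T samples.
    Amplitude is interpolated linearly from a1 to a2; phase follows a
    cubic theta(t) with theta(0)=phi1, theta'(0)=w1, theta(T)=phi2+2*pi*M,
    theta'(T)=w2, with M chosen for the smoothest cubic."""
    t = np.arange(T)
    amp = a1 + (a2 - a1) * t / T
    # Phase-unwrapping integer M (maximally-smooth choice).
    M = round((phi1 + w1 * T - phi2 + 0.5 * (w2 - w1) * T) / (2 * np.pi))
    d = phi2 + 2 * np.pi * M - phi1 - w1 * T
    alpha = 3 * d / T**2 - (w2 - w1) / T
    beta = -2 * d / T**3 + (w2 - w1) / T**2
    phase = phi1 + w1 * t + alpha * t**2 + beta * t**3
    return amp * np.cos(phase)
```

When the parameters do not change between frames, the cubic terms vanish and the segment reduces to a pure sinusoid, which is the expected steady-state behavior.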
An alternative approach to voiced speech synthesis synthesizes speech as the sum of arbitrary (i.e., not harmonically constrained) sinusoids that are estimated by peak-picking on the original speech spectrum. This method is specifically designed to not use the voicing state (i.e., there are no voiced, unvoiced or other frequency regions), which means that non-harmonic sine waves are important to obtain good quality speech. However, the use of non-harmonic frequencies introduces a number of complications for the synthesis algorithm. For example, simple frequency ordered matching (e.g., first harmonic to first harmonic, second harmonic to second harmonic) is insufficient since the arbitrary sine-wave model is not limited to harmonic frequencies. Instead, a nearest-neighbor matching method that matches a sinusoidal component in one frame to a component in the neighboring frame that is the closest to it in frequency may be used. For example, if the fundamental frequency drops between frames by a factor of two, then the nearest-neighbor matching method allows the first sinusoidal component in one frame to be matched with the second component in the next frame, then the second sinusoidal component may be matched with the fourth, the third sinusoidal component may be matched with the sixth, and so on. This nearest-neighbor approach matches components regardless of any shifts in frequency or spectral energy, but at the cost of higher complexity.
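The nearest-neighbor matching rule described above can be sketched in a few lines; this simplified version matches every component in the previous frame and ignores the "birth" and "death" handling for unmatched tracks that a full sinusoidal coder would also need.

```python
import numpy as np

def nearest_neighbor_match(freqs_prev, freqs_next):
    """Match each sinusoid in the previous frame to the component in the
    next frame closest to it in frequency. Returns (prev_index,
    next_index) pairs."""
    freqs_next = np.asarray(freqs_next)
    return [(i, int(np.argmin(np.abs(freqs_next - f))))
            for i, f in enumerate(freqs_prev)]
```

Using the factor-of-two pitch drop from the text as an example, the first component matches the second, the second matches the fourth, and the third matches the sixth.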
As described, one common method for voiced speech synthesis uses sinusoidal oscillators with polynomial amplitude and phase interpolation to enable production of high quality voiced speech as the voiced speech parameters change between frames. However, such sinusoidal oscillator methods are generally quite complex because they must match components between frames and because they often compute the contribution of each oscillator separately, and for typical telephone bandwidth speech there may be as many as 64 harmonics, or even more in methods that employ non-harmonic sinusoids. In contrast, windowed overlap-add methods do not require any components to be matched between frames and are computationally much less complex. However, such methods can cause audible distortion, particularly for the longer frame sizes used in low rate coding. A hybrid synthesis method described in U.S. Pat. Nos. 5,195,166 and 5,581,656, which are incorporated by reference, combines these two techniques to produce a method that is computationally simpler than the harmonic oscillator method and avoids the distortion of the windowed overlap-add method. In this hybrid method, the N lowest frequency harmonics (typically N=7) are synthesized using harmonic oscillators with frequency-ordered matching and polynomial interpolation, while all remaining high-frequency harmonics are synthesized using an inverse FFT with interpolation and windowed overlap-add. While this method reduces complexity and preserves voice quality, it still requires higher complexity than overlap-add alone because the low-frequency harmonics are still synthesized with harmonic oscillators. In addition, program size increases because both synthesis methods must be implemented in the decoder.