The invention is directed to encoding and decoding speech or other audio signals.
Speech encoding and decoding have a large number of applications and have been studied extensively. In general, speech coding, which is often referred to as speech compression, seeks to reduce the data rate needed to represent a speech signal without substantially reducing the quality or intelligibility of the speech. Speech compression techniques may be implemented by a speech coder.
A speech coder is generally viewed as including an encoder and a decoder. The encoder produces a compressed stream of bits from a digital representation of speech, which may be generated by using an analog-to-digital converter to sample and digitize an analog speech signal produced by a microphone. The decoder converts the compressed bit stream into a digital representation of speech that is suitable for playback through a digital-to-analog converter and a speaker. In many applications, the encoder and decoder are physically separated, and the bit stream is transmitted between them using a communication channel. Alternatively, the bit stream may be stored in a computer or other memory for decoding and playback at a later time.
A key parameter of a speech coder is the amount of compression the coder achieves, which is measured by the bit rate of the stream of bits produced by the encoder. The bit rate of the encoder is generally a function of the desired fidelity (i.e., speech quality) and the type of speech coder employed. Different types of speech coders have been designed to operate at different bit rates. Medium to low rate speech coders operating below 10 kbps (kilobits per second) have received attention with respect to a wide range of mobile communication applications, such as cellular telephony, satellite telephony, land mobile radio, and in-flight telephony. These applications typically require high quality speech and robustness to artifacts caused by acoustic noise and channel noise (e.g., bit errors).
A well known approach for coding speech at medium to low data rates is based around linear predictive coding (LPC), which attempts to predict each new frame of speech from previous samples using short and/or long term predictors. The prediction error is typically quantized using one of several approaches of which CELP and/or multi-pulse are two examples. The linear prediction method has good time resolution, which is helpful for the coding of unvoiced sounds. In particular, plosives and transients benefit from the time resolution in that they are not overly smeared in time. However, linear prediction often has difficulty for voiced sounds, since the coded speech tends to sound rough or hoarse due to insufficient periodicity in the coded signal. This is particularly true at lower data rates, which typically require a longer frame size and employ a long-term predictor that is less effective at reproducing the periodic portion (i.e., the voiced portion) of speech.
Another well known approach for low to medium rate speech coding is a model-based speech coder, which is often referred to as a vocoder. A vocoder usually models speech as the response of some system to an excitation signal over short time intervals. Examples of vocoder systems include linear prediction vocoders, such as MELP or LPC-10, homomorphic vocoders, channel vocoders, sinusoidal transform coders (“STC”), harmonic vocoders, and multiband excitation (“MBE”) vocoders. In these vocoders, speech is divided into short segments (typically 10-40 ms), and each segment is characterized by a set of model parameters. These parameters typically represent a few basic elements of each speech segment, such as the segment’s pitch, voicing state, and spectral envelope. A vocoder may use one of a number of known representations for each of these parameters. For example, the pitch may be represented as a pitch period, a fundamental frequency, or a long-term prediction delay. Similarly, the voicing state may be represented by one or more voicing metrics, by a voicing probability measure, or by a ratio of periodic to stochastic energy. The spectral envelope is often represented by an all-pole filter response, but also may be represented by a set of spectral magnitudes, cepstral coefficients, or other spectral measurements.
Since they permit a speech segment to be represented using only a small number of parameters, model-based speech coders, such as vocoders, typically are able to operate at lower data rates. However, the quality of a model-based system is dependent on the accuracy of the underlying model. Accordingly, a high fidelity model must be used if these speech coders are to achieve high speech quality.
One vocoder which has been shown to work well for certain types of speech is the harmonic vocoder. The harmonic vocoder is generally able to accurately model voiced speech, which is generally periodic over some short time interval. The harmonic vocoder represents each short segment of speech with a pitch period and some form of vocal tract response. Often, one or both of these parameters are converted into the frequency domain, and represented as a fundamental frequency and a spectral envelope. A speech segment can be synthesized in a harmonic vocoder by summing a sequence of harmonically related sine waves having frequencies at multiples of the fundamental frequency and amplitudes matching the spectral envelope. Harmonic vocoders often have difficulty handling unvoiced speech, which is not easily modeled with a sparse collection of sine waves. Early harmonic vocoders handled unvoiced speech indirectly, without the use of any explicit voicing information, through a residual signal computed from the difference between the original speech and the harmonically-modeled speech. This residual signal was coded along with the model parameters, which led to a relatively high total bit rate, or it was dropped, which led to relatively low quality. In another approach, a single voiced/unvoiced decision was used for an entire frame, with model parameters being added for voiced frames and the spectrum being coded for unvoiced frames. Problems with this approach resulted from the insufficiency of a single voicing decision for the entire frame (many segments of speech are voiced in some regions while being unvoiced in other regions), and from the sensitivity of the system to a voicing error which would negatively affect the entire frame. Previous harmonic coding schemes also suffered from the need to code the harmonic phases for voiced speech, and from not using critically sampled spectral representations for the unvoiced speech.
These limitations reduced the number of bits available to code the other parameters, such as the harmonic magnitudes. As a result, the frame sizes were increased to around 30 ms to ensure that sufficient bits were available for all of the parameters at a reasonable total bit rate. Unfortunately, the use of a large frame size decreased time resolution in the system, which limited performance for unvoiced sounds and transients.
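The harmonic synthesis described above, in which a segment is formed by summing sine waves at multiples of the fundamental frequency, may be illustrated by the following minimal sketch. The function name, the 8 kHz sampling rate, the 160-sample segment length, and the use of zero starting phase for every harmonic are assumptions made for illustration only and are not taken from any particular coder.

```python
import math

def synthesize_harmonic_segment(f0_hz, magnitudes, sample_rate_hz=8000, num_samples=160):
    """Sum sine waves at multiples of f0_hz, one per entry in magnitudes.

    Assumptions for illustration: zero starting phase and constant
    amplitude over the segment (no interpolation between segments).
    """
    segment = []
    for n in range(num_samples):
        t = n / sample_rate_hz
        sample = 0.0
        for k, amplitude in enumerate(magnitudes, start=1):
            freq_hz = k * f0_hz
            if freq_hz >= sample_rate_hz / 2:
                break  # skip harmonics at or above the Nyquist frequency
            sample += amplitude * math.sin(2.0 * math.pi * freq_hz * t)
        segment.append(sample)
    return segment

# Example: a 100 Hz fundamental with three harmonics of decreasing amplitude.
segment = synthesize_harmonic_segment(100.0, [1.0, 0.5, 0.25])
```

A practical synthesizer would also interpolate amplitude, frequency, and phase between neighboring segments, which this sketch omits.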
One improvement to early harmonic vocoders was introduced in the form of the Multiband Excitation (MBE) speech model. This model combines a harmonic representation for voiced speech with a flexible, frequency-dependent voicing structure that allows it to produce natural sounding unvoiced speech, and which makes it more robust to the presence of acoustic background noise. These properties allow the MBE model to produce higher quality speech at low to medium data rates, and have led to its use in a number of commercial mobile communication applications.
The MBE speech model represents segments of speech using a fundamental frequency representing the pitch, a set of binary voiced/unvoiced (V/UV) decisions or other voicing metrics, and a set of spectral magnitudes representing the frequency response of the vocal tract. The MBE model generalizes the traditional single V/UV decision per segment into a set of decisions, each representing the voicing state within a particular frequency band or region. Each frame is thereby divided into voiced and unvoiced regions. This added flexibility in the voicing model allows the MBE model to better accommodate mixed voicing sounds, such as some voiced fricatives, allows a more accurate representation of speech that has been corrupted by acoustic background noise, and reduces the sensitivity to an error in any one decision. Extensive testing has shown that this generalization results in improved voice quality and intelligibility.
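The relationship between band-level voicing decisions and individual harmonics may be sketched as follows. The sketch assumes, purely for illustration, that the voicing bands are of equal width and span 0 to 4000 Hz; the function name and these band edges are hypothetical, and an actual coder may define its bands differently.

```python
def harmonic_voicing(f0_hz, band_decisions, bandwidth_hz=4000.0):
    """Return one voiced/unvoiced flag per harmonic of f0_hz.

    band_decisions holds one boolean per voicing band (True = voiced).
    Bands are assumed to be uniform in width, which is an illustrative
    simplification.
    """
    num_bands = len(band_decisions)
    band_width_hz = bandwidth_hz / num_bands
    flags = []
    k = 1
    while k * f0_hz < bandwidth_hz:
        band = min(int(k * f0_hz / band_width_hz), num_bands - 1)
        flags.append(band_decisions[band])  # harmonic inherits its band's decision
        k += 1
    return flags

# Example: the lower two of four bands are voiced, the upper two unvoiced.
flags = harmonic_voicing(200.0, [True, True, False, False])
```

In this example, the harmonics below 2000 Hz are designated voiced and the remaining harmonics are designated unvoiced, so that a single frame contains both voiced and unvoiced regions.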
The encoder of an MBE-based speech coder estimates the set of model parameters for each speech segment. The MBE model parameters include a fundamental frequency (the reciprocal of the pitch period); a set of V/UV metrics or decisions that characterize the voicing state; and a set of spectral magnitudes that characterize the spectral envelope. After estimating the MBE model parameters for each segment, the encoder quantizes the parameters to produce a frame of bits. The encoder optionally may protect these bits with error correction/detection codes before interleaving and transmitting the resulting bit stream to a corresponding decoder.
The decoder converts the received bit stream back into individual frames. As part of this conversion, the decoder may perform deinterleaving and error control decoding to correct or detect bit errors. The decoder then uses the frames of bits to reconstruct the MBE model parameters, which the decoder uses to synthesize a speech signal that is perceptually close to the original speech. The decoder may synthesize separate voiced and unvoiced components, and then may add the voiced and unvoiced components to produce the final speech signal.
In MBE-based systems, the encoder uses a spectral magnitude to represent the spectral envelope at each harmonic of the estimated fundamental frequency. The encoder then estimates a spectral magnitude for each harmonic frequency. Each harmonic is designated as being either voiced or unvoiced, depending upon whether the frequency band containing the corresponding harmonic has been declared voiced or unvoiced. When a harmonic frequency has been designated as being voiced, the encoder may use a magnitude estimator that differs from the magnitude estimator used when a harmonic frequency has been designated as being unvoiced. However, the spectral magnitudes generally are estimated independently of the voicing decisions. To do this, the speech coder computes a fast Fourier transform (“FFT”) for each windowed subframe of speech and averages the energy over frequency regions that are multiples of the estimated fundamental frequency. This approach may further include compensation to remove artifacts introduced by the FFT sampling grid from the estimated spectral magnitudes.
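The voicing-independent magnitude estimate described above may be sketched as follows. For self-containment the sketch uses a direct discrete Fourier transform in place of an FFT (the two produce the same values), a Hann window, and a band of one fundamental in width centered on each harmonic; these choices, the function names, and the 8 kHz sampling rate are assumptions. The band energy is summed rather than normalized to an average, and no compensation for the sampling grid is attempted.

```python
import cmath
import math

def dft_half(samples):
    """Direct DFT for bins 0..N/2 (an FFT yields the same values faster)."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
            for k in range(n // 2 + 1)]

def spectral_magnitudes(samples, f0_hz, sample_rate_hz=8000):
    """Estimate one magnitude per harmonic from summed spectral energy."""
    n = len(samples)
    # Hann window (an assumed choice of analysis window).
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]
    spectrum = dft_half([w * x for w, x in zip(window, samples)])
    energy = [abs(c) ** 2 for c in spectrum]
    bin_hz = sample_rate_hz / n
    magnitudes = []
    k = 1
    while (k + 0.5) * f0_hz <= sample_rate_hz / 2:
        lo = math.ceil((k - 0.5) * f0_hz / bin_hz)  # first bin in the band
        hi = math.ceil((k + 0.5) * f0_hz / bin_hz)  # one past the last bin
        magnitudes.append(math.sqrt(sum(energy[lo:hi])))
        k += 1
    return magnitudes

# Example: two harmonics of a 500 Hz fundamental with amplitudes 1.0 and 0.5.
test_signal = [math.sin(2 * math.pi * 500 * i / 8000)
               + 0.5 * math.sin(2 * math.pi * 1000 * i / 8000) for i in range(256)]
mags = spectral_magnitudes(test_signal, 500.0)
```

The returned values are unnormalized, so only their relative sizes are meaningful in this sketch.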
At the decoder, the voiced and unvoiced harmonics are identified, and separate voiced and unvoiced components are synthesized using different procedures. The unvoiced component may be synthesized using a weighted overlap-add method to filter a white noise signal. The filter used by the method sets to zero all frequency bands designated as voiced while otherwise matching the spectral magnitudes for regions designated as unvoiced. The voiced component is synthesized using a tuned oscillator bank, with one oscillator assigned to each harmonic that has been designated as being voiced. The instantaneous amplitude, frequency and phase are interpolated to match the corresponding parameters at neighboring segments. While early MBE-based systems included phase information in the bits received by the decoder, one significant improvement incorporated into later MBE-based systems is a phase synthesis method that allows the decoder to regenerate the phase information used in the synthesis of voiced speech without explicitly requiring any phase information to be transmitted by the encoder. Random phase synthesis based upon the voicing decisions may be applied, as in the case of the IMBE™ speech coder. Alternatively, the decoder may apply a smoothing kernel to the reconstructed spectral magnitudes to produce phase information that may be perceptually closer to that of the original speech than is the randomly produced phase information. Such phase regeneration methods allow more bits to be allocated to other parameters and enable shorter frame sizes, which increases time resolution.
MBE-based vocoders include the IMBE™ speech coder and the AMBE® speech coder. The AMBE® speech coder was developed as an improvement on earlier MBE-based techniques and includes a more robust method of estimating the excitation parameters (fundamental frequency and voicing decisions). The method is better able to track the variations and noise found in actual speech. The AMBE® speech coder uses a filter bank that typically includes sixteen channels and a non-linearity to produce a set of channel outputs from which the excitation parameters can be reliably estimated. The channel outputs are combined and processed to estimate the fundamental frequency. Thereafter, the channels within each of several (e.g., eight) voicing bands are processed to estimate a voicing decision (or other voicing metrics) for each voicing band.
Certain MBE-based vocoders, such as the AMBE® speech coder discussed above, are able to produce speech which sounds very close to the original speech. In particular, voiced sounds are very smooth and periodic and do not exhibit the roughness or hoarseness typically associated with the linear predictive speech coders. Tests have shown that a 4 kbps AMBE® speech coder can equal the performance of CELP type coders operating at twice the rate. However, the AMBE® vocoder still exhibits some distortion in unvoiced sounds due to excessive time spreading. This is due in part to the use in the unvoiced synthesis of an arbitrary white noise signal, which is uncorrelated with the original speech signal. This prevents the unvoiced component from localizing any transient sound within the segment. Hence, a short attack or small pulse of energy is spread out over the whole segment, which results in a “slushy” sound in the reconstructed signal.
The techniques noted above are described, for example, in: Flanagan, Speech Analysis, Synthesis and Perception, Springer-Verlag, 1972, pages 378-386 (describes a frequency-based speech analysis-synthesis system); Jayant et al., Digital Coding of Waveforms, Prentice-Hall, 1984 (describes speech coding in general); U.S. Pat. No. 4,885,790 (describes a sinusoidal processing method); U.S. Pat. No. 5,054,072 (describes a sinusoidal coding method); Tribolet et al., “Frequency Domain Coding of Speech”, IEEE TASSP, Vol. ASSP-27, No. 5, October 1979, pages 512-530 (describes speech specific ATC); Almeida et al., “Nonstationary Modeling of Voiced Speech”, IEEE TASSP, Vol. ASSP-31, No. 3, June 1983, pages 664-677 (describes harmonic modeling and an associated coder); Almeida et al., “Variable-Frequency Synthesis: An Improved Harmonic Coding Scheme”, IEEE Proc. ICASSP 84, pages 27.5.1-27.5.4 (describes a polynomial voiced synthesis method); Rodrigues et al., “Harmonic Coding at 8 KBITS/SEC”, Proc. ICASSP 87, pages 1621-1624 (describes a harmonic coding method); Quatieri et al., “Speech Transformations Based on a Sinusoidal Representation”, IEEE TASSP, Vol. ASSP-34, No. 6, December 1986, pages 1449-1986 (describes an analysis-synthesis technique based on a sinusoidal representation); McAulay et al., “Mid-Rate Coding Based on a Sinusoidal Representation of Speech”, Proc. ICASSP 85, pages 945-948, Tampa, Fla., Mar. 26-29, 1985 (describes a sinusoidal transform speech coder); Griffin, “Multiband Excitation Vocoder”, Ph.D. Thesis, M.I.T., 1988 (describes the MBE speech model and an 8000 bps MBE speech coder); Hardwick, “A 4.8 kbps Multi-Band Excitation Speech Coder”, S.M. Thesis, M.I.T., May 1988 (describes a 4800 bps MBE speech coder); Hardwick, “The Dual Excitation Speech Model”, Ph.D. Thesis, M.I.T., 1992 (describes the dual excitation speech model); Princen et al., “Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation”, IEEE Proc. ICASSP ’87, pages 2161-2164 (describes modified cosine transform using TDAC principles); Telecommunications Industry Association (TIA), “APCO Project 25 Vocoder Description”, Version 1.3, Jul. 15, 1993, IS102BABA (describes a 7.2 kbps IMBE™ speech coder for APCO Project 25 standard), all of which are incorporated by reference.
The invention provides improved coding techniques for speech or other signals. The techniques combine a multiband harmonic vocoder for voiced sounds with a new method for coding unvoiced sounds which is better able to handle transients. This results in improved speech quality at lower data rates. The techniques have wide applicability to digital voice communications including such applications as cellular telephony, digital radio, and satellite communications.
In one general aspect, the techniques feature encoding a speech signal into a set of encoded bits. The speech signal is digitized to produce a sequence of digital speech samples that are divided into a sequence of frames, with each of the frames spanning multiple digital speech samples. A set of speech model parameters then is estimated for a frame. The speech model parameters include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing pitch for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame. The speech model parameters are quantized to produce parameter bits.
The frame is also divided into one or more subframes, and transform coefficients are computed for the digital speech samples representing the subframes. The transform coefficients in unvoiced regions of the frame are quantized to produce transform bits. The parameter bits and the transform bits are included in the set of encoded bits.
Embodiments may include one or more of the following features. For example, when the frame is divided into frequency bands, and the voicing parameters include binary voicing decisions for frequency bands of the frame, the division into voiced and unvoiced regions may designate at least one frequency band as being voiced and one frequency band as being unvoiced. For some frames, all of the frequency bands may be designated as voiced or all may be designated as unvoiced.
The spectral parameters for the frame may include one or more sets of spectral magnitudes estimated for both voiced and unvoiced regions in a manner which is independent of the voicing parameters for the frame. When the spectral parameters for the frame include two or more sets of spectral magnitudes, they may be quantized by companding all sets of spectral magnitudes in the frame to produce sets of companded spectral magnitudes using a companding operation such as the logarithm, quantizing the last set of the companded spectral magnitudes in the frame, interpolating between the quantized last set of companded spectral magnitudes in the frame and a quantized set of companded spectral magnitudes from a prior frame to form interpolated spectral magnitudes, determining a difference between a set of companded spectral magnitudes and the interpolated spectral magnitudes, and quantizing the determined difference between the spectral magnitudes. The spectral magnitudes may be computed by windowing the digital speech samples to produce windowed speech samples, computing an FFT of the windowed speech samples to produce FFT coefficients, summing energy in the FFT coefficients around multiples of a fundamental frequency corresponding to the pitch parameter, and computing the spectral magnitudes as square roots of the summed energies.
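The companding, interpolation, and difference quantization steps described above may be sketched as follows for a frame having two sets of spectral magnitudes. The sketch assumes, for illustration only, that both sets and the prior frame's set have the same number of magnitudes, that interpolation is taken at the midpoint, and that a uniform scalar quantizer with a hypothetical step size stands in for whatever quantizers a real coder would employ.

```python
import math

def quantize(values, step):
    """Uniform scalar quantizer (an illustrative stand-in)."""
    return [step * round(v / step) for v in values]

def quantize_frame_magnitudes(first_set, last_set, prev_last_q, step=0.5):
    """Quantize two sets of magnitudes for one frame in the log2 domain.

    prev_last_q is the quantized, companded last set of the prior frame.
    Returns the reconstructed companded first and last sets.
    """
    # Compand every set with the logarithm.
    log_first = [math.log2(m) for m in first_set]
    log_last = [math.log2(m) for m in last_set]
    # Quantize the last set of the frame directly.
    last_q = quantize(log_last, step)
    # Interpolate between the prior frame's last set and this frame's last set.
    interpolated = [0.5 * (p + c) for p, c in zip(prev_last_q, last_q)]
    # Quantize the difference between the first set and the interpolation.
    residual_q = quantize([f - i for f, i in zip(log_first, interpolated)], step)
    first_q = [i + r for i, r in zip(interpolated, residual_q)]
    return first_q, last_q

# Example with two magnitudes per set.
first_q, last_q = quantize_frame_magnitudes([8.0, 8.0], [16.0, 8.0], [2.0, 3.0])
```

Because the interpolation already predicts much of the first set, the difference that remains to be quantized tends to be small when the spectrum changes smoothly between frames.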
The transform coefficients may be computed using a transform possessing critical sampling and perfect reconstruction properties. For example, the transform coefficients may be computed using an overlapped transform that computes transform coefficients for neighboring subframes using overlapping windows of the digital speech samples.
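One well known overlapped transform having both critical sampling and perfect reconstruction properties is the modified cosine transform based on time domain aliasing cancellation. The following sketch is offered as an example of such a transform and is not asserted to be the specific transform of any embodiment; the sine window, the 2/N inverse scaling, and the function names are assumptions. Each window of 2N samples advances by N samples and yields N coefficients, so the number of coefficients equals the number of new samples.

```python
import math

def sine_window(n_half):
    """Window of length 2*n_half satisfying w[n]^2 + w[n + n_half]^2 = 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * n_half)) for n in range(2 * n_half)]

def _basis(n, k, n_half):
    return math.cos(math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5))

def mdct(block, window):
    """Forward transform: 2*n_half windowed samples to n_half coefficients."""
    n_half = len(block) // 2
    return [sum(window[n] * block[n] * _basis(n, k, n_half) for n in range(2 * n_half))
            for k in range(n_half)]

def imdct(coeffs, window):
    """Inverse transform: n_half coefficients to 2*n_half windowed samples."""
    n_half = len(coeffs)
    return [window[n] * (2.0 / n_half) * sum(coeffs[k] * _basis(n, k, n_half)
                                             for k in range(n_half))
            for n in range(2 * n_half)]

def analyze_synthesize(signal, n_half):
    """Transform overlapping blocks, invert them, and overlap-add the results."""
    window = sine_window(n_half)
    output = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        coeffs = mdct(signal[start:start + 2 * n_half], window)
        for n, y in enumerate(imdct(coeffs, window)):
            output[start + n] += y  # aliasing cancels where blocks overlap
    return output

# Example: 32 samples processed with 8 new samples per block.
signal = [math.sin(0.3 * i) + 0.1 * i for i in range(32)]
rebuilt = analyze_synthesize(signal, 8)
```

Samples covered by two overlapping blocks (here, samples 8 through 23) are recovered to within rounding error, while the first and last half blocks would require neighboring data to be completed.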
The quantizing of the transform coefficients to produce transform bits may include computing a spectral envelope for the subframe from the model parameters, forming multiple sets of candidate coefficients, with each set of candidate coefficients being formed by combining one or more candidate vectors and multiplying the combined candidate vectors by the spectral envelope, selecting from the multiple sets of candidate coefficients the set of candidate coefficients which is closest to the transform coefficients, and including the index of the selected set of candidate coefficients in the transform bits. Each candidate vector may be formed from an offset into a known prototype vector and a number of sign bits, with each sign bit changing the sign of one or more elements of the candidate vector. The selected set of candidate coefficients may be the set from the multiple sets of candidate coefficients with the highest correlation with the transform coefficients.
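The candidate search described above may be sketched as follows. As a simplification, each set of candidate coefficients is formed here from a single candidate vector rather than a combination of several, each sign bit is assumed to control one contiguous group of elements, and correlation is normalized by the energy of the candidate; the function names and the example prototype vector are hypothetical. Because a sign change does not alter the energy of a candidate, the best sign bits for a given offset may be found directly from the sign of each group's partial correlation.

```python
import math

def apply_signs(vector, sign_bits):
    """Negate each contiguous group of elements whose sign bit is 1."""
    group = math.ceil(len(vector) / len(sign_bits))
    return [-v if sign_bits[i // group] else v for i, v in enumerate(vector)]

def build_candidate(prototype, offset, sign_bits, envelope):
    """Form candidate coefficients from an offset, sign bits, and an envelope."""
    raw = prototype[offset:offset + len(envelope)]
    return [e * v for e, v in zip(envelope, apply_signs(raw, sign_bits))]

def search_candidates(coeffs, prototype, envelope, num_sign_bits):
    """Return the (offset, sign_bits) with the highest normalized correlation."""
    length = len(coeffs)
    group = math.ceil(length / num_sign_bits)
    best = None
    for offset in range(len(prototype) - length + 1):
        shaped = [e * v for e, v in zip(envelope, prototype[offset:offset + length])]
        norm = math.sqrt(sum(s * s for s in shaped))
        if norm == 0.0:
            continue
        partial = [0.0] * num_sign_bits
        for i, (c, s) in enumerate(zip(coeffs, shaped)):
            partial[i // group] += c * s
        sign_bits = [1 if p < 0 else 0 for p in partial]  # flip negative groups
        score = sum(abs(p) for p in partial) / norm
        if best is None or score > best[0]:
            best = (score, offset, sign_bits)
    return best[1], best[2]

# Example: the target coefficients are built from offset 3 with the first group negated.
prototype = [1, -2, 3, 1, -1, 2, 4, -3]
envelope = [1.0, 0.5, 2.0, 1.0]
target = build_candidate(prototype, 3, [1, 0], envelope)
offset, sign_bits = search_candidates(target, prototype, envelope, 2)
```

Only the offset and sign bits need to be conveyed, since a decoder holding the same prototype vector and spectral envelope can rebuild the candidate with build_candidate.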
Quantizing of the transform coefficients to produce transform bits may further include computing a best scale factor for the selected candidate vectors of the subframe, quantizing the scale factors for the subframes in the frame to produce scale factor bits, and including the scale factor bits in the transform bits. Scale factors for different subframes in the frame may be jointly quantized to produce the scale factor bits. The joint quantization may use a vector quantizer.
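The scale factor computation and joint quantization described above may be sketched as follows. The sketch assumes that the best scale factor is the one minimizing squared error between the scaled candidate and the transform coefficients, and that the vector quantizer selects the nearest entry of a small codebook; the function names and codebook values are hypothetical.

```python
def best_scale_factor(coeffs, candidate):
    """Least-squares gain g minimizing sum((coeffs - g * candidate)^2)."""
    energy = sum(c * c for c in candidate)
    if energy == 0.0:
        return 0.0
    return sum(x * c for x, c in zip(coeffs, candidate)) / energy

def vector_quantize(scale_factors, codebook):
    """Return the index of the codebook entry nearest to scale_factors."""
    def distance(entry):
        return sum((s - e) ** 2 for s, e in zip(scale_factors, entry))
    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))

# Example: one gain per subframe, then joint quantization of the two gains.
gain_a = best_scale_factor([2.0, 4.0, -6.0], [1.0, 2.0, -3.0])
gain_b = best_scale_factor([0.9, 0.0], [1.0, 0.0])
codebook = [[1.0, 1.0], [2.0, 1.0], [4.0, 0.5]]  # hypothetical entries
index = vector_quantize([gain_a, gain_b], codebook)
```

Quantizing the scale factors of several subframes jointly allows a single index to represent all of them.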
The number of bits in the set of encoded bits for one frame in the sequence of frames may be different from the number of bits in the set of encoded bits for a second frame in the sequence of frames. To this end, the encoding may further include selecting the number of bits in the set of encoded bits, wherein the number may vary from frame to frame, and allocating the selected number of bits between the parameter bits and the transform bits. Selecting the number of bits in the set of encoded bits for a frame may be based at least in part on the degree of change between the spectral magnitude parameters representing the spectral information in the frame and the previous spectral magnitude parameters representing the spectral information in the previous frame. A greater number of bits may be favored when the degree of change is larger, and a smaller number of bits may be favored when the degree of change is smaller.
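The selection of a frame's bit count from the degree of spectral change may be sketched as follows. The change measure (mean absolute difference of log2 magnitudes), the single threshold, and the bit counts of 60 and 100 are hypothetical values chosen only to make the example concrete; the sketch also assumes that the current and previous frames have the same number of magnitudes.

```python
import math

def spectral_change(current_mags, previous_mags):
    """Mean absolute log2 difference between two sets of magnitudes."""
    pairs = list(zip(current_mags, previous_mags))
    return sum(abs(math.log2(c) - math.log2(p)) for c, p in pairs) / len(pairs)

def select_frame_bits(change, low_bits=60, high_bits=100, threshold=1.0):
    """Favor more bits when the spectrum changes more (hypothetical rule)."""
    return high_bits if change > threshold else low_bits

def allocate(total_bits, parameter_bits):
    """Split the frame's bits between parameter bits and transform bits."""
    return parameter_bits, total_bits - parameter_bits

# Example: a steady frame followed by a frame with a large spectral change.
steady_bits = select_frame_bits(spectral_change([4.0, 8.0], [4.0, 8.0]))
changing_bits = select_frame_bits(spectral_change([16.0, 2.0], [4.0, 8.0]))
parameter_bits, transform_bits = allocate(changing_bits, 40)
```

A practical encoder might use several thresholds or a continuous mapping rather than the two-level rule shown here.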
The encoding techniques may be implemented by an encoder. The encoder may include a dividing element that divides the digital speech samples into a sequence of frames, each of the frames including multiple digital speech samples, and a speech model parameter estimator that estimates a set of speech model parameters for a frame. The speech model parameters may include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing pitch for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame. The encoder also may include a parameter quantizer that quantizes the model parameters to produce parameter bits, a transform coefficient generator that divides the frame into one or more subframes and computes transform coefficients for the digital speech samples representing the subframes, a transform coefficient quantizer that quantizes the transform coefficients in unvoiced regions of the frame to produce transform bits, and a combiner that combines the parameter bits and the transform bits to produce the set of encoded bits. One, more than one, or all of the elements of the encoder may be implemented by a digital signal processor.
In another general aspect, a frame of digital speech samples is decoded from a set of encoded bits by extracting model parameter bits from the set of encoded bits and reconstructing model parameters representing the frame of digital speech samples from the extracted model parameter bits. The model parameters include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing the pitch information for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame. Voiced speech samples for the frame are reproduced from the reconstructed model parameters.
Transform coefficient bits are also extracted from the set of encoded bits. Transform coefficients representing unvoiced regions of the frame are reconstructed from the extracted transform coefficient bits. The reconstructed transform coefficients are inverse transformed to produce inverse transform samples from which unvoiced speech for the frame is produced. The voiced speech for the frame and the unvoiced speech for the frame are combined to produce the decoded frame of digital speech samples.
Embodiments may include one or more of the following features. For example, when the frame is divided into frequency bands, and the voicing parameters include binary voicing decisions for frequency bands of the frame, the division into voiced and unvoiced regions designates at least one frequency band as being voiced and one frequency band as being unvoiced.
The pitch parameter and the spectral parameters for the frame may include one or more fundamental frequencies and one or more sets of spectral magnitudes. The voiced speech samples for the frame may be produced using synthetic phase information computed from the spectral magnitudes, and may be produced at least in part by a bank of harmonic oscillators. For example, a low frequency portion of the voiced speech samples may be produced by a bank of harmonic oscillators and a high frequency portion of the voiced speech samples may be produced using an inverse FFT with interpolation, wherein the interpolation is based at least in part on the pitch information for the frame.
The decoding may further include dividing the frame into subframes, separating the reconstructed transform coefficients into groups, each group of reconstructed transform coefficients being associated with a different subframe in the frame, inverse transforming the reconstructed transform coefficients in a group to produce inverse transform samples associated with the corresponding subframe, and overlapping and adding the inverse transform samples associated with consecutive subframes to produce unvoiced speech for the frame. The inverse transform samples may be computed using the inverse of an overlapped transform possessing both critical sampling and perfect reconstruction properties.
The reconstructed transform coefficients may be produced from the transform coefficient bits by computing a spectral envelope from the reconstructed model parameters, reconstructing one or more candidate vectors from the transform coefficient bits, and forming reconstructed transform coefficients by combining the candidate vectors and multiplying the combined candidate vectors by the spectral envelope. A candidate vector may be reconstructed from the transform coefficient bits by use of an offset into a known prototype vector and a number of sign bits, wherein each sign bit changes the sign of one or more elements of the candidate vector.
The decoding techniques may be implemented by a decoder. The decoder may include a model parameter extractor that extracts model parameter bits from the set of encoded bits and a model parameter reconstructor that reconstructs model parameters representing the frame of digital speech samples from the extracted model parameter bits. The model parameters may include voicing parameters dividing the frame into voiced and unvoiced regions, at least one pitch parameter representing the pitch information for at least the voiced regions of the frame, and spectral parameters representing spectral information for at least the voiced regions of the frame. The decoder also may include a voiced speech synthesizer that produces voiced speech samples for the frame from the reconstructed model parameters, a transform coefficient extractor that extracts transform coefficient bits from the set of encoded bits, a transform coefficient reconstructor that reconstructs transform coefficients representing unvoiced regions of the frame from the extracted transform coefficient bits, an inverse transformer that inverse transforms the reconstructed transform coefficients to produce inverse transform samples, an unvoiced speech synthesizer that synthesizes unvoiced speech for the frame from the inverse transform samples, and a combiner that combines the voiced speech for the frame and the unvoiced speech for the frame to produce the decoded frame of digital speech samples. One, more than one, or all of the elements of the decoder may be implemented by a digital signal processor.
In another general aspect, speech model parameters including a voicing parameter, at least one pitch parameter representing pitch for a frame, and spectral parameters representing spectral information for the frame are estimated and quantized to produce parameter bits. The frame is then divided into one or more subframes and transform coefficients for the digital speech samples representing the subframes are computed using a transform possessing critical sampling and perfect reconstruction properties. At least some of the transform coefficients are quantized to produce transform bits that are included with the parameter bits in a set of encoded bits.
In yet another general aspect, a frame of digital speech samples is decoded from a set of encoded bits by extracting model parameter bits from the set of encoded bits, reconstructing model parameters representing the frame of digital speech samples from the extracted model parameter bits, and producing voiced speech samples for the frame using the reconstructed model parameters. In addition, transform coefficient bits are extracted from the set of encoded bits to reconstruct transform coefficients that are inverse transformed to produce inverse transform samples. The inverse transform samples are produced using the inverse of an overlapped transform possessing both critical sampling and perfect reconstruction properties. Unvoiced speech for the frame is produced from the inverse transform samples, and is combined with the voiced speech to produce the decoded frame of digital speech samples.
In yet another general aspect, a speech signal is encoded into a set of encoded bits by digitizing the speech signal to produce a sequence of digital speech samples that are divided into a sequence of frames that each span multiple samples. A set of speech model parameters is estimated for a frame. The speech model parameters include a voicing parameter, at least one pitch parameter representing pitch for the frame, and spectral parameters representing spectral information for the frame, the spectral parameters including one or more sets of spectral magnitudes estimated in a manner which is independent of the voicing parameter for the frame. The model parameters are quantized to produce parameter bits.
The frame is divided into one or more subframes and transform coefficients are computed for the digital speech samples representing the subframes. At least some of the transform coefficients are quantized to produce transform bits that are included with the parameter bits in the set of encoded bits.
In yet another general aspect, a frame of digital speech samples is decoded from a set of encoded bits. Model parameter bits are extracted from the set of encoded bits, and model parameters representing the frame of digital speech samples are reconstructed from the extracted model parameter bits. The model parameters include a voicing parameter, at least one pitch parameter representing pitch information for the frame, and spectral parameters representing spectral information for the frame. Voiced speech samples are produced for the frame using the reconstructed model parameters and synthetic phase information computed from the spectral magnitudes.
In addition, transform coefficient bits are extracted from the set of encoded bits, and transform coefficients are reconstructed from the extracted transform coefficient bits. The reconstructed transform coefficients are inverse transformed to produce inverse transform samples. Finally, unvoiced speech for the frame is produced from the inverse transform samples and combined with the voiced speech to produce the decoded frame of digital speech samples.
Other advantages will be apparent from the following description, including the drawings, and from the claims.