The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized voice-over-internet protocol (VoIP) transmission benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding method (see, e.g., Schroeder, et al., “Code-Excited Linear Prediction (CELP): High Quality Speech at Very Low Bit Rates,” in Proc. IEEE Int. Conf, on Acoustics, Speech, Signal Processing, (Tampa), pp. 937-940, March 1985) models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines linear prediction (LP) coefficients a(j), j=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by setting:r(n)=s(n)−ΣM≧j≧1a(j)s(n−j)  (1)and minimizing Σframer(n)2. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network, or PSTN, sampling for digital transmission and which corresponds to a voiceband of about 0.3-3.4 kHz); and the number of samples {s(n)} in a frame is often 80 or 160 (10 or 20 ms frames). Various windowing operations may be applied to the samples of the input speech frame. The name “linear prediction” arises from the interpretation of the residual r(n)=s(n)−ΣM≧j≧1a(j)s(n−j) as the error in predicting s(n) by a linear combination of preceding speech samples ΣM≧j≧1a(j)s(n−j); that is, a linear autoregression. Thus minimizing Σframer(n)2 yields the {a(j)} which furnish the best linear prediction. The coefficients {a(j)} may be converted to line spectral frequencies (LSFs) or immittance spectrum pairs (ISPs) for vector quantization plus transmission and/or storage.
The {r(n)} form the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of Equation (1); that is, Equation (1) is a convolution that z-transforms to a multiplication: R(z)=A(z)S(z), so S(z)=R(z)/A(z). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation for the LP synthesis filter. That is, from the encoded parameters the decoder generates a filter estimate, A(z), plus an estimate of the residual to use as an excitation, E(z); and thereby estimates the speech frame by Ŝ(z)=E(z)/Â(z). Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
For compression the LP approach basically quantizes various parameters and only transmits/stores updates or codebook entries for these quantized parameters, filter coefficients, pitch lag, residual waveform, and gains. A receiver regenerates the speech with the same perceptual characteristics as the input speech. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP encoder can operate at bits rates as low as 2-3 kb/s (kilobits per second).
For example, the Adaptive Multirate Wideband (AMR-WB) encoding standard with available bit rates ranging from 6.6 kb/s up to 23.85 kb/s uses LP analysis with codebook excitation (CELP) to compress speech. An adaptive-codebook contribution provides periodicity in the excitation and is the product of a gain, gP, multiplied by v(n), the excitation of the prior frame translated by the pitch lag of the current frame and interpolated to fit the current frame. The algebraic codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a multiple-pulse vector (also known as an innovation sequence), c(n), multiplied by a gain, gC. The number of pulses depends on the bit rate. That is, the excitation is u(n)=gPv(n)+gCc(n) where v(n) comes from the prior (decoded) frame, and gP, gC, and c(n) come from the transmitted parameters for the current frame. The speech synthesized from the excitation is then postfiltered to mask noise. Postfiltering essentially involves three successive filters: a short-term filter, a long-term filter, and a tilt compensation filter. The short-term filter emphasizes formants; the long-term filter emphasizes periodicity, and the tilt compensation filter compensates for the spectral tilt typical of the short-term filter. See, e.g., Bessette, et al., The Adaptive Multirate Wideband Speech Codec (AMR-VVB), 10 IEEE Tran. Speech and Audio Processing 620 (2002).
A layered (embedded) CELP speech encoder, such as the MPEG-4 audio CELP, provides bit rate scalability with an output bitstream consisting of a core (or base) layer (an adaptive codebook together with a fixed codebook 0) plus N enhancement layers (fixed codebooks 1 through N). For a general discussion on fixed (or algebraic) codebooks, see, e.g., Adoui, et al., “Fast CELP Coding Based on Algebraic Codes,” in Proc. IEEE Int. Conf on Acoustics, Speech, Signal Processing, (Dallas), pp. 1957-1960, April 1987.
A layered encoder uses only the core layer at the lowest bit rate to give acceptable quality and provides progressively enhanced quality by adding progressively more enhancement layers to the core layer. A layer's fixed codebook entry is found by minimizing the error between the input speech and the so-far cumulative synthesized speech. Layering is useful for some Voice-over-Internet-Protocol (VoIP) applications including different Quality-of-Service (QoS) offerings, network congestion control and multicasting. For different QoS service offerings, a layered encoder can provide several options of bit rate by increasing or decreasing the number of enhancement layers. For network congestion control, a network node can strip off some enhancement layers and lower the bit rate to ease network congestion. For multicasting, a receiver can retrieve appropriate number of bits from a single layer-structured bitstream according to its connection to the network.
CELP speech encoders apparently perform well in the 6-16 kb/s bit rates often found with VoIP transmissions. However, known CELP speech encoders that employ a layered (embedded) coding design do not perform as well at higher bit rates. A non-layered CELP speech encoder can optimize its parameters for best performance at a specific bit rate. Most parameters (e.g., pitch resolution, allowed fixed-codebook pulse positions, codebook gains, perceptual weighting, level of post-processing) are typically optimized to the operating bit rate. In a layered encoder, optimization for a specific bit rate is limited as the encoder performance is evaluated at many bit rates. Furthermore, CELP-like encoders incur a bit-rate penalty with the embedded constraint; a non-layered encoder can jointly quantize some of its parameters (e.g., fixed-codebook pulse positions), while a layered encoder cannot. In a layered encoder extra bits are also needed to encode the gains that correspond to the different bit rates, which require additional bits. Typically, the more embedded enhancement layers that are considered, the larger the bit-rate penalties. So for a given bit rate, non-layered encoders outperform layered encoders.