The invention relates to electronic devices and digital signal processing, and more particularly to speech encoding and decoding.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized voice-over-internet protocol (VoIP) transmission benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients a(j), j=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by settingr(n)=s(n)−ΣM≧j≧1a(j)s(n−j)  (1)and minimizing Σframer(n)2. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network (PSTN) sampling for digital transmission and which corresponds to a voiceband of about 0.3-3.4 kHz); and the number of samples {s(n)} in a frame is often 80 or 160 (10 or 20 ms frames). Various windowing operations may be applied to the samples of the input speech frame. The name “linear prediction” arises from the interpretation of the residual r(n)=s(n)−ΣM≧j≧1a(j)s(n−j) as the error in predicting s(n) by a linear combination of preceding speech samples ΣM≧j≧1a(j)s(n−j); that is, a linear autoregression. Thus minimizing Σframer(n)2 yields the {a(j)} which furnish the best linear prediction. The coefficients {a(j)} may be converted to line spectral frequencies (LSFs) or immittance spectrum pairs (ISPs) for vector quantization plus transmission and/or storage.
The {r(n)} form the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1); that is, equation (1) is a convolution which z-transforms to multiplication: R(z)=A(z)S(z), so S(z)=R(z)/A(z). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation for the LP synthesis filter. That is, from the encoded parameters the decoder generates a filter estimate, Â(z), plus an estimate of the residual to use as an excitation, E(z); and thereby estimates the speech frame by Ŝ(z)=E(z)/Â(z). Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
For compression the LP approach basically quantizes various parameters and only transmits/stores updates or codebook entries for these quantized parameters, filter coefficients, pitch lag, residual waveform, and gains. A receiver regenerates the speech with the same perceptual characteristics as the input speech. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bits rates as low as 2-3 kb/s (kilobits per second).
Indeed, the Adaptive Multirate Wideband (AMR-WB) standard with available bit rates ranging from 6.6 kb/s up to 23.85 kb/s uses LP analysis with codebook excitation (CELP) to compress speech. FIGS. 2a-2b illustrate the AMR-WB encoder functional blocks. The adaptive-codebook contribution provides periodicity in the excitation and is the product of a gain, gP, multiplied by v(n), the excitation of the prior frame translated by the pitch lag of the current frame and interpolated. The algebraic codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a multiple-pulse vector (innovation sequence), c(n), multiplied by a gain, gC; the number of pulses depends upon the bit rate. That is, the excitation is u(n)=gPv(n)+gCc(n) where v(n) comes from the prior (decoded) frame and gP, gC, and c(n) come from the transmitted parameters for the current frame. The speech synthesized from the excitation is then postfiltered to mask noise. Postfiltering essentially comprises three successive filters: a short-term filter, a long-term filter, and a tilt compensation filter. The short-term filter emphasizes the formants; the long-term filter emphasizes periodicity, and the tilt compensation filter compensates for the spectral tilt typical of the short-term filter. See Bessette et al, The Adaptive Multirate Wideband Speech Codec (AMR-WB), 10 IEEE Tran. Speech and Audio Processing 620 (2002).
Further, FIG. 3 heuristically illustrates a layered (embedded) CELP encoder, such as the MPEG-4 audio CELP, which provides bit rate scalability with an output bitstream consisting of a core (base) layer (adaptive codebook together with fixed codebook 0) plus N enhancement layers (fixed codebooks 1 through N). A layered encoder uses only the core layer at the lowest bit rate to give acceptable quality and provides progressively enhanced quality by adding progressively more enhancement layers to the core layer. Find a layer's fixed codebook entry by minimization of the error between the input speech and the so-far cumulative synthesized speech. This layering is useful for some voice over Internet Protocol (VoIP) applications including different Quality of Service (QoS) offerings, network congestion control, and multicasting. For the different QoS service offerings, a layered coder can provide several options of bit rate by increasing or decreasing the number of enhancement layers. For the network congestion control, a network node can strip off some enhancement layers and lower the bit rate to ease network congestion. For multicasting, a receiver can retrieve appropriate number of bits from a single layer-structured bitstream according to its connection to the network.
CELP coders apparently perform well in the 6-16 kb/s bit rates often found with VoIP transmissions. However, known CELP coders perform less well at higher bit rates in a layered (embedded) coding design. A non-embedded CELP coder can optimize its parameters for best performance at a specific bit rate. Most parameters (e.g., pitch resolution, allowed fixed-codebook pulse positions, codebook gains, perceptual weighting, level of post-processing) are optimized to the operating bit rate. In an embedded coder, optimization for a specific bit rate is limited as the coder performance is evaluated at many bit rates. Furthermore, in CELP-like coders, there is a bit-rate penalty associated with the embedded constraint, a non-embedded coder can jointly quantize some of its parameters, e.g., fixed-codebook pulse positions, while an embedded coder cannot. In an embedded coder extra bits are also needed to encode the gains that correspond to the different bit rates, which require additional bits. Typically, the more embedded enhancement layers that are considered, the larger the bit-rate penalties, and so for a given bit rate, non-embedded coders outperform embedded coders.