The invention relates to electronic devices, and more particularly to speech coding, transmission, storage, and decoding/synthesis methods and circuitry.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized-over-network (e.g., Voice over IP or Voice over Packet) transmissions benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients ai, i=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by settingr(n)=s(n)+ΣM≧i≧1ais(−i)  (1)and minimizing the energy Σr(n )2 of the residual r(n) in the frame. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network sampling for digital transmission); and the number of samples {s(n)} in a frame is typically 80 or 160 (10 or 20 ms frames). A frame of samples may be generated by various windowing operations applied to the input speech samples. The name “linear prediction” arises from the interpretation of r(n)=s(n)+ΣM≧i≧1 ai s(n−i) as the error in predicting s(n) by the linear combination of preceding speech samples −ΣM≧i≧1 ai s(n−i). Thus minimizing Σr(n)2 yields the {ai} which furnish the best linear prediction for the frame. The coefficients {ai} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage and converted to line spectral pairs (LSPs) for interpolation between subframes.
The {r(n)} is the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation which emulates the LP residual from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
The LP compression approach basically only transmits/stores updates for the (quantized) filter coefficients, the (quantized) residual (waveform or parameters such as pitch), and (quantized) gain(s). A receiver decodes the transmitted/stored items and regenerates the input speech with the same perceptual characteristics. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bits rates as low as 2-3 kb/s (kilobits per second). In more detail, the ITU standard G.729 uses frames of 10 ms length (80 samples) divided into two 5-ms 40-sample subframes for better tracking of pitch and gain parameters plus reduced codebook search complexity. Each subframe has an excitation represented by an adaptive-codebook contribution plus a fixed (algebraic) codebook contribution, and thus the name CELP for code-excited linear prediction. The adaptive-codebook contribution provides periodicity in the excitation and is the product of v(n), the prior frame's excitation translated by the current frame's pitch lag in time and interpolated, multiplied by a gain, gP. The algebraic codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a four-pulse vector, c(n), multiplied by a gain, gC. Thus the excitation is u(n)=gP v(n)+gC c(n) where v(n) comes from the prior (decoded) frame and gP, gC, and c(n) come from the transmitted parameters for the current frame. The speech synthesized from the excitation is then postfiltered. to mask noise. Postfiltering essentially comprises three successive filters: a short-term filter, a long-term filter, and a tilt compensation filter. The short-term filter emphasizes the formants; the long-term filter emphasizes periodicity, and the tilt compensation filter compensates for the spectral tilt typical of the short-term filter.
Further, as illustrated in FIGS. 2a-2b a layered coding such as the MPEG-4 audio CELP encoder/decoder provides bit rate scalability with an output bitstream consisting of a base layer (adaptive codebook together with fixed codebook 0) plus N enhancement layers (fixed codebooks 1 through N). A layered encoder uses only the base layer at the lowest bit rate to give acceptable quality and provides progressively enhanced quality by adding progressively more enhancement layers to the base layer. This layering is useful for some voice over packet (VoP) applications including different Quality of Service (QoS) offerings, network congestion control, and multicasting. For the different QoS service offerings, a layered coder can provide several options of bit rate by increasing or decreasing the number of enhancement layers. For the network congestion control, a network node can strip off some enhancement layers and lower the bit rate to ease network congestion. For multicasting, a receiver can retrieve appropriate number of bits from a single layer-structured bitstream according to its connection to the network.
CELP coders apparently perform well in the 6-16 kb/s bit rates often found with VoIP transmissions. However, known CELP coders perform less well at higher bit rates in a layered coding design, probably because the transmitter does not know how many layers will be decoded at the receiver.