The invention relates to electronic devices, and, more particularly, to speech coding, transmission, storage, and decoding/synthesis methods and circuitry.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized-over-network (e.g., Voice over IP) transmission benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients a(j), j=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by setting
$$r(n) = s(n) - \sum_{j=1}^{M} a(j)\,s(n-j) \qquad (1)$$
and minimizing $\sum_n r(n)^2$. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network (PSTN) sampling for digital transmission); and the number of samples {s(n)} in a frame is often 80 or 160 (10 or 20 ms frames). A frame of samples may be generated by various windowing operations applied to the input speech samples. The name “linear prediction” arises from the interpretation of $r(n) = s(n) - \sum_{j=1}^{M} a(j)\,s(n-j)$ as the error in predicting s(n) by the linear combination of preceding speech samples $\sum_{j=1}^{M} a(j)\,s(n-j)$. Thus minimizing $\sum_n r(n)^2$ yields the {a(j)} which furnish the best linear prediction. The coefficients {a(j)} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage.
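As an illustration only, the minimization described above can be sketched with the autocorrelation method and the Levinson-Durbin recursion, which is one common way of solving for the a(j). The text does not say which solution method is used, so that choice, the function names, and the treatment of samples before the frame as zero are all assumptions of this sketch.

```python
def autocorrelation(frame, max_lag):
    """Return R[0..max_lag] with R[k] = sum over n of frame[n] * frame[n - k]."""
    n = len(frame)
    return [sum(frame[i] * frame[i - k] for i in range(k, n))
            for k in range(max_lag + 1)]


def lp_coefficients(frame, order):
    """Levinson-Durbin recursion (assumed solver, not taken from the text).

    Returns [a(1), ..., a(order)] so that s(n) is predicted by
    sum_j a(j) * s(n - j), the sign convention of equation (1).
    """
    r = autocorrelation(frame, order)
    if r[0] == 0.0:
        return [0.0] * order              # silent frame: nothing to predict
    a = [0.0] * (order + 1)               # a[0] is unused
    err = r[0]                            # prediction error energy so far
    for m in range(1, order + 1):
        # Reflection coefficient for stage m.
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / err
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        err *= 1.0 - k * k
        if err <= 0.0:                    # frame is perfectly predictable
            break
    return a[1:]


def lp_residual(frame, coeffs):
    """r(n) = s(n) - sum_j a(j) s(n - j); samples before the frame count as 0."""
    out = []
    for n, s in enumerate(frame):
        pred = sum(a * frame[n - j]
                   for j, a in enumerate(coeffs, start=1) if n - j >= 0)
        out.append(s - pred)
    return out
```

For a decaying test signal such as `[0.9 ** n for n in range(80)]`, a second-order analysis should give a(1) close to 0.9 and a(2) close to 0, and the residual energy should be well below the frame energy.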
The {r(n)} form the LP residual for the frame and ideally would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1). Of course, the LP residual is not available at the decoder; so the task of the encoder is to represent the LP residual so that the decoder can generate the LP excitation from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
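The role of the synthesis filter 1/A(z) can be illustrated with a short sketch that runs an excitation through the recursion s(n) = e(n) + Σ a(j) s(n−j). The function names, the zero initial filter state, and the idealized unit-pulse train standing in for a voiced excitation are assumptions made for illustration.

```python
def pulse_train(length, pitch_lag, amplitude=1.0):
    """Idealized voiced excitation: one pulse every pitch_lag samples."""
    return [amplitude if n % pitch_lag == 0 else 0.0 for n in range(length)]


def synthesize(excitation, coeffs, history=None):
    """Filter the excitation through 1/A(z).

    coeffs is [a(1), ..., a(M)]; history optionally holds the previous
    outputs [s(-1), s(-2), ..., s(-M)] and defaults to zeros (assumption).
    """
    order = len(coeffs)
    past = list(history) if history is not None else [0.0] * order
    out = []
    for e in excitation:
        s = e + sum(a * p for a, p in zip(coeffs, past))
        out.append(s)
        past = ([s] + past)[:order]       # newest sample first
    return out
```

With a single coefficient a(1) = 0.9, a unit pulse produces the decaying sequence 1, 0.9, 0.81, ..., and a pulse train repeats that shape once per pitch period.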
The LP compression approach basically only transmits/stores updates for the (quantized) filter coefficients, the (quantized) residual (waveform or parameters such as pitch), and the (quantized) gain. A receiver regenerates the speech with the same perceptual characteristics as the input speech. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bit rates as low as 2-3 kb/s (kilobits per second).
Indeed, the ITU standard G.729 with a bit rate of 8 kb/s uses LP analysis with codebook excitation (CELP) to compress voiceband speech and has performance comparable to that of the 32 kb/s ADPCM in the G.726 standard. In particular, G.729 uses frames of 10 ms length divided into two 5 ms subframes for better tracking of pitch and gain parameters plus reduced codebook search complexity. The second subframe of a frame uses quantized and unquantized LP coefficients while the first subframe interpolates LP coefficients. Each subframe has an excitation represented by an adaptive-codebook part and a fixed-codebook part: the adaptive-codebook part represents the periodicity in the excitation signal using a fractional pitch lag with resolution of 1/3 sample and the fixed-codebook represents the difference between the synthesized residual and the adaptive-codebook representation. 10th order LP analysis with LSF quantization takes 18 bits.
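The figures quoted above imply a simple per-frame budget, worked through in the sketch below. The helper names are hypothetical, and the 8 kHz sampling rate is carried over from the typical value mentioned earlier rather than restated here for G.729.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Bits available in one frame at the given bit rate."""
    return bit_rate_bps * frame_ms // 1000


def samples_per_frame(sample_rate_hz, frame_ms):
    """Speech samples covered by one frame at the given sampling rate."""
    return sample_rate_hz * frame_ms // 1000


g729_bits = bits_per_frame(8000, 10)            # 8 kb/s, 10 ms frame
adpcm_bits = bits_per_frame(32000, 10)          # 32 kb/s over the same 10 ms
lsf_bits = 18                                   # LSF quantization, per the text
other_bits = g729_bits - lsf_bits               # left for all other parameters
frame_samples = samples_per_frame(8000, 10)     # assumes 8 kHz sampling
subframe_samples = samples_per_frame(8000, 5)   # one 5 ms subframe
```

Under these assumptions a 10 ms frame carries 80 bits for 80 samples, of which 18 bits go to the LSFs and 62 remain for everything else, against 320 bits for the same interval at 32 kb/s.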
G.729 handles frame erasures by reconstruction based on previously received information. Namely, the missing excitation signal is replaced with one of similar characteristics while its energy is gradually decayed. This is done using a voicing classifier based on the long-term prediction gain, which is computed as part of the long-term postfilter analysis. The long-term postfilter uses the long-term filter with a lag that gives a normalized correlation greater than 0.5. For the error concealment process, a 10 ms frame is declared periodic if at least one 5 ms subframe has a long-term prediction gain of more than 3 dB. Otherwise the frame is declared nonperiodic. An erased frame inherits its class from the preceding (reconstructed) speech frame. Note that the voicing classification is continuously updated based on this reconstructed speech signal.
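The classification rule described above can be sketched as follows. This is not the G.729 reference code: the function names are hypothetical, the long-term prediction gains are passed in directly rather than computed by a postfilter, and the class assumed before the first frame (nonperiodic) is an arbitrary choice made for the sketch.

```python
PERIODIC_GAIN_THRESHOLD_DB = 3.0   # threshold quoted in the text


def is_periodic(subframe_gains_db):
    """A frame is periodic if any subframe gain exceeds the 3 dB threshold."""
    return any(g > PERIODIC_GAIN_THRESHOLD_DB for g in subframe_gains_db)


def track_classes(frames, initial_periodic=False):
    """Return one periodic/nonperiodic flag per frame.

    frames holds, per 10 ms frame, either a pair of subframe long-term
    prediction gains in dB or None for an erased frame. An erased frame
    inherits the class of the preceding frame.
    """
    classes = []
    previous = initial_periodic        # assumption: nonperiodic at start-up
    for gains in frames:
        current = previous if gains is None else is_periodic(gains)
        classes.append(current)
        previous = current
    return classes
```

For example, `track_classes([(4.0, 1.0), None, (1.0, 2.0), None])` marks the first erased frame periodic and the second erased frame nonperiodic, following the frame received just before each.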
Leung et al, Voice Frame Reconstruction Methods for CELP Speech Coders in Digital Cellular and Wireless Communications, Proc. Wireless 93 (July 1993) describes missing frame reconstruction using parametric extrapolation and interpolation for a low complexity CELP coder using 4 subframes per frame. In particular, Leung et al proceeds as follows: For frame gain, perform scalar linear extrapolation or interpolation. For LPC coefficients, perform vector linear extrapolation or interpolation (i.e., matrices of extrapolation or interpolation acting on vectors of LPC coefficients to yield reconstructed LPC coefficients). For pitch lag and adaptive codebook coefficients (which are generated for each of the 4 subframes per frame), do median filtering to reconstruct the pitch lag (adjust the pitch search to ensure a smooth pitch contour); and adopt a conditional repeat strategy to reconstruct the adaptive codebook coefficients. That is, a voicing decision is made initially for the missing frame by comparing the pitch lag median with the pitch lags in the previous and possibly future frames. If over half of the lags (4 per frame) are within ±5 samples from the median value, the missing frame is declared as voiced. The coefficients can be reconstructed according to one of three methods: (1) if the missing frame is estimated to be unvoiced, then select the scaled version of the coefficients associated with the pitch lag median, (2) if the missing frame is voiced and extrapolation used, then a scaled version of the coefficients of the last subframe of the preceding frame is used, and (3) if the missing frame is voiced and interpolation used, then a scaled version of the coefficient from either the last subframe of the preceding frame or the first subframe of the next frame could be used depending upon whether the pitch median comes from the preceding frame or the next frame.
For stochastic excitation gain (generated for each subframe) do vector linear extrapolation or interpolation (i.e., matrices of extrapolation or interpolation acting on vectors of gains to yield reconstructed gains). For stochastic codebook parameters choose random values because of the lesser perceptual importance of these parameters and the fact of the relatively unpredictable behavior of the stochastic excitation.
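The median-based voicing decision attributed to Leung et al. above can be sketched as below. The function name is hypothetical, and taking the lower middle value as the median of an even number of lags (so that the median is always one of the received lags) is an assumption of this sketch, not something stated in the text.

```python
from statistics import median_low

LAG_TOLERANCE = 5   # samples, the +/-5 window quoted in the text


def voicing_decision(prev_lags, next_lags=()):
    """Return (voiced, median_lag) for a missing frame.

    prev_lags are the subframe pitch lags of the preceding frame and
    next_lags those of the following frame when it is available
    (interpolation); leave next_lags empty for extrapolation.
    """
    lags = list(prev_lags) + list(next_lags)
    med = median_low(lags)             # assumption: median is an actual lag
    close = sum(1 for lag in lags if abs(lag - med) <= LAG_TOLERANCE)
    voiced = close > len(lags) / 2     # "over half" of the lags
    return voiced, med
```

With steady lags such as `[40, 41, 40, 42]` and `[41, 40, 43, 41]` the missing frame is declared voiced, while widely scattered lags such as `[20, 90, 55, 130]` give an unvoiced decision.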
However, this extrapolation or interpolation method does not apply to differentially quantized parameters.
The present invention provides concealment of erased frames which had been differentially quantized by the use of nonlinear interpolation of prior and future received frame information.
This has advantages including, in preferred embodiments, use of the time delay and future frame availability of a playout buffer (e.g., as in packetized CELP-encoded voice transmission over a network, including VoIP) for estimating missing parameters for concealment.