The invention relates to electronic devices, and more particularly to speech coding, transmission, storage, and decoding/synthesis methods and circuitry.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized-over-network (e.g., Voice over IP or Voice over Packet) transmissions benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients ai, i=1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by settingr(n)=s(n)+ΣM≧i≧1ais(n−i)  (1)and minimizing the energy Σr(n)2 of the residual r(n) in the frame. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network sampling for digital transmission); and the number of samples {s(n)} in a frame is typically 80 or 160 (10 or 20 ms frames). A frame of samples may be generated by various windowing operations applied to the input speech samples. The name “linear prediction” arises from the interpretation of r(n)=s(n)+ΣM≧i≧1ais(n−i) as the error in predicting s(n) by the linear combination of preceding speech samples −ΣM≧i≧1ais(n−i). Thus minimizing Σr(n)2 yields the {ai} which furnish the best linear prediction for the frame. The coefficients {ai} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage and converted to line spectral pairs (LSPs) for interpolation between subframes.
The {r(n)} is the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation which emulates the LP residual from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
The LP compression approach basically only transmits/stores updates for the (quantized) filter coefficients, the (quantized) residual (waveform or parameters such as pitch), and (quantized) gain(s). A receiver decodes the transmitted/stored items and regenerates the input speech with the same perceptual characteristics. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bits rates as low as 2-3 kb/s (kilobits per second).
However, high error rates in wireless transmission and large packet losses/delays for network transmissions demand that an LP decoder handle frames in which so many bits are corrupted that the frame is ignored (erased). To maintain speech quality and intelligibility for wireless or voice-over-packet applications in the case of erased frames, the decoder typically has methods to conceal such frame erasures, and such methods may be categorized as either interpolation-based or repetition-based. An interpolation-based concealment method exploits both future and past frame parameters to interpolate missing parameters. In general, interpolation-based methods provide better approximation of speech signals in missing frames than repetition-based methods which exploit only past frame parameters. In applications like wireless communications, the interpolation-based method has a cost of an additional delay to acquire the future frame. In Voice over Packet communications future frames are available from a playout buffer which compensates for arrival jitter of packets, and interpolation-based methods mainly increase the size of the playout buffer. Repetition-based concealment, which simply repeats or modifies the past frame parameters, finds use in several CELP-based speech coders including G.729, G.723.1, and GSM-EFR. The repetition-based concealment method in these coders does not introduce any additional delay or playout buffer size, but the performance of reconstructed speech with erased frames is poorer than that of the interpolation-based approach, especially in a high erased-frame ratio or bursty frame erasure environment.
In more detail, the ITU standard G.729 uses frames of 10 ms length (80 samples) divided into two 5-ms 40-sample subframes for better tracking of pitch and gain parameters plus reduced codebook search complexity. Each subframe has an excitation represented by an adaptive-codebook contribution and a fixed (algebraic) codebook contribution. The adaptive-codebook contribution provides periodicity in the excitation and is the product of v(n), the prior frame's excitation translated by the current frame's pitch lag in time and interpolated, multiplied by a gain, gP. The fixed codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a four-pulse vector, c(n), multiplied by a gain, gC. Thus the excitation is u(n)=gPv(n)+gCc(n) where v(n) comes from the prior (decoded) frame and gP, gC, and c(n) come from the transmitted parameters for the current frame. FIGS. 3-4 illustrate the encoding and decoding in block format; the postfilter essentially emphasizes any periodicity (e.g., vowels).
G.729 handles frame erasures by reconstruction based on previously received information; that is, repetition-based concealment. Namely, replace the missing excitation signal with one of similar characteristics, while gradually decaying its energy by using a voicing classifier based on the long-term prediction gain (which is computed as part of the long-term postfilter analysis). The long-term postfilter finds the long-term predictor for which the prediction gain is more than 3 dB by using a normalized correlation greater than 0.5 in the optimal (pitch) delay determination. For the error concealment process, a 10 ms frame is declared periodic if at least one 5 ms subframe has a long-term prediction gain of more than 3 dB. Otherwise the frame is declared nonperiodic. An erased frame inherits its class from the preceding (reconstructed) speech frame. Note that the voicing classification is continuously updated based on this reconstructed speech signal. FIG. 2 illustrates the decoder with concealment parameters. The specific steps taken for an erased frame are as follows:
1) repeat the synthesis filter parameters. The LP parameters of the last good frame are used.
2) repeat pitch delay. The pitch delay is based on the integer part of the pitch delay in the previous frame and is repeated for each successive frame. To avoid excessive periodicity, the pitch delay value is increased by one for each next subframe but bounded by 143.
3) repeat and attenuate adaptive and fixed-codebook gains. The adaptive-codebook gain is an attenuated version of the previous adaptive-codebook gain: if the (m+1)st frame is erased, use gP(m+1)=0.9 gP(m). Similarly, the fixed-codebook gain is an attenuated version of the previous fixed-codebook gain: gC(m+1)=0.98 gC(m).
4) attenuate the memory of the gain predictor. The gain predictor for the fixed-codebook gain uses the energy of the previously selected fixed codebook vectors c(n), so to avoid transitional effects once good frames are received, the memory of the gain predictor is updated with an attenuated version of the average codebook energy over four prior frames.
5) generate the replacement excitation. The excitation used depends upon the periodicity classification. If the last good or reconstructed frame was classified as periodic, the current frame is considered to be periodic as well. In that case only the adaptive codebook contribution is used, and the fixed-codebook contribution is set to zero. In contrast, if the last reconstructed frame was classified as nonperiodic, the current frame is considered to be nonperiodic as well, and the adaptive codebook contribution is set to zero. The fixed-codebook contribution is generated by randomly selecting a codebook index and sign index.
Leung et al, Voice Frame Reconstruction Methods for CELP Speech Coders in Digital Cellular and Wireless Communications, Proc. Wireless 93 (July 1993) describes missing frame reconstruction using parametric extrapolation and interpolation for a low complexity CELP coder using 4 subframes per frame.
However, the repetition-based concealment methods have poor results.