The invention relates to electronic devices, and more particularly to speech coding, transmission, and decoding/synthesis methods and circuitry.
The performance of digital speech systems using low bit rates has become increasingly important with current and foreseeable digital communications. Both dedicated channel and packetized-over-network (e.g., Voice over IP or Voice over Packet) transmissions benefit from compression of speech signals. The widely-used linear prediction (LP) digital speech coding compression method models the vocal tract as a time-varying filter and a time-varying excitation of the filter to mimic human speech. Linear prediction analysis determines LP coefficients a_i, i = 1, 2, . . . , M, for an input frame of digital speech samples {s(n)} by setting

r(n) = s(n) + Σ_{i=1}^{M} a_i s(n−i)   (1)

and minimizing the energy Σ r(n)² of the residual r(n) in the frame. Typically, M, the order of the linear prediction filter, is taken to be about 10-12; the sampling rate to form the samples s(n) is typically taken to be 8 kHz (the same as the public switched telephone network sampling for digital transmission); and the number of samples {s(n)} in a frame is typically 80 or 160 (10 or 20 ms frames). A frame of samples may be generated by various windowing operations applied to the input speech samples. The name “linear prediction” arises from the interpretation of r(n) = s(n) + Σ_{i=1}^{M} a_i s(n−i) as the error in predicting s(n) by the linear combination of preceding speech samples, −Σ_{i=1}^{M} a_i s(n−i). Thus minimizing Σ r(n)² yields the {a_i} which furnish the best linear prediction for the frame. The coefficients {a_i} may be converted to line spectral frequencies (LSFs) for quantization and transmission or storage and converted to line spectral pairs (LSPs) for interpolation between subframes.
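The minimization of the residual energy in equation (1) is conventionally solved by the autocorrelation method with the Levinson-Durbin recursion. A minimal sketch under that assumption (the function name and return values are illustrative; sign convention matches equation (1), so A(z) = 1 + Σ a_i z^{-i}):

```python
import numpy as np

def lp_coefficients(s, order=10):
    """Estimate LP coefficients a_1..a_M for a windowed frame s by the
    autocorrelation method (Levinson-Durbin recursion). Returns the full
    coefficient vector [1, a_1, ..., a_M] and the final residual energy."""
    # autocorrelation R[k] of the frame for lags 0..order
    R = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    E = R[0]  # prediction error energy, reduced at each recursion step
    for i in range(1, order + 1):
        # reflection coefficient for order i
        acc = R[i] + np.dot(a[1:i], R[i - 1:0:-1])
        k = -acc / E
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        E *= (1.0 - k * k)
    return a, E
```

For a decaying exponential s(n) = 0.5^n, a first-order analysis recovers a_1 ≈ −0.5, i.e., the predictor s(n) ≈ 0.5 s(n−1).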
The {r(n)} is the LP residual for the frame, and ideally the LP residual would be the excitation for the synthesis filter 1/A(z) where A(z) is the transfer function of equation (1). Of course, the LP residual is not available at the decoder; thus the task of the encoder is to represent the LP residual so that the decoder can generate an excitation which emulates the LP residual from the encoded parameters. Physiologically, for voiced frames the excitation roughly has the form of a series of pulses at the pitch frequency, and for unvoiced frames the excitation roughly has the form of white noise.
The LP compression approach basically only transmits/stores updates for the (quantized) filter coefficients, the (quantized) residual (waveform or parameters such as pitch), and (quantized) gain(s). A receiver decodes the transmitted/stored items and regenerates the input speech with the same perceptual characteristics. FIGS. 5-6 illustrate high level blocks of an LP system. Periodic updating of the quantized items requires fewer bits than direct representation of the speech signal, so a reasonable LP coder can operate at bit rates as low as 2-3 kb/s (kilobits per second).
The ITU standard G.729 uses frames of 10 ms length (80 samples) divided into two 5-ms 40-sample subframes for better tracking of pitch and gain parameters plus reduced codebook search complexity. Each subframe has an excitation represented by an adaptive-codebook contribution and a fixed (algebraic) codebook contribution. The adaptive-codebook contribution provides periodicity in the excitation and is the product of v(n), the prior frame's excitation translated by the current frame's pitch delay in time and interpolated, multiplied by a gain, gP. The fixed codebook contribution approximates the difference between the actual residual and the adaptive codebook contribution with a four-pulse vector, c(n), multiplied by a gain, gc. Thus the excitation is u(n)=gP v(n)+gc c(n) where v(n) comes from the prior (decoded) frame and gP, gc, and c(n) come from the transmitted parameters for the current frame. FIGS. 3-4 illustrate the encoding and decoding in block format; the postfilter essentially emphasizes any periodicity (e.g., vowels).
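The per-subframe excitation construction u(n) = gP v(n) + gc c(n) can be sketched as follows. This is a simplified model, not the standard's implementation: it uses an integer pitch delay with no fractional interpolation, and `build_excitation` and its argument names are illustrative:

```python
import numpy as np

SUBFRAME = 40  # 5 ms at 8 kHz, as in G.729

def build_excitation(past_exc, pitch_delay, g_p, pulses, g_c):
    """One subframe of excitation u(n) = g_p*v(n) + g_c*c(n).
    past_exc: previously decoded excitation samples (list of floats);
    pulses: (position, sign) pairs for the sparse algebraic-codebook pulses.
    v(n) is read back from the excitation buffer at the pitch delay, so
    periodicity is carried forward even when the delay is < SUBFRAME."""
    exc = list(past_exc)
    start = len(exc)
    # fixed (algebraic) codebook contribution: a few +/-1 pulses
    c = np.zeros(SUBFRAME)
    for pos, sign in pulses:
        c[pos] = sign
    for n in range(SUBFRAME):
        v_n = exc[start + n - pitch_delay]      # adaptive-codebook sample
        exc.append(g_p * v_n + g_c * c[n])      # u(n) = g_p v(n) + g_c c(n)
    return np.array(exc[start:])
```

With gc = 0 and a pitch delay matching the period of the past excitation, the subframe simply continues the periodic waveform; the algebraic pulses then correct the remaining difference from the true residual.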
However, high error rates in wireless transmission and large packet losses/delays for network transmissions demand that an LP decoder handle frames which arrive too late for playout or in which so many bits are corrupted that the frame is ignored (erased). To maintain speech quality and intelligibility for wireless or voice-over-packet applications in the case of lost or erased frames, the decoder typically has methods to conceal such frame erasures. In particular, G.729 handles frame erasures and lost frames by reconstruction based on previously received information; that is, repetition-based concealment. Namely, the decoder replaces the missing excitation signal with one of similar characteristics, while gradually decaying its energy, by using a voicing classifier based on the long-term prediction gain (which is computed as part of the long-term postfilter analysis). The long-term postfilter finds the long-term predictor for which the prediction gain is more than 3 dB by using a normalized correlation greater than 0.5 in the optimal delay determination. For the error concealment process, a 10 ms frame is declared periodic if at least one 5 ms subframe has a long-term prediction gain of more than 3 dB. Otherwise the frame is declared nonperiodic. An erased frame inherits its class from the preceding (reconstructed) speech frame. Note that the voicing classification is continuously updated based on this reconstructed speech signal. The specific steps taken for an erased frame are as follows:
    1) Repetition of the synthesis filter parameters. The LP parameters of the last good frame are used.
    2) Attenuation of the adaptive- and fixed-codebook gains. The adaptive-codebook gain is based on an attenuated version of the previous adaptive-codebook gain: if the (m+1)st frame is erased, use gP(m+1)=0.9 gP(m). Similarly, the fixed-codebook gain is based on an attenuated version of the previous fixed-codebook gain: gc(m+1)=0.98 gc(m).
    3) Attenuation of the memory of the gain predictor. 
The gain predictor for the fixed-codebook gain uses the energy of the previously selected fixed-codebook vectors c(n), so to avoid transitional effects once good frames are received, the memory of the gain predictor is updated with an attenuated version of the average codebook energy over the four prior frames.
    4) Generation of the replacement excitation. The excitation used depends upon the periodicity classification. If the last reconstructed frame was classified as periodic, the current frame is considered to be periodic as well. In that case only the adaptive-codebook contribution is used, and the fixed-codebook contribution is set to zero. The pitch delay is based on the integer part of the pitch delay in the previous frame and is repeated for each successive frame. To avoid excessive periodicity, the pitch delay value is increased by one for each subsequent subframe but bounded by 143 (a minimum pitch frequency of about 56 Hz at 8 kHz sampling). In contrast, if the last reconstructed frame was classified as nonperiodic, the current frame is considered to be nonperiodic as well, and the adaptive-codebook contribution is set to zero. The fixed-codebook contribution is generated by randomly selecting a codebook index and sign index. The frame classification allows the use of different decay factors for different types of excitation (e.g., 0.9 for periodic gains and 0.98 for nonperiodic gains). FIG. 2 illustrates the decoder with concealment parameters.
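The concealment steps above can be sketched as a single per-frame update. This is a simplified model: the `state` field names are illustrative rather than from the standard, step 3 (gain-predictor memory attenuation) is omitted, the pitch delay is advanced per frame here instead of per subframe, and the codebook index range is a placeholder:

```python
import random

def conceal_erased_frame(state):
    """Repetition-based concealment for one erased frame, G.729-style.
    Mutates `state` (the last good frame's parameters) and returns the
    LP coefficients and gains (g_p, g_c) to use for synthesis."""
    # 1) repeat the synthesis filter parameters of the last good frame
    lp = state["lp_coeffs"]
    # 2) attenuate the gains: gP(m+1) = 0.9 gP(m), gc(m+1) = 0.98 gc(m)
    state["g_p"] *= 0.9
    state["g_c"] *= 0.98
    # 4) replacement excitation depends on the voicing classification
    if state["periodic"]:
        # adaptive-codebook contribution only; bump the pitch delay to
        # avoid excessive periodicity, bounded at 143 samples
        state["pitch_delay"] = min(state["pitch_delay"] + 1, 143)
        return lp, state["g_p"], 0.0
    else:
        # fixed-codebook contribution only, random index and sign
        state["cb_index"] = random.randrange(2 ** 13)  # placeholder range
        state["cb_sign"] = random.choice([-1, 1])
        return lp, 0.0, state["g_c"]
```

Repeated calls show the intended behavior: gains decay geometrically toward silence while the pitch delay saturates at its bound.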
A major challenge in voice over packet (VOP) receivers is provision for continuous playout of voice packets in the presence of variable network delays (jitter). Continuous playout is often achieved by buffering the received voice packets for sufficient time so that most of the packets are likely received before their scheduled playout times. The playout delay due to buffering can be a fixed delay throughout the duration of a voice call or may adaptively vary during the call. Playout delay trades off packet losses (packet arrival after scheduled playout time) against overall delay. A very large playout delay ensures minimal packet losses but long gaps between messages and replies in two-way calls; conversely, small playout delays lead to large packet losses but nearly real-time conversation.
R. Ramjee et al, Adaptive Playout Mechanisms for Packetized Audio Applications in Wide-Area Networks, Proc. INFOCOM'94, 5d.3 (1994) and S. Moon et al, Packet Audio Playout Delay Adjustment: Performance Bounds and Algorithms, 5 ACM/Springer Multimedia Systems 17 (1998) schedule the playout times for all speech packets in a talkspurt (interval of essentially continuous active speech) at the beginning of the talkspurt. In particular, the playout time for the first packet in a talkspurt is set during the silence preceding the talkspurt, and the playout times of subsequent packets are just increments of the 20 ms frame length. The playout time of the first packet is taken as the then-current playout delay which can be determined by various algorithms: one approach has a normal mode and a spike mode. The normal mode would determine a playout delay as the sum of the filtered arrival delay of previously-arrived packets and a multiple of the filtered variation of the arrival delays; this provides a long-term adaptation. Spike mode would determine a playout delay from the arrival delay of the last-arrived packet during a delay spike (which is detected by a large gradient in packet delays); this provides a short-term adaptation. Ramjee et al and Moon et al use packets consisting of 160 PCM samples of speech sampled at the usual 8 kHz; that is, packetize 20 ms frames. And packets arriving after their scheduled playout times are discarded. Thus a large delay spike occurring within a talkspurt implies a sequence of lost packets because the playout delay cannot adjust until the next talkspurt.
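The normal-mode long-term adaptation described above can be sketched as an exponentially weighted update of the arrival delay and its variation. The filter constant 0.998002 and the 4x variation multiple are the values commonly cited for this scheme; treat them, and the function name, as assumptions of this sketch:

```python
def update_playout_estimates(d_hat, v_hat, n_i, alpha=0.998002):
    """One normal-mode update in the style of Ramjee et al.
    d_hat: filtered packet arrival delay; v_hat: filtered variation of
    the arrival delays; n_i: measured arrival delay of packet i.
    Returns the updated filters and the playout delay d_hat + 4*v_hat,
    which would be applied at the start of the next talkspurt."""
    d_hat = alpha * d_hat + (1.0 - alpha) * n_i
    v_hat = alpha * v_hat + (1.0 - alpha) * abs(d_hat - n_i)
    return d_hat, v_hat, d_hat + 4.0 * v_hat
```

With steady arrival delays the estimate converges to that delay and the variation term vanishes; spike mode would instead override d_hat with the last packet's arrival delay for short-term adaptation.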
Leung et al, Speech Coding over Frame Relay Networks, Proc. IEEE Workshop on Speech Coding for Telecommunications 75 (1993) describes an adaptive playout time for a CELP decoder which adjusts the playout delay by a fraction of the error of the current frame (packet) arrival delay from a target arrival delay. The playout delay is adjusted during a talkspurt by either extending or truncating the CELP excitation of a frame to one of a few fixed lengths; that is, a 20 ms frame can be truncated to 10 or 15 ms only or extended to 25 or 40 ms only. Larger playout delay adjustments can be made only during silence frames. Also, lost or discarded packets (arriving after their playout times) can be reconstructed by using the CELP parameters of the last good frame together with bandwidth expansion of the LP coefficients and attenuation of the gains by 4 dB per successive reconstructed frame. Thus with a large delay spike (a sequence of late arriving and thus discarded frames), the initial lost frames could be reconstructions of the last good frame but would attenuate to essentially silence within a few frames. For isolated lost frames, both the prior and subsequent good frames can be used for two-sided prediction (e.g., interpolation) of CELP parameters for reconstruction.
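The 4 dB-per-frame attenuation implies reconstruction reaches near silence quickly, which quantifies the "essentially silence within a few frames" behavior. A one-line sketch (the helper name is hypothetical):

```python
def attenuated_gain(g0, k, db_per_frame=4.0):
    """Gain after k successively reconstructed frames when each frame
    attenuates the gain by a fixed number of dB (amplitude ratio)."""
    return g0 * 10.0 ** (-db_per_frame * k / 20.0)
```

After 10 reconstructed frames (200 ms at 20 ms per frame) the gain is 40 dB down, a factor of 0.01 in amplitude.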
However, these playout delay methods have poor results: delay spikes within a talkspurt still produce runs of lost or discarded packets, and the repetition-based reconstructions decay to silence within a few frames.