The transmission of compressed speech over packet-switching and mobile communications networks involves two major systems. The source speech system encodes the speech signal on a frame-by-frame basis, packetizes the compressed speech into bytes of information, or packets, and sends these packets over the network. Upon reaching the destination speech system, the bytes of information are unpacketized into frames and decoded. The G.723.1 dual rate speech coder, described in ITU-T Recommendation G.723.1, “Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kbit/s,” March 1996 (hereafter “Reference 1”, and incorporated herein by reference), was ratified by the ITU-T in 1996 and has since been used to add voice over various packet-switching as well as mobile communications networks. With a mean opinion score of 3.98 out of 5.0 (see Thryft, A. R., “Voice over IP Looms for Intranets in '98,” Electronic Engineering Times, August 1997, Issue 967, pp. 79, 102, hereafter “Reference 2”, and incorporated herein by reference), the near toll quality of the G.723.1 standard is ideal for real-time multimedia applications over private and local area networks (LANs) where packet loss is minimal. However, over wide area networks (WANs), global area networks (GANs), and mobile communications networks, congestion can be severe, and packet loss may result in heavily degraded speech if left untreated. It is therefore necessary to develop techniques to reconstruct lost speech frames at the receiver in order to minimize distortion and maintain output intelligibility.
The following discussion of the G.723.1 dual rate coder and its error concealment will assist in a full understanding of the invention.
The G.723.1 dual rate speech coder encodes 16-bit linear pulse-code modulated (PCM) speech, sampled at a rate of 8 kHz, using linear predictive analysis-by-synthesis coding. The excitation for the high rate coder is Multipulse Maximum Likelihood Quantization (MP-MLQ) while the excitation for the low rate coder is Algebraic-Code-Excited Linear-Prediction (ACELP). The encoder operates on a 30 ms frame size, equivalent to a frame length of 240 samples, and divides every frame into four subframes of 60 samples each. For every 30 ms speech frame, a 10th-order Linear Prediction Coding (LPC) filter is computed and its coefficients are quantized in the form of Line Spectral Pair (LSP) parameters for transmission to the decoder. An adaptive codebook pitch lag and pitch gain are then calculated for every subframe and transmitted to the decoder. Finally, the excitation signal, consisting of the fixed codebook gain, pulse positions, pulse signs, and grid index, is approximated using either MP-MLQ for the high rate coder or ACELP for the low rate coder, and transmitted to the decoder. In sum, the resulting bitstream sent from encoder to decoder consists of the LSP parameters, adaptive codebook lags, fixed and adaptive codebook gains, pulse positions, pulse signs, and the grid index.
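By way of illustration only, the frame arithmetic stated above (8 kHz sampling, 30 ms frames, four subframes) can be checked with the following minimal sketch. The names `frame_geometry` and `split_subframes` are hypothetical helpers introduced here for illustration and are not part of Reference 1.

```python
# Illustrative sketch only: frame and subframe sizes implied by the text.
SAMPLE_RATE_HZ = 8000        # 8 kHz sampling rate
FRAME_MS = 30                # 30 ms frame size
SUBFRAMES_PER_FRAME = 4      # four subframes per frame


def frame_geometry(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS,
                   subframes=SUBFRAMES_PER_FRAME):
    """Return (samples per frame, samples per subframe)."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return frame_len, frame_len // subframes


def split_subframes(frame, subframes=SUBFRAMES_PER_FRAME):
    """Split one frame (a list of samples) into equal-length subframes."""
    n = len(frame) // subframes
    return [frame[i * n:(i + 1) * n] for i in range(subframes)]


if __name__ == "__main__":
    print(frame_geometry())  # (240, 60)
```

With the default values, one frame holds 240 samples and each subframe holds 60, in agreement with the figures given above.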
At the decoder, the LSP parameters are decoded and the LPC synthesis filter generates reconstructed speech. For every subframe, the fixed and adaptive codebook contributions are sent to a pitch postfilter, whose output is input to the LPC synthesis filter. The output of the synthesis filter is then sent to a formant postfilter and gain scaling unit to generate the synthesized output. In the case of indicated frame erasures, an error concealment strategy, described in the following subsection, is provided. FIG. 1 displays a block diagram of the G.723.1 decoder.
In the presence of packet losses, current G.723.1 error concealment involves two major steps. The first step is LSP vector recovery and the second step is excitation recovery. In the first step, the missing frame's LSP vector is recovered by applying a fixed linear predictor to the previously decoded LSP vector. In the second step, the missing frame's excitation is recovered using only the recent information available at the decoder. This is achieved by first determining the previous frame's voiced/unvoiced classifier using a cross-correlation maximization function and then testing the prediction gain for the best vector. If the gain is more than 0.58 dB, the frame is declared as voiced; otherwise, the frame is declared as unvoiced. The classifier then returns a value of 0 if the previous frame is unvoiced, or the estimated pitch lag if the previous frame is voiced. In the unvoiced case, the missing frame's excitation is then generated using a uniform random number generator and scaled by the average of the gains for subframes 2 and 3 of the previous frame. Otherwise, for the voiced case, the previous frame is attenuated by 2.5 dB and regenerated with a periodic excitation having a period equal to the estimated pitch lag. If packet losses continue for the next two frames, the regenerated excitation is attenuated by an additional 2.5 dB for each frame, but after three interpolated frames, the output is completely muted, as described in Reference 1.
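The voicing decision and attenuation schedule described above may be sketched as follows. This is a simplified illustration, not the reference implementation of Reference 1: the function names are hypothetical, the prediction gain and pitch lag are assumed to have been computed elsewhere, and the 2.5 dB steps are assumed to apply to signal amplitude (20 log10 convention).

```python
# Illustrative sketch of the baseline concealment rules described in the text.
VOICING_THRESHOLD_DB = 0.58   # prediction-gain threshold from the text
ATTENUATION_STEP_DB = 2.5     # attenuation per consecutive lost frame
MAX_CONCEALED_FRAMES = 3      # output is muted after three concealed frames


def classify_voicing(prediction_gain_db, pitch_lag):
    """Return the pitch lag if the previous frame is voiced, else 0."""
    if prediction_gain_db > VOICING_THRESHOLD_DB:
        return pitch_lag
    return 0


def baseline_gain(consecutive_lost):
    """Amplitude scale for the k-th consecutive lost frame (1-based).

    Assumption: the dB figures refer to amplitude, so the linear factor
    is 10 ** (-dB / 20).
    """
    if consecutive_lost < 1:
        return 1.0            # no loss, no attenuation
    if consecutive_lost > MAX_CONCEALED_FRAMES:
        return 0.0            # complete muting
    return 10 ** (-ATTENUATION_STEP_DB * consecutive_lost / 20)


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, round(baseline_gain(k), 3))
```

Under these assumptions the first three lost frames are scaled by roughly 0.75, 0.56, and 0.42, and every later frame is scaled by zero, which is the source of the muting discussed below.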
The G.723.1 error concealment strategy was tested by sending various speech segments over a network with packet loss levels of 1%, 3%, 6%, 10%, and 15%. Single as well as multiple packet losses were simulated for each level. Through a series of informal listening tests, it was shown that although the overall output quality was very good for lower levels of packet loss, a number of problems persisted at all levels and became increasingly severe as packet loss increased.
First, parts of the output segment sounded unnatural and contained many annoying, metallic-sounding artifacts. The unnatural sounding quality of the output can be attributed to LSP vector recovery based on a fixed predictor as previously described. Since the missing frame's LSP vector is recovered by applying a fixed predictor to the previous frame's LSP vector, the spectral changes between the previous and reconstructed frames are not smooth. This failure to generate smooth spectral changes across missing frames produces unnatural sounding output, which reduces intelligibility during high levels of packet loss. In addition, many high-frequency, metallic-sounding artifacts were heard in the output. These metallic-sounding artifacts primarily occur in unvoiced regions of the output, and are caused by incorrect voicing estimation of the previous frame during excitation recovery. In other words, because a missing, unvoiced frame may incorrectly be classified as voiced, the transition into the missing frame generates a high-frequency glitch, or metallic-sounding artifact, when the estimated pitch lag computed for the previous frame is applied. As packet loss increases, this problem becomes even more severe, as incorrect voicing estimation generates increased distortion.
Another problem with G.723.1 error concealment was the presence of high-energy spikes in the output. These high-energy spikes, which are especially uncomfortable for the ear, are caused by incorrect estimation of the LPC coefficients during formant postfiltering, which in turn results from poor prediction of the LSP or gain parameters by the G.723.1 fixed LSP prediction and excitation recovery. Once again, as packet loss increases, the number of high-energy spikes also increases, leading to greater listener discomfort and distortion.
Finally, “choppy” speech, resulting from complete muting of the output, was evident. Since G.723.1 error concealment reconstructs no more than three consecutive missing frames, all remaining missing frames are simply muted, leading to patches of silence in the output, or “choppy” speech. Because the probability of losing more than three consecutive packets in a network grows as packet loss increases, higher loss levels lead to increased “choppy” speech and hence decreased intelligibility and increased distortion at the output.
It is an object of the present invention to eliminate the above problems and improve upon the error concealment strategy defined in Reference 1. This and other objects are achieved by an improved lost frame recovery technique employing linear interpolation, selective energy attenuation, and energy tapering.
Linear interpolation of the speech model parameters is a technique designed to smooth spectral changes across frame erasures and hence eliminate unnatural-sounding speech and metallic-sounding artifacts from the output. Linear interpolation operates as follows: (1) at the decoder, a buffer is introduced to store a future speech frame or packet, and the previous and future information stored in the buffer is used to interpolate the speech model parameters for the missing frame, thereby generating smoother spectral changes across missing frames than if a fixed predictor were simply used, as in G.723.1 error concealment; and (2) voicing classification is then based on both the estimated pitch value and predictor gain for the previous frame, as opposed to simply the predictor gain as in G.723.1 error concealment, which improves the probability of correct voicing estimation for the missing frame. By applying the first part of the linear interpolation technique, more natural-sounding speech is achieved; by applying the second part, almost all unwanted metallic-sounding artifacts are effectively masked.
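The first part of the technique, interpolating a missing frame's parameters from the previous and buffered future frames, may be sketched as follows for the LSP vector. The function name `interpolate_lsp` and the equally spaced weighting k/(n+1) are assumptions made for illustration only; the passage above does not specify the interpolation weights.

```python
# Illustrative sketch: linear interpolation of an LSP vector across a gap.
def interpolate_lsp(prev_lsp, future_lsp, k=1, n_missing=1):
    """Estimate the LSP vector of the k-th of n_missing lost frames.

    prev_lsp   -- LSP vector of the last good frame before the gap
    future_lsp -- LSP vector of the buffered good frame after the gap
    Assumption: weights are equally spaced, w = k / (n_missing + 1).
    """
    if len(prev_lsp) != len(future_lsp):
        raise ValueError("LSP vectors must have the same order")
    if not 1 <= k <= n_missing:
        raise ValueError("k must lie between 1 and n_missing")
    w = k / (n_missing + 1)
    return [(1.0 - w) * p + w * f for p, f in zip(prev_lsp, future_lsp)]


if __name__ == "__main__":
    prev = [0.0, 1.0]     # toy 2nd-order vectors; G.723.1 uses 10th order
    future = [1.0, 3.0]
    print(interpolate_lsp(prev, future))  # [0.5, 2.0]
```

For a single missing frame this reduces to the midpoint of the two neighbouring vectors, so the reconstructed spectrum lies between those of the frames on either side of the gap rather than being extrapolated from the past alone.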
To eliminate the effects of high-energy spikes, a selective energy attenuation technique was developed. This technique checks the signal energy for every synthesized subframe against a threshold value, and attenuates all signal energies for the entire frame to an acceptable level if the threshold is exceeded. Combined with linear interpolation, this selective energy attenuation technique effectively eliminates all instances of high-energy spikes from the output.
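A minimal sketch of the selective energy attenuation check is given below. The threshold value, the use of sum-of-squares energy, and the choice to scale the frame so that its largest subframe energy equals the threshold are all assumptions for illustration; the passage above states only that the entire frame is attenuated to an acceptable level when the threshold is exceeded.

```python
# Illustrative sketch: attenuate a whole frame if any subframe is too loud.
import math


def subframe_energies(frame, subframe_len=60):
    """Sum-of-squares energy of each subframe of a synthesized frame."""
    return [sum(s * s for s in frame[i:i + subframe_len])
            for i in range(0, len(frame), subframe_len)]


def selective_attenuate(frame, threshold, subframe_len=60):
    """Return the frame, scaled down only if a subframe exceeds threshold.

    Assumption: the "acceptable level" is the threshold itself, so the
    whole frame is scaled until its peak subframe energy equals it.
    """
    peak = max(subframe_energies(frame, subframe_len))
    if peak <= threshold:
        return list(frame)                 # below threshold: untouched
    scale = math.sqrt(threshold / peak)    # energy scales with amplitude^2
    return [s * scale for s in frame]


if __name__ == "__main__":
    # Three quiet subframes followed by one spike.
    frame = [1.0] * 180 + [10.0] * 60
    out = selective_attenuate(frame, threshold=600.0)
    print(max(subframe_energies(out)))
```

Scaling the entire frame, rather than only the offending subframe, avoids introducing a gain discontinuity at the subframe boundary.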
Finally, an energy tapering technique was designed to eliminate the effects of “choppy” speech. Whenever multiple packets are lost in excess of one frame, this technique simply repeats the previous good frame for every missing frame while gradually decreasing the repeated frame's signal energy. By employing this technique, the energy of the output signal is gradually smoothed or tapered over multiple packet losses, thus eliminating the patches of silence, or “choppy” speech effect, evident in G.723.1 error concealment. Another advantage of energy tapering is the relatively small amount of computation time required for reconstructing lost packets. Because this technique involves only gradual attenuation of the signal energies for repeated frames, as opposed to performing G.723.1 fixed LSP prediction and excitation recovery, the total algorithmic delay is considerably less than that of G.723.1 error concealment.
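Energy tapering as described above may be sketched as follows. The function name `taper_frames` and the fixed per-frame step `step_db` are hypothetical; the passage above says only that the repeated frame's energy is decreased gradually and does not give a rate.

```python
# Illustrative sketch: repeat the last good frame with a gradually
# decreasing gain instead of muting the output.
def taper_frames(last_good, n_missing, step_db=1.0):
    """Return n_missing concealment frames built from last_good.

    Assumption: each successive repeated frame is attenuated by a
    further step_db decibels (amplitude convention, 20 * log10).
    """
    frames = []
    for k in range(1, n_missing + 1):
        gain = 10 ** (-step_db * k / 20)
        frames.append([s * gain for s in last_good])
    return frames


if __name__ == "__main__":
    last_good = [1.0, -1.0, 0.5, -0.5]    # toy frame; real frames hold 240 samples
    for frame in taper_frames(last_good, n_missing=5, step_db=2.0):
        energy = sum(s * s for s in frame)
        print(round(energy, 4))
```

Because the gain never reaches zero within the gap, the output decays smoothly rather than dropping to silence after a fixed number of frames, and each concealed frame costs only one multiplication per sample.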