Voice over IP (VoIP) carries speech over an IP network or the Internet through processing such as compressed speech encoding, packetization, routing and distribution, storage and switching, and depacketization and decompression. Coding technology is key to VoIP and can be classified into waveform coding, parametric coding, and hybrid coding. Waveform coding occupies a large bandwidth and is unsuitable where bandwidth is insufficient.
In order to enhance the transmission efficiency of VoIP where bandwidth is limited, low bit rate coding/decoding methods have been proposed in the industry. The International Telecommunication Union Telecommunication Standardization Sector (ITU-T) published the Telephone Bandwidth Speech Coding Standard G.729 in March 1996, in which a conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP) speech coding/decoding scheme is employed to code speech signals at a bit rate of 8 kb/s. ITU-T subsequently published G.729 Annex A and Annex B in November 1996 to further optimize G.729.
CS-ACELP is a coding mode based on code-excited linear prediction (CELP). Every 80 sampling points constitute one speech frame. A speech signal is analyzed and various parameters are extracted, such as the linear-prediction filter coefficients, the codebook sequence numbers in the adaptive and fixed codebooks, the adaptive code vector gain, and the fixed code vector gain. These parameters are encoded and the parameter codes are sent to a decoding end. At the decoding end, as shown in FIG. 1, a received bit stream is first recovered into the parameter codes, and the parameter codes are then decoded into the parameters. An adaptive code vector is obtained from the adaptive codebook via the adaptive codebook sequence number. A fixed code vector is obtained from the fixed codebook via the fixed codebook sequence number. Afterward, the adaptive and fixed code vectors are multiplied by their respective gains $g_p$ and $g_c$, and then added point by point to construct an excitation sequence. The linear-prediction filter coefficients are used to constitute a short-term filter. A so-called adaptive codebook method is adopted to implement long-term, or fundamental-tone, synthesis filtering. After the synthetic speech is calculated, a long-term post-filter is employed to further improve the speech quality.
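As a minimal sketch of how the excitation sequence described above is formed, the following Python function multiplies each code vector by its gain and adds the results point by point. The function name `build_excitation` and the use of plain lists of floats are illustrative assumptions and are not taken from any G.729 reference code.

```python
def build_excitation(adaptive_vec, fixed_vec, g_p, g_c):
    """Return the excitation sequence g_p * adaptive + g_c * fixed, point by point.

    adaptive_vec, fixed_vec: code vectors of equal length (hypothetical inputs).
    g_p: adaptive code vector gain; g_c: fixed code vector gain.
    """
    if len(adaptive_vec) != len(fixed_vec):
        raise ValueError("code vectors must have the same length")
    return [g_p * a + g_c * c for a, c in zip(adaptive_vec, fixed_vec)]
```

The resulting sequence would then drive the short-term synthesis filter built from the linear-prediction filter coefficients.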
However, during transmission over a network, an IP packet may inevitably be damaged, discarded due to network congestion, lost due to network failures, or discarded simply because it arrives at the receiving end too late to be included in the replayed speech. Frame loss is the main cause of speech quality degradation during network transmission, and lost frames do not reappear at the decoding end. When one frame or several consecutive frames are lost, the CS-ACELP decoder is confronted with two problems. The first is the loss of all code elements contained in a group of sequentially arranged excitation signals, so that alternative excitation signals that cause the smallest speech quality distortion and transition smoothly must be obtained by calculation. The second is that, when a frame loss occurs, all original adaptive codebook parameters, short-term linear-prediction filter coefficients, and gains are lost. Since G.729 adopts a backward-adaptive coding mode, the speech signal converges only some time after the next good frame is received. Therefore, in the case of frame loss, the speech quality of the G.729 decoder degrades rapidly.
To address frame loss, the G.729 Standard adopts a high-performance, low-complexity frame loss concealment technology. Referring to FIG. 2, this technology includes the following steps.
In Step 201, a current lost frame is detected, and a long-term prediction gain of the last 5 ms good sub-frame before the lost frame is obtained from a long-term post-filter.
In practice, good frames such as speech frames or mute frames are forwarded to a frame loss concealment processing device by an upper-layer protocol layer such as a Real-time Transport Protocol (RTP) layer. Lost frame detection is also performed by the upper-layer protocol layer. On receiving a good frame, the upper-layer protocol layer directly forwards the good frame to the frame loss concealment processing device. On detecting a lost frame, the upper-layer protocol layer sends a frame loss indication to the frame loss concealment processing device, which receives the indication and determines that a frame loss has occurred.
In Step 202, it is determined whether the long-term prediction gain of the last 5 ms good sub-frame before the lost frame is larger than 3 dB. If yes, the current lost frame is considered as a periodic frame, i.e., speech, and Step 203 is performed; otherwise, the current lost frame is considered as a non-periodic frame, i.e., non-speech, and Step 205 is performed.
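The decision in Step 202 can be sketched as a simple threshold test. The function name `classify_lost_frame` and the string labels are hypothetical; only the 3 dB threshold and the branch targets (Step 203 or Step 205) come from the description above.

```python
def classify_lost_frame(long_term_prediction_gain_db, threshold_db=3.0):
    """Classify a lost frame from the long-term prediction gain (in dB) of the
    last 5 ms good sub-frame before it.

    Returns "periodic" (speech, continue with Step 203) when the gain is
    larger than the threshold, otherwise "non-periodic" (non-speech, Step 205).
    """
    if long_term_prediction_gain_db > threshold_db:
        return "periodic"
    return "non-periodic"
```

Note that a gain of exactly 3 dB is treated as non-periodic, because the text requires the gain to be larger than 3 dB.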
In Step 203, a fundamental-tone delay of the current lost frame is calculated on the basis of a fundamental-tone delay of the last good frame before the lost frame. An adaptive codebook gain of the current lost frame is obtained by attenuating the energy of an adaptive codebook gain of the last good frame before the lost frame. Further, an adaptive codebook of the last good frame before the lost frame is taken as an adaptive codebook of the current lost frame.
In particular, the fundamental-tone delay of the current lost frame is calculated as follows. First, the integer part T of the fundamental-tone delay of the last good frame before the lost frame is taken. If the current lost frame is the nth frame in a run of consecutive lost frames, the fundamental-tone delay of the current lost frame equals T plus (n−1) sampling point durations. In order to avoid excessive periodicity in the concealed signal, the fundamental-tone delay of the lost frame is limited to a value no greater than that obtained by adding T to 143 sampling point durations.
In G.729, a frame is 10 ms long and contains 80 sampling points. Thus, one sampling point lasts for 0.125 ms.
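The delay rule and the sampling arithmetic above can be illustrated with a short Python sketch. The names `lost_frame_pitch_delay` and `max_extra_samples` are assumptions made for illustration, and the upper bound follows the wording of this description literally (T plus 143 sampling points); delays are expressed in sampling points.

```python
FRAME_MS = 10.0           # frame length stated in the text
SAMPLES_PER_FRAME = 80    # sampling points per frame stated in the text
SAMPLE_MS = FRAME_MS / SAMPLES_PER_FRAME  # duration of one sampling point

def lost_frame_pitch_delay(last_good_delay, n, max_extra_samples=143):
    """Fundamental-tone delay (in sampling points) of the n-th consecutive lost frame.

    last_good_delay: delay of the last good frame (may be fractional).
    n: 1-based position of the current lost frame in the run of lost frames.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    t = int(last_good_delay)               # integer part T
    delay = t + (n - 1)                    # grow by one sampling point per lost frame
    return min(delay, t + max_extra_samples)  # bound as stated in the text
```

For example, multiplying the returned delay by `SAMPLE_MS` converts it from sampling points to milliseconds.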
The adaptive codebook gain of the first lost frame in a run of consecutive lost frames is set to be identical to the adaptive codebook gain of the last good frame before the lost frame. The adaptive codebook gain of the second and each subsequent lost frame in the run is obtained by attenuating the adaptive codebook gain of the preceding lost frame with an attenuation coefficient of 0.9. That is, the adaptive codebook gain of the current lost frame is $g_p^{(n)} = 0.9\, g_p^{(n-1)}$,
where n is the frame number of the current lost frame in the run of consecutive lost frames, $g_p^{(n)}$ is the adaptive codebook gain of the current lost frame, n−1 is the frame number of the preceding lost frame, $g_p^{(n-1)}$ is the adaptive codebook gain of the preceding lost frame, and n>1.
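The recursion above has a simple closed form: the gain of the nth lost frame is the last good gain multiplied by the attenuation coefficient raised to the power n−1. The following Python sketch shows this; the function name `concealed_gain` and its parameters are hypothetical, and the attenuation coefficient is left as a parameter so that other coefficients can be tried.

```python
def concealed_gain(last_good_gain, n, attenuation=0.9):
    """Gain used for the n-th consecutive lost frame (n starts at 1).

    The first lost frame reuses the gain of the last good frame; each later
    lost frame multiplies the previous lost frame's gain by `attenuation`.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    gain = last_good_gain
    for _ in range(n - 1):
        gain *= attenuation
    return gain
```

With the coefficient of 0.9, the gain falls to about 81% of the last good value by the third lost frame.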
In Step 204, an excitation signal of the current lost frame is calculated on the basis of the fundamental-tone delay, the adaptive codebook gain, and the adaptive codebook. The flow then ends.
In Step 205, the fundamental-tone delay of the current lost frame is calculated on the basis of the fundamental-tone delay of the last good frame before the lost frame. A fixed codebook gain of the current lost frame is obtained by attenuating the energy of a fixed codebook gain of the last good frame before the lost frame. Further, a sequence number and a symbol of a fixed codebook of the current lost frame are obtained on the basis of a currently generated random number.
In particular, the fixed codebook gain of the first lost frame in a run of consecutive lost frames is set to be identical to the fixed codebook gain of the last good frame before the lost frame. The fixed codebook gain of the second and each subsequent lost frame in the run is obtained by attenuating the fixed codebook gain of the preceding lost frame with an attenuation coefficient of 0.98. That is, the fixed codebook gain of the current lost frame is $g_c^{(n)} = 0.98\, g_c^{(n-1)}$,
where n is the frame number of the current lost frame in the run of consecutive lost frames, $g_c^{(n)}$ is the fixed codebook gain of the current lost frame, n−1 is the frame number of the preceding lost frame, $g_c^{(n-1)}$ is the fixed codebook gain of the preceding lost frame, and n>1.
The sequence number and the symbol (that is, the sign) of the fixed codebook are calculated as follows: first, seed(n) is obtained from seed(n)=seed(n−1)×31821+13849, where seed(0)=21845; then bits 0 through 12 of seed(n) (its 13 least significant bits) are taken as the sequence number of the fixed codebook, and bits 0 through 3 (its 4 least significant bits) are taken as the symbol of the fixed codebook.
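The random-number step can be sketched in Python as follows. The function names are hypothetical, and the 16-bit wraparound of the seed is an assumption that this description does not state; the multiplier, increment, initial seed, and bit ranges are taken from the text, which derives both values from the same seed(n).

```python
SEED_INITIAL = 21845  # seed(0) as stated in the text

def next_seed(prev_seed):
    """seed(n) = seed(n-1) * 31821 + 13849.

    Assumption: the seed is kept as an unsigned 16-bit value (wraps at 65536).
    """
    return (prev_seed * 31821 + 13849) & 0xFFFF

def random_fixed_codebook(prev_seed):
    """Return (seed, sequence_number, symbol_bits) for the current lost frame."""
    seed = next_seed(prev_seed)
    sequence_number = seed & 0x1FFF  # bits 0..12: the 13 least significant bits
    symbol_bits = seed & 0x000F      # bits 0..3: the 4 least significant bits
    return seed, sequence_number, symbol_bits
```

The returned seed would be passed back in as `prev_seed` for the next lost frame, so that successive lost frames receive different pseudo-random fixed codebook parameters.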
In Step 206, the excitation signal of the current lost frame is calculated on the basis of the fundamental-tone delay, the fixed codebook gain, and the sequence number and symbol of the fixed codebook.
The method shown in FIG. 2 uses the fundamental-tone delay of the last good frame before the lost frame to estimate the fundamental-tone delay of the current lost frame, and recovers the excitation signal of the lost frame entirely from either the adaptive codebook or the fixed codebook, depending on whether the last good frame before the lost frame is speech or non-speech, so that the physiological characteristics of speech can be well compensated. However, under poor network conditions, the compensation effect decreases rapidly. Moreover, since only the adaptive codebook excitation or only the fixed codebook excitation is used to recover the excitation signal of the lost frame, and the fixed codebook excitation is merely derived from a random number, any frame loss may result in a large deviation of the recovered excitation signal. The higher the frame loss rate, the larger the deviation. As a result, the signal energy fluctuates greatly before and after the frame loss, and the listener perceives a sharp contrast. Generally, this method achieves a satisfactory effect when the frame loss rate is below 2%, but the effect is unsatisfactory when the frame loss rate exceeds 2%.