A speech codec for Voice over IP (VoIP) is required to be highly robust against packet loss. A next-generation VoIP codec is expected to achieve quality equivalent to the error-free case even at a comparatively high frame erasure rate (e.g. 6%), provided that transmission of redundant information for concealing erasure errors is allowed.
In a Code Excited Linear Prediction (CELP) speech codec, quality degradation caused by frame erasure in a speech onset portion is often a problem. One reason is that the signal in the onset portion varies greatly and has low correlation with the signal of the previous frame, so concealment processing that relies on information about the previous frame does not function effectively. Another reason is that, in the subsequent frames of the voiced portion, the excitation signal encoded in the onset portion is heavily used as the adaptive codebook, so the influence of an erasure in the onset portion carries over into the subsequent voiced frames and is likely to cause major distortion of the decoded speech signal.
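The second reason, error propagation through the adaptive codebook, can be illustrated with a minimal sketch. It is not taken from any particular codec: the function name, the frame length of 40 samples, the pitch lag of 16 samples, the adaptive codebook gain of 1, the omission of the fixed codebook contribution, and the use of an all-zero excitation to stand in for a badly concealed onset frame are all simplifying assumptions made only for illustration.

```python
def adaptive_codebook_vector(past_excitation, pitch_lag, length):
    """Build an adaptive codebook vector by repeating the past excitation
    at the pitch lag (gain fixed to 1; fixed codebook omitted: assumptions)."""
    if not 1 <= pitch_lag <= len(past_excitation):
        raise ValueError("pitch_lag must lie within the past excitation buffer")
    buf = list(past_excitation)
    start = len(buf)
    for n in range(length):
        # Each new sample is a copy of the sample one pitch lag earlier.
        buf.append(buf[start + n - pitch_lag])
    return buf[start:]


FRAME_LENGTH = 40  # hypothetical frame length in samples
PITCH_LAG = 16     # hypothetical pitch lag in samples

# Excitation of the onset frame: a single pitch pulse near the frame end.
onset_excitation = [0.0] * FRAME_LENGTH
onset_excitation[30] = 1.0

# Onset frame received: the next frame inherits a periodic pulse train.
next_frame_ok = adaptive_codebook_vector(onset_excitation, PITCH_LAG, FRAME_LENGTH)

# Onset frame erased and (in this simplified model) replaced by silence:
# the next frame has nothing to repeat, so the pulse train is missing.
next_frame_lost = adaptive_codebook_vector([0.0] * FRAME_LENGTH, PITCH_LAG, FRAME_LENGTH)

pulse_positions_ok = [i for i, v in enumerate(next_frame_ok) if v != 0.0]
pulse_positions_lost = [i for i, v in enumerate(next_frame_lost) if v != 0.0]
```

In this simplified model, the correctly received onset pulse is repeated once per pitch lag in the following frame, whereas after the erasure the following frame contains no pulses at all, even though that frame itself was received.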
As a conventional technique for addressing the above problems, the position of the last glottal pulse in the previous frame is transmitted together with the encoded information of the current frame (see, e.g., Non-Patent Document 1). In this technique, the speech encoding apparatus detects, as the glottal pulse position, the position of the highest-amplitude pulse within the last one pitch period, up to and including the frame end, of the excitation signal (i.e. the linear prediction residual signal) of the previous frame. It then encodes this position information and transmits the result, together with the encoded information of the current frame, to the speech decoding apparatus. When a frame is erased, the speech decoding apparatus generates a decoded speech signal by placing a glottal pulse at the glottal pulse position that it receives from the speech encoding apparatus in the next frame.
Non-Patent Document 1: ITU-T Recommendation G.729.1
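The encoder-side detection and decoder-side pulse placement of the conventional technique described above can be sketched as follows. This is an illustrative sketch only and is not the procedure of Non-Patent Document 1: the function names, the use of the absolute sample value as "amplitude", the unit pulse amplitude, and the omission of position quantization and of the synthesis filter are all assumptions.

```python
def find_glottal_pulse_position(prev_excitation, pitch_period):
    """Encoder side: return the index of the largest-magnitude sample within
    the last pitch period of the previous frame's excitation signal.
    Using abs() as the amplitude measure is an assumption of this sketch."""
    n = len(prev_excitation)
    if n == 0 or pitch_period < 1:
        raise ValueError("need a non-empty frame and a positive pitch period")
    start = max(0, n - pitch_period)  # search window ends at the frame end
    return max(range(start, n), key=lambda i: abs(prev_excitation[i]))


def place_glottal_pulse(frame_length, position, amplitude=1.0):
    """Decoder side: rebuild a substitute excitation for the erased frame by
    placing one pulse at the transmitted position (unit amplitude assumed)."""
    if not 0 <= position < frame_length:
        raise ValueError("position must lie inside the frame")
    excitation = [0.0] * frame_length
    excitation[position] = amplitude
    return excitation


# Hypothetical previous-frame excitation (40 samples, pitch period 16).
prev_excitation = [0.0] * 40
prev_excitation[10] = 0.9   # large, but outside the last pitch period
prev_excitation[33] = -0.7  # largest magnitude inside the last pitch period
prev_excitation[37] = 0.2

# Encoder: detect the position; it would be sent with the current frame.
pulse_position = find_glottal_pulse_position(prev_excitation, pitch_period=16)

# Decoder: the previous frame was erased, so use the position received
# in the current frame to reconstruct its excitation.
concealed_excitation = place_glottal_pulse(len(prev_excitation), pulse_position)
```

Restricting the search to the last pitch period means that the pulse at sample 10 is ignored even though it is larger, because the aim is to locate the final glottal pulse before the frame boundary.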