1. Field of the Invention
The present invention is generally in the field of signal coding. In particular, the present invention is in the field of pitch determination for speech coding.
2. Background Art
Traditionally, all parametric speech coding methods make use of the redundancy inherent in the speech signal to reduce the amount of information that must be sent and to estimate the value of speech samples of a signal at short intervals. This redundancy primarily arises from the repetition of wave shapes at a periodic rate.
The redundancy of speech wave forms may be considered with respect to several different types of speech signal, such as voiced and unvoiced. For voiced speech, the speech signal is essentially periodic; however, this periodicity may be variable over the duration of a speech segment and the shape of the periodic wave usually changes gradually from segment to segment. As for the unvoiced speech, the signal is more like a random noise and has a smaller amount of predictability.
In either case, parametric coding may be used to reduce the redundancy of the speech segments by separating the excitation component of the speech from the spectral envelop component. The coding advantage arises from the slow rate at which the parameters change. However, it is difficult to estimate exactly the rate at which the parameters change. Yet, it is rare for the parameters to be significantly different from the values held within a few milliseconds. Accordingly, the sampling rate of the speech is such that the nominal frame duration is in the range of five to thirty milliseconds. In a more recent ITU standard Evrc, G.723 or EFR that has adopted the Code Excited Linear Prediction Technique (xe2x80x9cCELPxe2x80x9d), each frame includes 160 samples and is 20 milliseconds long.
A robust estimation of the pitch or fundamental frequency of speech is one of the classic problems in the art of speech coding. Accurate pitch estimation is a key to any speech coding algorithm. In CELP, for example, the pitch estimation is performed for each frame. For pitch estimation purposes, each 20 ms frame is processed in two 10 ms subframes. First, the pitch lag of the first 10 ms subframe is estimated using an open loop pitch estimation method. Subsequently, the pitch lag of the second 10 ms is estimated in a similar fashion. However, at the time of estimating the pitch lag of the second subframe, additional information or the pitch lag information of the first subframe is available to more accurately estimate the pitch lag of the second subframe. Traditionally, such information is used to better estimate and correct the pitch lag of the second subframe. The traditional approach allows for the past pitch information to be used for estimating the future pitch lag, since, as stated above, speech parameters are not significantly different from the values held within a few milliseconds previously. In particular, the pitch changes very slowly during voiced speech.
Referring to FIG. 2, an application of a conventional pitch lag estimation method is illustrated with reference to a speech signal 220. As shown, frame1212 is shown in two subframes for which pitch lag0231 and pitch lag1232 are estimated. The pitch lag0231 is obtained before the pitch lag1232 and is available for correcting the pitch lag1232. As further shown, the pitch lag information for each subframe of subsequent frames 213, 214, . . . 216 are computed in a sequential fashion. For example, the pitch lag1232 information would be available to help estimate pitch lag0 of frame2213, pitch lag0233 would be available to help estimate pitch lag1234, and so on. Accordingly, the past pitch information is conventionally used to estimate subsequent pitch lags.
The conventional approach suffers from incorrectly assuming that the past pitch lag information is always a proper indication of what follows. The conventional approach also lacks the ability to properly estimate the pitch in speech transition areas as well as other areas. Accordingly, there is a serious need in the art to provide a more accurate pitch estimation, especially in speech transition areas from unvoiced to voiced speech.
In accordance with the purpose of the present invention as broadly described herein, there is provided method and system for speech coding.
The encoder of the present invention processes an input signal on a frame-by-frame basis. Each frame is divided into first half and second half subframes. For a first frame, a pitch of the first half subframe of a subsequent frame (look-ahead subframe) is estimated. Using the look-ahead pitch information, a pitch of the second half subframe of the first frame is estimated and corrected.
In one aspect of the present invention, a pitch of the first half subframe of the first frame is also estimated and used to better estimate and correct the pitch of the second half subframe of the first frame. In another aspect of the invention, the pitch of the look-ahead frame is used as the pitch of the first half subframe of the subsequent frame.
In yet another aspect of the invention, a normalized correlation is calculated using the pitch of the look-ahead subframe. The normalized correlation is used to correct and estimate the pitch of the second half subframe of the first frame.