Digital speech compression systems typically require estimation of the fundamental frequency of an input signal. The fundamental frequency ƒ0 is usually estimated in terms of the pitch delay τ0 (otherwise known as “lag”). The two are related by the expression
$$\tau_0 = \frac{f_s}{f_0}, \qquad (1)$$
where the sampling frequency ƒs is commonly 8000 Hz for telephone-grade applications.
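As a small worked example of Equation (1), the following sketch converts between fundamental frequency and pitch delay. The function names are chosen here for illustration, and the 8000 Hz default is simply the telephone-grade rate mentioned above.

```python
def pitch_lag(f0_hz, fs_hz=8000.0):
    """Pitch delay in samples, per Equation (1): tau0 = fs / f0."""
    if f0_hz <= 0:
        raise ValueError("fundamental frequency must be positive")
    return fs_hz / f0_hz


def fundamental_frequency(lag_samples, fs_hz=8000.0):
    """Inverse of Equation (1): f0 = fs / tau0."""
    if lag_samples <= 0:
        raise ValueError("lag must be positive")
    return fs_hz / lag_samples
```

At 8000 Hz, for instance, a 100 Hz fundamental corresponds to a lag of 80 samples, and a 400 Hz fundamental to a lag of 20 samples.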
Since a speech signal is generally non-stationary, it is partitioned into finite-length vectors called frames, each of which is presumed to be quasi-stationary. The length of such frames is normally on the order of 10 to 40 milliseconds. The parameters describing the speech signal are then updated at the associated frame-length intervals. The original Code Excited Linear Prediction (CELP) algorithm further updates the pitch period information (using what is called Long Term Prediction, or LTP) on shorter sub-frame intervals, thus allowing smoother transitions from frame to frame. It was also noted that although τ0 could be estimated using open-loop methods, far better performance was achieved using the closed-loop approach. Closed-loop methods involve a trial-and-error search of different possible values of τ0 (typically integer values from 20 to 147) on a sub-frame basis, choosing the value that satisfies some minimum error criterion.
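The closed-loop search described above can be sketched as follows. This is a simplified illustration rather than any particular codec's implementation: it assumes the error criterion is the squared error between the target and a gain-scaled long-term prediction, and it omits the perceptually weighted synthesis filtering that a practical CELP coder would apply. The function name and arguments are hypothetical.

```python
def closed_loop_lag_search(target, past, min_lag=20, max_lag=147):
    """Return (lag, gain) minimizing sum((target - gain * pred) ** 2).

    target: samples of the current sub-frame
    past:   previously reconstructed samples, at least max_lag long
    """
    if len(past) < max_lag:
        raise ValueError("past must hold at least max_lag samples")
    best_lag, best_gain, best_score = None, 0.0, -1.0
    for lag in range(min_lag, max_lag + 1):
        # Predict each sample from the one `lag` samples earlier. When the
        # lag is shorter than the sub-frame, predicted samples are reused.
        history = list(past)
        for _ in range(len(target)):
            history.append(history[-lag])
        pred = history[len(past):]
        corr = sum(t * p for t, p in zip(target, pred))
        energy = sum(p * p for p in pred)
        if energy == 0:
            continue
        # With the optimal gain corr / energy, the error energy equals
        # sum(t * t) - corr ** 2 / energy, so the largest score wins.
        score = corr * corr / energy
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain
```

For a signal that repeats exactly every 40 samples, the search returns a lag of 40 with a gain of 1. The sketch allows negative gains; a practical coder would typically constrain the gain range.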
An enhancement to this method involves allowing τ0 to take on integer plus fractional values, as given in U.S. Pat. No. 5,359,696. An example of a practical implementation of this method can be found in the GSM half rate speech coder, which is shown in FIG. 1 and described in U.S. Pat. No. 5,253,269. Here, lags within the range of 21 to 22⅔ are allowed ⅓ sample resolution, lags within the range of 23 to 34⅚ are allowed ⅙ sample resolution, and so on. In order to keep the search complexity low, a combination of open-loop and closed-loop methods is used. The open-loop method involves generating an integer lag candidate list using an autocorrelation peak picking algorithm. The closed-loop method then searches the allowable lags in the neighborhood of the integer lag candidates for the optimal fractional lag value. Furthermore, the lags for sub-frames 2, 3, and 4 are coded based on the difference from the previous sub-frame. This allows the lag information to be coded using fewer bits, since there is a high intra-frame correlation of the lag parameter. Even so, the GSM HR codec uses a total of 8+(3×4)=20 bits every 20 ms (1.0 kbps) to convey the pitch period information.
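The open-loop stage can be illustrated with a minimal autocorrelation peak-picking sketch. The normalization, the peak test, and the number of candidates kept are assumptions made for this example; they are not taken from the GSM half rate specification or the cited patents, and the function name is hypothetical.

```python
import math


def open_loop_lag_candidates(signal, min_lag=20, max_lag=147, num_candidates=3):
    """Return integer lags at local peaks of the normalized autocorrelation,
    ordered from the strongest peak to the weakest."""
    n = len(signal)
    if n <= max_lag + 1:
        raise ValueError("signal is too short for the requested lag range")

    def normalized_corr(lag):
        num = sum(signal[i] * signal[i - lag] for i in range(lag, n))
        e_now = sum(signal[i] ** 2 for i in range(lag, n))
        e_past = sum(signal[i - lag] ** 2 for i in range(lag, n))
        denom = math.sqrt(e_now * e_past)
        return num / denom if denom > 0 else 0.0

    # Evaluate one lag beyond each end so the end points can be tested too.
    corr = {lag: normalized_corr(lag) for lag in range(min_lag - 1, max_lag + 2)}
    peaks = [
        lag
        for lag in range(min_lag, max_lag + 1)
        if corr[lag] > corr[lag - 1] and corr[lag] >= corr[lag + 1]
    ]
    peaks.sort(key=lambda lag: corr[lag], reverse=True)
    return peaks[:num_candidates]
```

For a periodic input, the list typically contains the true pitch period together with its integer multiples. This is one reason a subsequent closed-loop search around each candidate is useful.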
In an effort to reduce the bit rate of the pitch period information, an interpolation strategy was developed that allows the pitch information to be coded only once per frame (using only 7 bits per 20 ms frame, or 350 bps), rather than with the usual sub-frame resolution. This technique is known as relaxed CELP (or RCELP), and is the basis for the Enhanced Variable Rate Codec (EVRC) standard for Code Division Multiple Access (CDMA) wireless telephone systems. The basic principle is as follows.
The pitch period is estimated for the analysis window centered at the end of the current frame. A lag (pitch delay) contour is then generated by linearly interpolating from the past frame's lag to the current frame's lag. The linear prediction (LP) residual signal is then modified by means of sophisticated polyphase filtering and shifting techniques, which are designed to match the residual waveform to the estimated pitch delay contour. The primary reason for this residual modification process is to account for accuracy limitations of the open-loop integer lag estimation process. For example, if the integer lag is estimated to be 32 samples when in fact the true lag is 32.5 samples, the residual waveform can be in conflict with the estimated lag by as many as 2.5 samples in a single 160-sample frame (an error of 0.5 sample per pitch period, accumulated over the five pitch periods in the frame). This can severely degrade the performance of the LTP. The RCELP algorithm accounts for this by shifting the residual waveform at perceptually insignificant (i.e., low-energy) instants to match the estimated pitch delay contour. By modifying the residual waveform to match the estimated pitch delay contour, the effectiveness of the LTP is preserved, and the coding gain is maintained. In addition, the associated perceptual degradations due to the residual modification are claimed to be insignificant.
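The two quantities discussed above, the interpolated lag contour and the mismatch that builds up when the lag estimate is slightly off, can be illustrated with a short sketch. It assumes a simple per-sample linear interpolation and treats the mismatch as the per-period lag error multiplied by the number of pitch periods in the frame, which is one reading of the 2.5-sample figure. The function names are hypothetical, and the actual EVRC procedure is more elaborate.

```python
def lag_contour(prev_lag, curr_lag, frame_len=160):
    """Per-sample lag values moving linearly from prev_lag toward curr_lag,
    reaching curr_lag at the last sample of the frame."""
    step = curr_lag - prev_lag
    return [prev_lag + step * (n + 1) / frame_len for n in range(frame_len)]


def accumulated_mismatch(true_lag, estimated_lag, frame_len=160):
    """Misalignment in samples built up over one frame: the lag error per
    pitch period times the number of pitch periods in the frame."""
    periods_per_frame = frame_len / estimated_lag
    return abs(true_lag - estimated_lag) * periods_per_frame
```

With a true lag of 32.5 samples and an estimate of 32 samples, `accumulated_mismatch(32.5, 32)` evaluates to 2.5 for a 160-sample frame, matching the example in the text.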
A further improvement to processing of the pitch delay contour information has been proposed in U.S. Pat. No. 6,113,653, in which a method of adjusting the pitch delay contour at intervals of less than or equal to one block in length is disclosed. In this method, a small number of bits are used to code an adjustment of the pitch delay contour according to some error minimization criterion. The method describes techniques for pitch delay contour adjustment by minimization of an accumulated shift parameter, or by maximization of the cross correlation between the perceptually weighted input speech and the adaptive codebook (ACB) contribution passed through a perceptually weighted synthesis filter. Another well-known pitch delay adjustment criterion is the minimization of the perceptually weighted error energy between the target speech and the filtered ACB contribution.
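The cross-correlation and error-energy criteria are closely related, as the following brief derivation suggests. The symbols are introduced here for illustration only and do not come from the cited patent: x denotes the perceptually weighted target, y the filtered ACB contribution for a candidate delay adjustment, and g the ACB gain.

$$E = \lVert x - g\,y \rVert^2, \qquad g_{\mathrm{opt}} = \frac{x^T y}{y^T y}, \qquad E_{\min} = x^T x - \frac{(x^T y)^2}{y^T y}$$

Under the assumption that the gain is chosen optimally for each candidate, minimizing the weighted error energy over the candidates therefore amounts to maximizing the normalized cross-correlation term $(x^T y)^2 / (y^T y)$.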
While this method utilizes a very efficient technique for estimating and coding pitch delay contour adjustment information, the low bit rate has the consequence of constraining the resolution and/or dynamic range of the pitch delay adjustment parameters being coded. Therefore, a need exists for improving the performance of low bit rate long-term predictors by adaptively modifying the dynamic range and resolution of the predictor step-size, such that a higher long-term prediction gain is achieved for a given bit rate or, alternatively, a similar long-term prediction gain is achieved at a lower bit rate when compared to the prior art.