1. Field of the Invention
The present invention relates generally to speech coding and, more particularly, to pitch correlation of voiced speech.
2. Related Art
From time immemorial, it has been desirable to communicate between a speaker at one point and a listener at another point, which has led to the invention of various telecommunication systems. The audible range (i.e., the frequency range) that can be transmitted and faithfully reproduced depends on the medium of transmission and other factors. Generally, a speech signal can be band-limited to about 10 kHz without affecting its perception. However, in telecommunications, the speech signal bandwidth is usually limited much more severely. For instance, the telephone network limits the bandwidth of the speech signal to between 300 Hz and 3400 Hz, which is known in the art as the “narrowband”. Such band-limitation results in the characteristic sound of telephone speech. Both the lower limit at 300 Hz and the upper limit at 3400 Hz affect the speech quality.
In most digital speech coders, the speech signal is sampled at 8 kHz, resulting in a maximum signal bandwidth of 4 kHz. In practice, however, the signal is usually band-limited to about 3600 Hz at the high end. At the low end, the cut-off frequency is usually between 50 Hz and 200 Hz. The narrowband speech signal, which requires a sampling frequency of 8 kHz, provides a speech quality referred to as toll quality. Although this toll quality is sufficient for telephone communications, an improved quality is necessary for emerging applications such as teleconferencing, multimedia services and high-definition television.
The communications quality can be improved for such applications by increasing the bandwidth. For example, by increasing the sampling frequency to 16 kHz, a wider bandwidth, ranging from 50 Hz to about 7000 Hz, can be accommodated. This bandwidth range is referred to as the “wideband”. Extending the lower frequency range to 50 Hz increases naturalness, presence and comfort. At the other end of the spectrum, extending the higher frequency range to 7000 Hz increases intelligibility and makes it easier to differentiate between fricative sounds.
Digitally, speech is synthesized by various well-known methods. One popular method is the Analysis-By-Synthesis (ABS) method, which is also referred to as the closed-loop approach or the waveform-matching approach. It offers relatively better speech coding quality than other approaches for medium to high bit rates. One ABS approach is the so-called Code Excited Linear Prediction (CELP) method. In CELP coding, speech is synthesized by using encoded excitation information to excite a linear predictive coding (LPC) filter. The output of the LPC filter is compared against the voiced speech and used to adjust the filter parameters in a closed-loop sense until the best parameters, based upon the least error, are found.
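By way of illustration only, the closed-loop idea described above may be sketched as follows. The sketch assumes a fixed all-pole LPC filter of the form 1/A(z) with A(z) = 1 + a1 z^-1 + ... + ap z^-p, an unweighted squared-error measure, and a small list of candidate excitation vectors; the names lpc_synthesize and search_codebook are hypothetical and are not taken from any particular coder. A practical CELP coder would typically also apply perceptual weighting and gain terms, which are omitted here.

```python
def lpc_synthesize(excitation, lpc_coeffs):
    """Pass an excitation through an all-pole filter 1/A(z).

    Assumed convention: s[n] = e[n] - sum_k a[k] * s[n - k], k = 1..p,
    with zero initial filter state.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, lpc_coeffs):
    """Return (index, error) of the candidate excitation whose synthesized
    output has the least squared error against the target speech."""
    best_index, best_error = -1, float("inf")
    for index, excitation in enumerate(codebook):
        synth = lpc_synthesize(excitation, lpc_coeffs)
        error = sum((t - s) ** 2 for t, s in zip(target, synth))
        if error < best_error:
            best_index, best_error = index, error
    return best_index, best_error
```

In this sketch each candidate is synthesized and compared against the target, and the candidate giving the least error is retained, which is the sense in which the approach is "closed-loop".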
Pitch lag is one of the most important parameters for voiced speech, because the perceptual quality is very sensitive to pitch lag. CELP speech coding approaches rely on determination of the open-loop pitch to help minimize the weighted errors in the closed-loop speech coding process. Open-loop pitch is usually determined using normalized pitch correlation on a weighted speech signal. With this approach, it is desirable to maximize the correlation between a windowed reference signal and a candidate signal. The correlation window size is traditionally limited in order to obtain a good local pitch lag, a reliable determination of small pitch lags, and acceptable complexity. However, because voiced speech is not purely periodic, this approach may fail when the local pitch lag is larger than the window size and/or when an energy peak is not located within the window.
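For illustration, a conventional open-loop pitch search of the kind described above may be sketched as follows. The sketch assumes the common normalized correlation measure, namely the cross-correlation between the windowed reference segment and the segment delayed by a candidate lag, divided by the square root of the product of the two segment energies. The function names, the lag range and the window parameters are hypothetical, and the input is assumed to be an already weighted speech signal with enough past samples available (start >= max_lag).

```python
import math


def normalized_pitch_correlation(signal, start, window, lag):
    """Normalized correlation between signal[start:start+window] (reference)
    and the same window delayed by `lag` samples (candidate)."""
    cross = 0.0
    energy_ref = 0.0
    energy_cand = 0.0
    for n in range(start, start + window):
        ref = signal[n]
        cand = signal[n - lag]
        cross += ref * cand
        energy_ref += ref * ref
        energy_cand += cand * cand
    denom = math.sqrt(energy_ref * energy_cand)
    # Guard against silent segments, where the measure is undefined.
    return cross / denom if denom > 0.0 else 0.0


def open_loop_pitch(signal, start, window, min_lag, max_lag):
    """Return (lag, correlation) for the candidate lag in [min_lag, max_lag]
    that maximizes the normalized correlation."""
    best_lag, best_corr = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        corr = normalized_pitch_correlation(signal, start, window, lag)
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag, best_corr
```

Because the window length is fixed in this sketch, the correlation is computed over the same number of samples for every candidate lag, which illustrates how a window shorter than the true pitch lag, or one that misses the energy peak, can yield an unreliable maximum.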
The present invention addresses the issues identified above regarding pitch lag determination.