1. Field of the Invention
The present invention relates to a method for estimating a pitch period in a system for coding and decoding speech and/or audio signals.
2. Background Art
In the field of speech coding, nearly all speech codecs require an estimate of the pitch period. During the encoding process, most of the popular predictive speech coding schemes, such as Code-Excited Linear Prediction (CELP) and Multi-Pulse Linear Predictive Coding (MPLPC), exploit the long-term correlation between speech samples at the pitch period present during voiced speech. Transform-based speech coding schemes such as Sinusoidal Transform Coding (STC) typically analyze the speech in the frequency domain and extract a model-based set of parameters, often including the pitch, frequency content, phase information, and energy level. In all of these systems, obtaining an accurate estimate of the pitch is critical to performance.
In speech or audio coding, the coder converts an input signal into a compressed bit stream, usually partitioned into frames. These frames are either stored or transmitted, after which the decoder converts the compressed frames into an output audio signal. During storage or transmission, the frames may be corrupted, lost, or received too late for playback. If this occurs, the decoder must attempt to conceal the effects of the lost frame. Often the signal processing techniques employed involve extrapolating previously received waveforms to fill the gap left by the lost frame. If the previous signal is determined to be sufficiently periodic, the extrapolation is periodic. In this case, an accurate estimate of the pitch period in the previously buffered signal is required.
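By way of illustration only, periodic extrapolation of a buffered signal may be sketched as follows. The function name `periodic_extrapolate` and its parameters are hypothetical and do not appear in any particular codec. Practical concealment schemes typically add further processing, such as smoothing at the frame boundaries, which is omitted here.

```python
def periodic_extrapolate(history, pitch_period, num_samples):
    """Extend `history` by `num_samples` samples by repeating its last pitch cycle.

    Hypothetical helper: `history` is the previously decoded signal and
    `pitch_period` is the estimated pitch period in samples.
    """
    if not 0 < pitch_period <= len(history):
        raise ValueError("pitch_period must be between 1 and len(history)")
    buffer = list(history)
    for _ in range(num_samples):
        # Each new sample copies the sample one pitch period earlier,
        # so the last cycle repeats for as long as needed.
        buffer.append(buffer[-pitch_period])
    return buffer[len(history):]


# Example: a buffered signal with a period of 4 samples, extended by 6 samples.
concealed = periodic_extrapolate([0, 1, 2, 3, 0, 1, 2, 3], 4, 6)
```

If the pitch period supplied to such a routine is wrong, the repeated cycle no longer lines up with the preceding waveform, which is why the accuracy of the pitch estimate matters in this case.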
There are several known methods for pitch estimation. A common time-domain approach involves searching for the lag that yields the largest correlation or normalized correlation within the target pitch range. Frequency-domain approaches also exist, which involve identifying the peaks in the magnitude spectrum. Implemented without regard for complexity, these straightforward approaches can be computationally very expensive. A common way to reduce the complexity is to break up the pitch estimation into two steps. In the first step, a rough estimate of the pitch period is obtained, yielding a “coarse pitch”. In the second step, the coarse pitch is refined using more accurate signal processing techniques. A common first-step method is to decimate the signal and perform pitch estimation on the decimated signal. Because the decimated signal has reduced time resolution, the pitch period is then refined using the undecimated signal, with the search range constrained to the vicinity of the coarse pitch.
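The two-step approach described above may be sketched as follows, as one possible arrangement under stated assumptions rather than a definitive implementation. All function names are hypothetical. Block averaging stands in for a proper decimation filter, and the refinement range of plus or minus (factor - 1) samples about the coarse pitch is an assumed choice.

```python
import math


def normalized_correlation(x, lag, start, length):
    """Normalized correlation of x[start:start+length] with the same window `lag` samples earlier."""
    num = e_cur = e_past = 0.0
    for n in range(start, start + length):
        num += x[n] * x[n - lag]
        e_cur += x[n] * x[n]
        e_past += x[n - lag] * x[n - lag]
    den = math.sqrt(e_cur * e_past)
    return num / den if den > 0.0 else 0.0


def best_lag(x, min_lag, max_lag, window):
    """Return the lag in [min_lag, max_lag] that maximizes the normalized correlation
    over the last `window` samples of x (requires len(x) >= window + max_lag)."""
    start = len(x) - window
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: normalized_correlation(x, lag, start, window))


def decimate(x, factor):
    """Crude decimation (assumption): average each block of `factor` samples."""
    return [sum(x[i:i + factor]) / factor
            for i in range(0, len(x) - factor + 1, factor)]


def two_step_pitch(x, min_lag, max_lag, window, factor):
    # Step 1: coarse pitch from the decimated signal, scaled back to full-rate samples.
    xd = decimate(x, factor)
    coarse = factor * best_lag(xd, max(1, min_lag // factor),
                               max_lag // factor, max(1, window // factor))
    # Step 2: refine on the undecimated signal, searching only near the coarse pitch.
    lo = max(min_lag, coarse - (factor - 1))
    hi = min(max_lag, coarse + (factor - 1))
    return best_lag(x, lo, hi, window)


# Example: a synthetic tone with a period of 57 samples, decimation factor 4.
signal = [math.sin(2.0 * math.pi * n / 57.0) for n in range(800)]
pitch = two_step_pitch(signal, min_lag=20, max_lag=100, window=160, factor=4)
```

In this sketch the coarse search examines only the decimated lags, and the refinement examines at most 2 × (factor - 1) + 1 full-rate lags instead of the entire pitch range.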
In some applications, the pitch estimate computed in one time frame is used as an initial estimate of the pitch period in the adjacent time frame. This initial estimate is then refined within a limited search range. This technique takes advantage of the approximate short-term stationarity of speech signals, and is common in speech/audio coding systems that segment the speech frame into smaller frames or subframes. In this case, the pitch is estimated within the frame and used as a basis for pitch estimates in subsequent subframes.
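A minimal sketch of this subframe technique follows, assuming a normalized-correlation criterion and a fixed maximum deviation per subframe. The names `refine_near` and `track_subframes`, and the parameter `max_deviation`, are hypothetical.

```python
import math


def normalized_correlation(x, end, length, lag):
    """Normalized correlation of x[end-length:end] with the same segment `lag` samples earlier."""
    num = e_cur = e_past = 0.0
    for n in range(end - length, end):
        num += x[n] * x[n - lag]
        e_cur += x[n] * x[n]
        e_past += x[n - lag] * x[n - lag]
    den = math.sqrt(e_cur * e_past)
    return num / den if den > 0.0 else 0.0


def refine_near(x, end, length, previous_pitch, max_deviation, min_lag, max_lag):
    """Search only the lags within `max_deviation` samples of `previous_pitch`."""
    lo = max(min_lag, previous_pitch - max_deviation)
    hi = min(max_lag, previous_pitch + max_deviation)
    return max(range(lo, hi + 1),
               key=lambda lag: normalized_correlation(x, end, length, lag))


def track_subframes(x, frame_pitch, subframe_ends, length,
                    max_deviation, min_lag, max_lag):
    """Use the frame-level pitch as the starting point, then let each subframe's
    refined pitch seed the search in the following subframe."""
    pitch, track = frame_pitch, []
    for end in subframe_ends:
        pitch = refine_near(x, end, length, pitch, max_deviation, min_lag, max_lag)
        track.append(pitch)
    return track


# Example: the frame-level estimate (48) is slightly off from the true period (50).
signal = [math.sin(2.0 * math.pi * n / 50.0) for n in range(400)]
track = track_subframes(signal, 48, [320, 360, 400], 40, 3, 20, 140)
```

Each subframe here evaluates at most 2 × `max_deviation` + 1 candidate lags, so the cost of the search grows directly with the deviation that must be tolerated.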
The method of refining the estimated pitch period based on a coarse estimate is a very common and successful approach to reducing the complexity of pitch estimation. However, the pitch refinement step may itself present a significant complexity load, depending on the accuracy of the coarse pitch. Where the coarse pitch estimate is obtained by signal decimation, the decimation factor determines the time resolution of the pitch estimate and, hence, the range of refinement required. Where a pitch estimate computed in one time frame is used as the basis for pitch refinement in another time frame, the time separation between the original estimate and the pitch refinement determines the range of refinement. The further apart the frames are, the wider the refinement range must be to account for deviation of the pitch track.
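The trade-off can be illustrated with a rough operation count. The cost model below is an assumption made for illustration only: it counts one multiply-accumulate per sample per candidate lag, ignores the cost of the decimation filter and of energy normalization, and assumes a refinement range of plus or minus (factor - 1) samples. The function names are hypothetical.

```python
def full_search_cost(min_lag, max_lag, window):
    """Multiply-accumulates for an exhaustive full-rate correlation search."""
    return (max_lag - min_lag + 1) * window


def two_step_cost(min_lag, max_lag, window, factor):
    """Return (coarse_cost, refine_cost) for a decimate-then-refine search."""
    coarse_lags = max_lag // factor - min_lag // factor + 1
    coarse_cost = coarse_lags * (window // factor)
    refine_lags = 2 * (factor - 1) + 1
    refine_cost = refine_lags * window
    return coarse_cost, refine_cost


# Example: lags 20..140 and a 160-sample window.
full = full_search_cost(20, 140, 160)        # 121 lags * 160 samples
by_4 = two_step_cost(20, 140, 160, 4)        # decimation factor 4
by_8 = two_step_cost(20, 140, 160, 8)        # decimation factor 8
```

Under these assumptions, increasing the decimation factor from 4 to 8 reduces the coarse search cost but more than doubles the refinement cost, so that the refinement step dominates the total.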