Pitch is the fundamental frequency of a speech signal. It is one of the key parameters in speech coding and processing. Applications making use of pitch detection include speech enhancement, automatic speech recognition and understanding, analysis and modeling of prosody, as well as speech coding, in particular low bit-rate speech coding. The reliability of the pitch detection is often a decisive factor for the output quality of the overall system.
Typically, speech codecs process speech in segments of 10-30 ms, which are referred to as frames. For various purposes, frames are often further divided into shorter segments of 5-10 ms, which are referred to as subframes.
The pitch is directly related to the pitch lag, which is the cycle duration of a signal at the fundamental frequency. The pitch lag can be determined, for example, by applying autocorrelation computations to a segment of an audio signal. In these autocorrelation computations, samples of the original audio signal segment are multiplied with aligned samples of the same audio signal segment that has been delayed by a respective amount. The sum of the products obtained for a specific delay is the correlation value for that delay. The delay that yields the highest correlation value corresponds to the pitch lag. The pitch lag is also referred to as pitch delay.
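The autocorrelation search described above can be illustrated with a minimal sketch. The function names, the 8 kHz sampling rate, the 240-sample segment, and the delay range of 20 to 143 samples are assumptions chosen for illustration only and are not taken from any particular codec.

```python
import math

def correlation_value(segment, delay):
    """Sum of products of the segment with a copy of itself delayed by `delay` samples."""
    return sum(segment[n] * segment[n - delay] for n in range(delay, len(segment)))

def estimate_pitch_lag(segment, min_lag, max_lag):
    """Return the delay in [min_lag, max_lag] that yields the highest correlation value."""
    return max(range(min_lag, max_lag + 1),
               key=lambda delay: correlation_value(segment, delay))

# Assumed test signal: a 100 Hz sine sampled at 8 kHz (one cycle = 80 samples).
sample_rate = 8000
segment = [math.sin(2 * math.pi * 100.0 * n / sample_rate) for n in range(240)]

lag = estimate_pitch_lag(segment, min_lag=20, max_lag=143)
pitch_hz = sample_rate / lag  # pitch is the sampling rate divided by the pitch lag
```

In this sketch the number of products shrinks as the delay grows, which biases the raw sums towards shorter delays. Practical systems commonly normalize the correlation values or use a fixed number of products per delay.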
Before the highest correlation value is determined, the correlation values may be pre-processed to increase the accuracy of the result. A range of considered delays may also be divided into sections, and correlation values may be determined for delays in all or some of these sections. The autocorrelation computations may differ between the sections for instance in the number of samples that are considered. Further, the sectioning may be exploited in a pre-processing that is applied to the correlation values before the highest correlation value is determined.
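As a sketch of how such sectioning might look, the following example divides the delay range into three sections, uses a different number of samples for the autocorrelation in each section, and applies a per-section weight as a simple form of pre-processing before the highest value is selected. The section boundaries, window lengths, weights, and function names are hypothetical; the weighting that favors shorter delays is merely one conceivable pre-processing, chosen here because it reduces the risk of selecting a multiple of the true pitch lag.

```python
import math

# Hypothetical sections: (min_lag, max_lag, samples used in the sum, weight).
SECTIONS = [
    (20, 39, 80, 1.00),
    (40, 79, 160, 0.95),
    (80, 143, 240, 0.90),
]

def normalized_correlation(signal, start, delay, length):
    """Correlate signal[start:start+length] with the same span delayed by `delay`."""
    current = signal[start:start + length]
    delayed = signal[start - delay:start - delay + length]
    numerator = sum(a * b for a, b in zip(current, delayed))
    energy = sum(a * a for a in current) * sum(b * b for b in delayed)
    return numerator / math.sqrt(energy) if energy > 0 else 0.0

def weighted_correlations(signal, start):
    """Pre-processed correlation value for every delay in every section."""
    values = {}
    for min_lag, max_lag, length, weight in SECTIONS:
        for delay in range(min_lag, max_lag + 1):
            values[delay] = weight * normalized_correlation(signal, start, delay, length)
    return values

def estimate_pitch_lag_sectioned(signal, start):
    """Delay with the highest pre-processed correlation value."""
    values = weighted_correlations(signal, start)
    return max(values, key=values.get)

# Assumed test signal: decaying pulses repeating every 50 samples, so that both
# delay 50 and its multiple 100 correlate perfectly before weighting.
signal = [0.9 ** (n % 50) for n in range(400)]
lag = estimate_pitch_lag_sectioned(signal, start=143)
```

Because the delays 50 and 100 fall into different sections, the lower weight of the last section causes the shorter delay to be selected.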
A pitch track is a sequence of determined pitch lags for a sequence of segments of an audio signal.
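A pitch track can be sketched by repeating a pitch lag search for consecutive frames. The following example is an illustration under assumed parameters (frames of 160 samples, delays of 20 to 143 samples, a synthetic signal whose cycle duration changes from 80 to 100 samples); the function names are hypothetical.

```python
import math

def normalized_correlation(signal, start, delay, length):
    """Correlate signal[start:start+length] with the same span delayed by `delay`."""
    current = signal[start:start + length]
    delayed = signal[start - delay:start - delay + length]
    numerator = sum(a * b for a, b in zip(current, delayed))
    energy = sum(a * a for a in current) * sum(b * b for b in delayed)
    return numerator / math.sqrt(energy) if energy > 0 else 0.0

def pitch_track(signal, frame_len, min_lag, max_lag):
    """Return one pitch lag per frame; the first max_lag samples serve as history."""
    track = []
    for start in range(max_lag, len(signal) - frame_len + 1, frame_len):
        lag = max(range(min_lag, max_lag + 1),
                  key=lambda d: normalized_correlation(signal, start, d, frame_len))
        track.append(lag)
    return track

def synthetic_signal(n_samples, change_at, period_a, period_b):
    """Decaying pulses with period_a before sample change_at and period_b after it."""
    return [0.9 ** (n % period_a) if n < change_at
            else 0.9 ** ((n - change_at) % period_b)
            for n in range(n_samples)]

signal = synthetic_signal(n_samples=1103, change_at=600, period_a=80, period_b=100)
track = pitch_track(signal, frame_len=160, min_lag=20, max_lag=143)
```

For this signal the track contains six pitch lags. The frames lying entirely before the change yield a lag of 80 and the frames lying entirely after it yield a lag of 100, while the frames around the change may yield less reliable values, which illustrates why the stability of a pitch track is a concern.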
The framework of the audio processing system that is employed sets the requirements for the pitch detection. Especially for conversational speech coding solutions, the complexity and delay requirements are often quite strict. Moreover, the accuracy of the pitch estimates and the stability of the pitch track are important issues in many audio processing systems.
Accurate pitch estimation is a difficult task. While a pitch detection approach of low complexity may be able to provide generally very reliable pitch estimates, it often fails to maintain a stable pitch track. Very effective pitch estimation can be achieved with complex approaches, but these often produce pitch tracks that are not quite optimal for the framework in use and/or introduce too much delay for conversational applications.