Fundamental frequency (F0) estimation, also referred to as pitch detection, is an important component in a variety of speech processing systems, especially in the context of speech recognition, synthesis, and coding. The basic problem in fundamental frequency (F0) estimation is extraction of the fundamental frequency (F0) from a sound signal. The fundamental frequency (F0) is usually the lowest frequency component, or partial, which relates well to most of the other partials. In a periodic waveform, most partials are harmonically related, meaning that the frequencies of most of the partials are related to the frequency of the lowest partial by a small whole-number ratio. The frequency of this lowest partial is the fundamental frequency (F0) of the waveform.
Approaches to fundamental frequency (F0) estimation typically fall in one of three broad categories: approaches that principally utilize the time-domain properties of an audio signal, approaches that principally utilize the frequency-domain properties of an audio signal, and approaches that utilize both the frequency-domain and time-domain properties. In general, time-domain approaches operate directly on the audio waveform to estimate the pitch period. Peak and valley measurements, zero-crossing measurements, and autocorrelation measurements are the measurements most commonly used in the time-domain approaches. The basic assumption underlying these measurements is that simple time-domain measurements will provide good estimates of the period if a quasi-periodic signal is suitably processed to minimize the effects of the formant structure.
In general, frequency-domain approaches are based on the property that when an audio signal is periodic in the time domain, the frequency spectrum of the signal will consist of a series of impulses at the fundamental frequency and its harmonics. Thus, simple measurements can be made on the frequency spectrum of the signal (or a nonlinearly transformed version of the signal) to estimate the period of the signal. Further, approaches based on frequency-domain processing perform relatively well for non-speech audio signals as such signals do not require spectral flattening. However, these approaches are easily influenced by the presence of low-energy tonal components that are difficult to separate in the frequency domain. Moreover, they may be computation intensive.
The hybrid approaches may incorporate features of both time-domain and frequency-domain approaches. For example, a hybrid approach may use frequency-domain techniques to provide a spectrally flattened time waveform and then apply autocorrelation measurements to estimate the pitch period. More specifically, the autocorrelation of a spectrum-flattened signal is calculated where spectral flattening may be performed using, for example, cepstrum or LPC analysis, or by means of non-linear processing. The peaks of the autocorrelation function are separated by an amount that is approximately equal to the fundamental period. Hybrid approaches perform relatively well for clean speech but performance tends to degrade for speech corrupted by noise, speech mixed with music, and most types of music signals.