Pitch estimation in speech processing can be used to distinguish between voiced and unvoiced speech segments and to represent the tone of voiced speech. Since voiced speech can be approximated using a periodic signal, pitch may be estimated by measuring the signal period or its inverse, which is referred to as the fundamental frequency or pitch frequency. Where a periodic signal cannot be used to approximate a speech segment, the speech segment may be designated as unvoiced.
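The period-based view above can be illustrated with one common time-domain technique, normalized autocorrelation. The sketch below shows that generic method, not one described in this document. The pitch search range (50 to 400 Hz), the voicing threshold of 0.5, and the 0.9 peak-selection ratio are illustrative assumptions. A frame whose best normalized correlation stays below the threshold is treated as unvoiced. This mirrors the rule that segments a periodic signal cannot approximate are designated unvoiced.

```python
import math
import random

def autocorr_pitch(x, fs, f_min=50.0, f_max=400.0, voicing_threshold=0.5):
    """Generic time-domain pitch sketch based on normalized autocorrelation.

    Returns (is_voiced, f0_hz). f0_hz is None for unvoiced frames.
    f_min, f_max and voicing_threshold are illustrative assumptions.
    """
    lag_min = int(fs / f_max)   # shortest candidate period, in samples
    lag_max = int(fs / f_min)   # longest candidate period, in samples
    n = len(x)
    if lag_max >= n:
        raise ValueError("frame too short for the requested pitch range")
    r = {}
    for lag in range(lag_min, lag_max + 1):
        a, b = x[: n - lag], x[lag:]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        r[lag] = num / den if den > 0 else 0.0
    best = max(r.values())
    if best < voicing_threshold:
        return False, None  # no convincing periodicity: treat as unvoiced
    # Prefer the shortest lag that is a local peak close to the best score,
    # so that multiples of the period (2T, 3T, ...) are not reported.
    for lag in range(lag_min + 1, lag_max):
        if r[lag] >= r[lag - 1] and r[lag] >= r[lag + 1] and r[lag] >= 0.9 * best:
            return True, fs / lag
    return True, fs / max(r, key=r.get)

if __name__ == "__main__":
    fs = 8000
    # Synthetic periodic frame: 200 Hz fundamental plus one harmonic (period = 40 samples).
    voiced = [math.sin(2 * math.pi * 200 * n / fs)
              + 0.5 * math.sin(2 * math.pi * 400 * n / fs + 0.3) for n in range(800)]
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in range(800)]  # aperiodic frame
    print(autocorr_pitch(voiced, fs))
    print(autocorr_pitch(noise, fs))
```

The code picks the shortest qualifying peak rather than the global maximum. An exactly periodic frame correlates almost equally well at every multiple of its period, so this choice guards against reporting 2T or 3T.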
A variety of techniques have been developed for pitch estimation in both the time domain and the frequency domain. Both time-domain and frequency-domain methods of pitch determination are subject to instability and error, and accurate pitch determination is computationally intensive. However, frequency-domain methods are generally more tolerant of deviations of real speech data from an exact periodic model.
The Fourier transform of a periodic signal, such as voiced speech, has the form of a train of impulses, or peaks, in the frequency domain. This impulse train corresponds to the line spectrum of the signal, which can be represented as a sequence $\{(a_i, \theta_i)\}$, where $\theta_i$ are the frequencies of the peaks and $a_i$ are the respective complex-valued line spectral amplitudes. To determine whether a given segment of a speech signal is voiced or unvoiced, and to calculate the pitch if the segment is voiced, the time-domain signal is first multiplied by a finite smooth window. The Fourier transform of the windowed signal is then given by
$$X(\theta) = \sum_k a_k \, W(\theta - \theta_k),$$
where $W(\theta)$ is the Fourier transform of the window. Frequency-domain pitch estimation is typically based on analyzing the locations and amplitudes of the peaks in the transformed signal $X(\theta)$.
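The relation X(θ) = Σ_k a_k W(θ − θ_k) can be checked numerically. The following is a minimal Python sketch built on several assumptions. It uses a symmetric Hann window as the finite smooth window and measures frequencies θ in radians per sample. The three-line spectrum is hypothetical and uses complex exponentials only, whereas a real signal would also carry conjugate lines at negative frequencies. The helper names `dtft`, `hann` and `spectral_peaks` are illustrative. The sketch evaluates the windowed transform directly, compares it with the sum of shifted window transforms, and then locates the peaks of |X(θ)|.

```python
import cmath
import math

def dtft(samples, theta):
    """Evaluate sum_n samples[n] * exp(-1j * theta * n) at one angular frequency (rad/sample)."""
    return sum(s * cmath.exp(-1j * theta * n) for n, s in enumerate(samples))

def hann(length):
    """Symmetric Hann window, standing in for the 'finite smooth window'."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)]

N = 256
w = hann(N)
# Hypothetical line spectrum {(a_k, theta_k)}: harmonics of theta_0 = 0.2 rad/sample
# with arbitrary complex amplitudes.
lines = [(1.0, 0.2), (0.6 * cmath.exp(0.5j), 0.4), (0.3j, 0.6)]
x = [sum(a * cmath.exp(1j * th * n) for a, th in lines) for n in range(N)]
xw = [wn * xn for wn, xn in zip(w, x)]  # time-domain windowing

def X(theta):
    """Fourier transform of the windowed signal, computed directly."""
    return dtft(xw, theta)

def X_model(theta):
    """Right-hand side of the relation: sum_k a_k * W(theta - theta_k)."""
    return sum(a * dtft(w, theta - th) for a, th in lines)

def spectral_peaks(theta_grid, rel_floor=0.1):
    """Local maxima of |X| on a grid that exceed rel_floor times the global maximum."""
    mags = [abs(X(t)) for t in theta_grid]
    top = max(mags)
    return [theta_grid[i] for i in range(1, len(mags) - 1)
            if mags[i] >= mags[i - 1] and mags[i] >= mags[i + 1]
            and mags[i] >= rel_floor * top]

if __name__ == "__main__":
    for t in (0.0, 0.2, 0.37, 0.6, 1.5):
        print(f"theta={t:4.2f}  |X|={abs(X(t)):9.4f}  |X_model|={abs(X_model(t)):9.4f}")
    grid = [i / 500 for i in range(401)]  # 0 .. 0.8 rad/sample
    print("peaks near:", spectral_peaks(grid))
```

The two columns agree to rounding error, and the peaks of |X(θ)| fall at the line frequencies θ_k. Each peak's height is roughly |a_k| W(0).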
Given a pitch frequency, the corresponding line spectrum can contain components only at integer multiples of that frequency. It follows that every frequency appearing in the line spectrum should be an integer multiple of the pitch frequency. In principle, therefore, the pitch frequency could be found as the largest frequency of which every spectral peak frequency is an integer multiple, that is, the greatest common divisor of the peak frequencies. However, background noise and other deviations from the periodic model cause spectral peaks to shift away from their exact prescribed locations. They also cause spurious spectral peaks to appear at unpredictable locations.
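A short sketch illustrates both the idealized divisor rule and its fragility. The first function assumes peak frequencies that are exact integers in Hz and takes their greatest common divisor. The second, `tolerant_pitch`, is a hypothetical tolerance-based variant whose search range, step and tolerance are all assumptions. It scans candidate pitch values from high to low and returns the first one that places every peak within the tolerance of an integer multiple. It then refines that value by least squares over the matched harmonic numbers.

```python
import math
from functools import reduce

def exact_pitch(peak_freqs_hz):
    """Idealized rule: pitch = greatest common divisor of integer peak frequencies (Hz)."""
    return reduce(math.gcd, peak_freqs_hz)

def tolerant_pitch(peaks, f_lo=60.0, f_hi=400.0, step=0.5, tol=5.0):
    """Hypothetical tolerant divisor search (all parameters are assumptions).

    Scans f0 downward from f_hi and stops at the first candidate for which every
    peak lies within tol Hz of an integer multiple k_i * f0. Then returns the
    least-squares pitch sum(k_i * p_i) / sum(k_i ** 2). Returns None if no
    candidate in [f_lo, f_hi] explains all peaks.
    """
    n_steps = int(round((f_hi - f_lo) / step)) + 1
    for i in range(n_steps):
        f0 = f_hi - i * step
        ks = [max(1, round(p / f0)) for p in peaks]  # nearest harmonic numbers
        if all(abs(p - k * f0) <= tol for p, k in zip(peaks, ks)):
            return sum(k * p for k, p in zip(ks, peaks)) / sum(k * k for k in ks)
    return None

if __name__ == "__main__":
    print(exact_pitch([200, 400, 600]))                 # ideal harmonic peaks
    print(exact_pitch([201, 399, 602]))                 # slightly shifted peaks
    print(tolerant_pitch([201.0, 399.0, 602.0]))        # shifted peaks
    print(tolerant_pitch([201.0, 399.0, 530.0, 602.0])) # plus one spurious peak
```

The ideal peaks at 200, 400 and 600 Hz give 200 Hz. Shifting them slightly to 201, 399 and 602 Hz makes the exact GCD collapse to 1 Hz, while the tolerant search still returns about 200.4 Hz. Adding a single spurious peak at 530 Hz drives the tolerant search down to about 66.6 Hz, roughly a third of the true pitch, because only a subharmonic can account for the extra line.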
It follows from the periodic model that a change in pitch frequency produces relatively small shifts in the locations of low-frequency spectral lines and relatively large shifts in the locations of high-frequency spectral lines. This is because the k-th harmonic lies at k times the pitch frequency and therefore moves k times as far. Consequently, low-frequency spectral peaks have a greater influence on pitch estimation than high-frequency spectral peaks do. For this reason, the accuracy of frequency-domain pitch estimation deteriorates significantly in the presence of low-frequency band noise. Such noise is often present in the passenger compartment of a moving or idling automobile, which severely limits the applicability of known frequency-domain pitch estimation methods in mobile environments.
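The sensitivity argument follows directly from the harmonic model f_k = k f_0. Here is a worked instance, assuming for illustration a 200 Hz pitch that rises by 1% to 202 Hz:

```latex
\begin{aligned}
f_k &= k\,f_0 \quad\Longrightarrow\quad \Delta f_k = k\,\Delta f_0, \\
f_0 &: 200 \to 202\ \text{Hz}, \qquad \Delta f_0 = 2\ \text{Hz}, \\
k = 1 &: \quad f_1 = 200 \to 202\ \text{Hz}, \qquad \Delta f_1 = 2\ \text{Hz}, \\
k = 15 &: \quad f_{15} = 3000 \to 3030\ \text{Hz}, \qquad \Delta f_{15} = 30\ \text{Hz}.
\end{aligned}
```

The same 1% pitch change moves the fundamental by only 2 Hz but moves the 15th harmonic by 30 Hz. The locations of high-frequency lines are therefore far more sensitive to the pitch value than those of low-frequency lines.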