Speech sounds are produced by modulating air flow in the speech tract. Voiceless sounds originate from turbulent noise created at a constriction somewhere in the vocal tract, while voiced sounds are excited in the larynx by periodic vibrations of the vocal cords. Roughly speaking, the variable period of the laryngeal vibrations gives rise to the pitch of the speech sounds. Low-bit-rate speech coding schemes typically separate the modulation from the speech source (voiced or unvoiced), and code these two elements separately. In order to enable the speech to be properly reconstructed, it is necessary to accurately estimate the pitch of the voiced parts of the speech at the time of coding. A variety of techniques have been developed for this purpose, including both time- and frequency-domain methods.
The Fourier transform of a periodic signal, such as voiced speech, has the form of a train of impulses, or peaks, in the frequency domain. This impulse train corresponds to the line spectrum of the signal, which can be represented as a sequence {(ai, θi)}, wherein θi are the frequencies of the peaks, and ai are the respective complex-valued line spectral amplitudes. To determine whether a given segment of a speech signal is voiced or unvoiced, and to calculate the pitch if the segment is voiced, the time-domain signal is first multiplied by a finite smooth window. The Fourier transform of the windowed signal is then given by:
                              X          ⁡                      (            θ            )                          =                              ∑            k                    ⁢                                    a              k                        ⁢                          W              ⁡                              (                                  θ                  -                                      θ                    k                                                  )                                                                        EQ        .                                  ⁢        1            wherein W(θ) is the Fourier transform of the window.
Given any pitch frequency, the line spectrum corresponding to that pitch frequency could contain line spectral components at all multiples of that frequency. It therefore follows that any frequency appearing in the line spectrum may be a multiple of a number of different candidate pitch frequencies. Consequently, for any peak appearing in the transformed signal, there will be a sequence of candidate pitch frequencies that could give rise to that particular peak, wherein each of the candidate frequencies is an integer dividend of the frequency of the peak. This ambiguity is present whether the spectrum is analyzed in the frequency domain, or whether it is transformed back to the time domain for further analysis.
Frequency-domain pitch estimation is typically based on analyzing the locations and amplitudes of the peaks in the transformed signal X(θ), such as by correlating the spectrum with the “teeth” of a prototypical spectral “comb.” The pitch frequency is given by the comb frequency that maximizes the correlation of the comb function with the transformed speech signal.
A related class of schemes for pitch estimation are known as “cepstral” schemes, where a log operation is applied to the frequency spectrum of the speech signal, and the log spectrum is then transformed back to the time domain to generate the cepstral signal. The pitch frequency is the location of the first peak of the time-domain cepstral signal. This corresponds precisely to maximizing over the period T, the correlation of the log of the amplitudes corresponding to the line frequencies z(i) with cos(ω(i)T). For each guess of the pitch period T, the function cos(ωT) is a periodic function of ω. It has peaks at frequencies corresponding to multiples of the pitch frequency 1/T. If those peaks happen to coincide with the line frequencies, then 1/T is a good candidate to be the pitch frequency, or some multiple thereof.
A common method for time-domain pitch estimation uses correlation-type schemes, which search for a pitch period T that maximizes the cross-correlation of a signal segment centered at time t and one centered at time t−T. The pitch frequency is the inverse of T.
Both time- and frequency-domain methods of pitch determination are subject to instability and error, and accurate pitch determination is therefore computationally intensive. In time domain analysis, for example, a high-frequency component in the line spectrum results in the addition of an oscillatory term in the cross-correlation. This term varies rapidly with the estimated pitch period T when the frequency of the component is high. In such a case, even a slight deviation of T from the true pitch period will reduce the value of the cross-correlation substantially and may lead to rejection of a correct estimate. A high-frequency component will also add a large number of peaks to the cross-correlation, which complicate the search for the true maximum. In the frequency domain, a small error in the estimation of a candidate pitch frequency will result in a major deviation in the estimated value of any spectral component that is a large integer multiple of the candidate frequency.
With currently known techniques, an exhaustive search with high resolution must be made over all possible candidates and their multiples in order to avoid missing the best candidate pitch for a given input spectrum. It is often necessary, dependent on the actual pitch frequency, to search the sampled spectrum up to high frequencies, such as above 1500 Hz. At the same time, the analysis interval, or window, must be long enough in time to capture at least several cycles of every conceivable pitch candidate in the spectrum, resulting in an additional increase in complexity. Analogously, in the time domain, the optimal pitch period T must be searched for over a wide range of times and with high resolution. The search in either case consumes substantial computing resources. The search criteria cannot be relaxed even during intervals that may be unvoiced, since an interval can be judged unvoiced only after all candidate pitch frequencies or periods have been ruled out. Although pitch values from previous frames are commonly used in guiding the search for the current value, the search cannot be limited to the neighborhood of the previous pitch. Otherwise, errors in one interval will be perpetuated in subsequent intervals, and voiced segments may be confused for unvoiced.