A problem frequently encountered in many signal processing applications is to determine whether a portion of a signal is periodic or aperiodic and, if it is found to be periodic, to measure the period length. This task is particularly important in processing acoustic signals, such as human speech or music. For such signals, the term “pitch” is used to refer to the fundamental frequency of a periodic or quasi-periodic signal. The fundamental frequency may be, e.g., a frequency that may be perceived as a distinct tone by the human auditory system.
Although human pitch perception by itself is an auditory phenomenon, it generally correlates very well with a measured fundamental frequency of a signal. Fundamental frequency, or F0, is defined as the inverse of the fundamental period for some portion of a signal.
Pitch in human speech is manifested by nearly repeating waveforms in periodic “voiced” portions of speech signals, and the period between these repeating waveforms defines the pitch period. Such voiced speech sounds are produced by periodic oscillations of human vocal cords, which provide a source of periodic excitation for the vocal tract. Unvoiced portions of speech signals are produced by other, non-periodic, sources of excitation and normally do not exhibit any periodicity in a signal waveform.
In speech signal processing, accurate pitch and voicing estimation plays a very important role in speech compression, speech recognition, speech synthesis and many other applications. Pitch determination of speech signals has been a subject of intense research for over forty years. It is generally considered one of the most pervasive and difficult problems in speech analysis. A large number of methods for pitch determination have been developed to date, but so far no definitive solution has emerged. An article by W. Hess provides a survey of the many existing pitch determination methods (Hess, W., “Pitch and voicing determination”, in Advances in speech signal processing, eds. M. M. Sondhi and S. Furui, Marcel Dekker, New York, 1991, pp. 3–48). According to this survey, the majority of well-known pitch-determination methods can be classified as either short-term analysis or time-domain methods. The more reliable and popular techniques in use today are short-term analysis methods, operating on short portions, or frames, of a speech signal.
At present, most of the conventional short-term pitch-determination methods belong to one of the following three groups: (1) methods based on auto- or cross-correlation of a signal, (2) frequency-domain methods analyzing harmonic structure of a signal spectrum and (3) methods based on cepstrum calculation.
None of these conventional methods, however, has been found fully satisfactory for all types of speech signals under realistic conditions, as all of them suffer from serious inherent limitations. For example, correlation-based pitch determination has one major drawback: the presence of secondary peaks due to speech formants (vocal tract resonances), in addition to the main peaks corresponding to the pitch period and its multiples. This property of the correlation function makes the selection of correct peaks very difficult. To circumvent this difficulty, sophisticated post-processing techniques, such as dynamic programming, are commonly used to select proper peaks from computed correlation functions and to produce correct pitch contours. For example, a well-known pitch-tracking algorithm, presently considered “state-of-the-art” and implemented in the ESPS/Waves+ software package, uses normalized cross-correlation and dynamic programming (Talkin, D., “A robust algorithm for pitch tracking (RAPT)” in Speech Coding and Synthesis, Elsevier, 1995, pp. 495–518). However, the drawbacks of correlation-based approaches are inherent in the very nature of a correlation function and, therefore, cannot be avoided. On the other hand, correlation-based methods are general in nature and can be applied to all kinds of signals. Correlation is also relatively immune to noise. At present, correlation-based methods for pitch and periodicity estimation are widely employed in speech coding standards for mobile phones and other speech communication devices.
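As a minimal sketch of the correlation-based approach described above, the following Python fragment picks the highest peak of a normalized autocorrelation function within an allowed lag range. The synthetic three-harmonic frame, the 8 kHz sampling rate, the lag bounds, and the function names are illustrative assumptions and are not taken from any of the cited algorithms. No formant structure is modeled, so the peak-selection difficulty discussed above does not arise in this clean case.

```python
import math

def autocorrelation(x, max_lag):
    """Autocorrelation for lags 0..max_lag, normalized so that r[0] == 1."""
    energy = sum(v * v for v in x)
    return [
        sum(x[n] * x[n + k] for n in range(len(x) - k)) / energy
        for k in range(max_lag + 1)
    ]

def estimate_period(x, min_lag, max_lag):
    """Return the lag (in samples) of the highest autocorrelation peak."""
    r = autocorrelation(x, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda k: r[k])

# Hypothetical test frame: 100 Hz fundamental plus two weaker harmonics.
fs = 8000      # sampling rate in Hz (assumed)
f0 = 100.0     # true fundamental in Hz (assumed)
frame = [
    sum(math.sin(2 * math.pi * h * f0 * n / fs) / h for h in (1, 2, 3))
    for n in range(400)
]

# Search lags from 20 to 160 samples, i.e. 400 Hz down to 50 Hz.
period = estimate_period(frame, min_lag=20, max_lag=160)
print(period, fs / period)
```

For this clean synthetic frame the highest peak falls at the true period of 80 samples. On real speech, formant-related secondary peaks can compete with the true pitch peak, which is why post-processing such as dynamic programming is commonly applied.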
Cepstrum-based methods are not particularly sensitive to speech formants, but tend to be rather sensitive to noise. In addition, a cepstrum-based approach lacks generality: it fails for some simple periodic signals. In particular, it is unable to determine the fundamental period of an extremely band-limited signal, such as a pure sine wave. Some speech sounds are extremely band-limited, and cepstrum-based pitch detectors would therefore fail in such instances, i.e., they would fail on an otherwise clearly periodic signal with a well-defined pitch.
Likewise, frequency-domain pitch-determination methods run into difficulties when the fundamental frequency component is actually missing in a signal, which is often the case with telephone-quality speech signals.
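The missing-fundamental case can be illustrated with a small sketch. The Python fragment below builds a synthetic signal from only the third, fourth, and fifth harmonics of 100 Hz, roughly what a channel that suppresses content below 300 Hz might leave, and computes a direct DFT. All numeric values and names are illustrative assumptions. The sketch shows that the spectrum has essentially no energy at 100 Hz and that its lowest strong peak sits at 300 Hz, even though the waveform still repeats every 80 samples (one 100 Hz period).

```python
import math

def dft_magnitudes(x):
    """Magnitudes of DFT bins 0..N/2, computed directly (O(N^2), demo only)."""
    n_samples = len(x)
    mags = []
    for k in range(n_samples // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * k * n / n_samples)
                 for n, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * n / n_samples)
                 for n, v in enumerate(x))
        mags.append(math.hypot(re, im))
    return mags

fs = 8000                 # sampling rate in Hz (assumed)
f0 = 100.0                # fundamental that is absent from the spectrum (assumed)
n_samples = 400           # gives a DFT bin spacing of fs / n_samples = 20 Hz
period = int(fs / f0)     # 80 samples

# Only harmonics 3, 4 and 5 (300, 400, 500 Hz) are present.
signal = [
    sum(math.sin(2 * math.pi * h * f0 * n / fs) for h in (3, 4, 5))
    for n in range(n_samples)
]

mags = dft_magnitudes(signal)
bin_hz = fs / n_samples
strong_bins = [k for k, m in enumerate(mags) if m > 0.5 * max(mags)]

lowest_peak_hz = strong_bins[0] * bin_hz          # lowest strong spectral peak
energy_at_f0 = mags[round(f0 / bin_hz)]           # magnitude at the 100 Hz bin
repeat_error = max(abs(signal[n + period] - signal[n])
                   for n in range(n_samples - period))

print(lowest_peak_hz, energy_at_f0, repeat_error)
```

A method that simply takes the lowest strong spectral peak as the fundamental would report 300 Hz here. The true period is still recoverable in principle, for example from the 100 Hz spacing between the harmonics, but that requires additional analysis of the harmonic structure.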
Hence, there is a great need for a new pitch determination method that is general in nature, reliable, accurate, and can overcome the limitations of current techniques.
One can think of the following desirable characteristics of an “ideal” (short-term) pitch-determination method.
It should not suffer from the effects associated with speech formants (vocal tract resonances).
It should be general enough to work for all kinds of phase-distorted and band-limited signals, including the case of extremely band-limited signals (e.g., a pure sine wave) and the case of a missing fundamental frequency component.
It should be able to approach a theoretical resolution limit of the time-domain methods. This means, in particular, that it should be capable of measuring a fundamental period using a portion of a signal a little longer than one complete period, at least for clean periodic signals.
It should be resistant to noise.
Evidently, none of the pitch-determination methods in use today comes anywhere close to possessing all of these characteristics. One of the reasons for this deficiency is the linear nature of the signal processing employed by conventional short-term pitch-determination methods.
Speech generation by a human vocal apparatus, meanwhile, is a very complex nonlinear and non-stationary process, of which there is only an incomplete understanding. To achieve a complete and precise understanding of human speech production, it needs to be described in terms of nonlinear fluid dynamics. Unfortunately, this kind of description cannot be used directly for building signal processing devices. Traditionally, though, speech production has been described in terms of a source-filter model, which gives a good approximation for many purposes, but is inherently limited in its ability to model the true dynamics of speech production.
Therefore, it can be advantageous to set aside conventional linear techniques, such as spectral analysis and the source-filter model, and to use a more general nonlinear approach to describe the dynamics of human speech production.
Without making too many simplifying assumptions about speech production, one can state that (voiced) speech is generated by a relatively low-dimensional nonlinear dynamical system. The number of active degrees of freedom of this system and its internal state variables change rapidly over time and are not observable directly. The key issue, then, is how to recover and describe the underlying low-dimensional dynamics from a single one-dimensional observable, e.g., a speech signal.
One of the profound results established in the theory of nonlinear and chaotic systems and signals is the celebrated Takens' embedding theorem, which states that it is possible to reconstruct a state space that is topologically equivalent to the original state space of a dynamical system from a single observable (Takens, F., “Detecting strange attractors in turbulence”, in Lecture Notes in Mathematics, Vol. 898, eds. D. A. Rand and L. S. Young, Springer, Berlin, 1981). Chaos theory and nonlinear time-series analysis have attracted considerable interest over the last two decades (for an overview, see Kantz, H. and Schreiber, T., Nonlinear Time Series Analysis, Cambridge University Press, 1998). Methods developed for analyzing nonlinear and chaotic signals and systems represent a radical departure from traditional linear signal-processing techniques. They are generally based on the concepts of state space (or phase space) of a system and time-series embedding. These techniques have already been tried on many types of signals (chaotic and non-chaotic), including human speech.
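Time-series embedding is commonly carried out with delay coordinates: each state vector is formed from samples of the single observable taken at fixed delays. The Python sketch below shows one minimal way to do this. The embedding dimension of 3, the delay of 20 samples, the sine test signal, and the function name `delay_embed` are illustrative assumptions, not values prescribed by the theorem or by the cited works.

```python
import math

def delay_embed(x, dim, delay):
    """Map a scalar series x to vectors (x[i], x[i+delay], ..., x[i+(dim-1)*delay])."""
    count = len(x) - (dim - 1) * delay
    if count <= 0:
        raise ValueError("series too short for this dimension and delay")
    return [tuple(x[i + j * delay] for j in range(dim)) for i in range(count)]

# Hypothetical observable: a sine wave with a period of 80 samples.
period = 80
x = [math.sin(2 * math.pi * n / period) for n in range(400)]

vectors = delay_embed(x, dim=3, delay=20)

# For a periodic signal the reconstructed trajectory is a closed orbit:
# the state one period later coincides with the current state.
closure_gap = max(
    math.dist(vectors[i], vectors[i + period])
    for i in range(len(vectors) - period)
)
print(len(vectors), closure_gap)
```

The closure gap is zero up to rounding error for this periodic test signal, which reflects the fact that a periodic observable traces a closed curve in the reconstructed state space.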
For example, a book chapter by G. Kubin, “Nonlinear Processing of Speech” (in Speech Coding and Synthesis, Elsevier, 1995, pp. 557–610), describes some of the attempts to use state-space embedding techniques for speech analysis. Evidence is presented that voiced speech sounds, such as vowels, can be sufficiently embedded in a 3-dimensional state space. It is also noted that reconstructed trajectories are periodic for vowels, and that the pitch period can be measured in state space by using Poincaré sections (see also I. Mann and S. McLaughlin, “A nonlinear algorithm for epoch marking in speech signals using Poincaré maps”, Proceedings of EUSIPCO, vol. 2, 1998, pp. 701–704). Yet, a reliable and accurate method for determining the fundamental frequency of a signal from its reconstructed state space has not been introduced to date.
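The idea of measuring a period with a Poincaré section can be sketched as follows. The Python fragment below embeds a synthetic two-harmonic signal in a 3-dimensional delay space, places a plane through a reference point with its normal along the local direction of the trajectory, and records the times at which the trajectory crosses that plane in the forward direction. The time between successive crossings is taken as the period. This is a simplified illustration under stated assumptions (test signal, embedding parameters, choice of plane, linear interpolation, hypothetical function names) and is not the algorithm of any cited publication.

```python
import math

def delay_embed(x, dim, delay):
    """Delay-coordinate vectors (x[i], x[i+delay], ..., x[i+(dim-1)*delay])."""
    count = len(x) - (dim - 1) * delay
    return [tuple(x[i + j * delay] for j in range(dim)) for i in range(count)]

def poincare_periods(vectors, ref_index=0):
    """Times between successive forward crossings of a plane through a reference state."""
    p = vectors[ref_index]
    # Plane normal: approximate flow direction at the reference point.
    normal = [b - a for a, b in zip(p, vectors[ref_index + 1])]
    # Signed (unnormalized) distance of every state from the plane.
    d = [sum((c - pc) * nc for c, pc, nc in zip(v, p, normal)) for v in vectors]
    crossings = []
    for i in range(len(d) - 1):
        if d[i] <= 0.0 < d[i + 1]:
            # Linear interpolation gives a sub-sample crossing time.
            crossings.append(i + (-d[i]) / (d[i + 1] - d[i]))
    return [b - a for a, b in zip(crossings, crossings[1:])]

# Hypothetical quasi-vowel: fundamental period of 80 samples plus a second harmonic.
x = [
    math.sin(2 * math.pi * n / 80) + 0.5 * math.sin(4 * math.pi * n / 80)
    for n in range(400)
]

periods = poincare_periods(delay_embed(x, dim=3, delay=20))
print(periods)
```

For this clean signal each measured interval is very close to 80 samples. Trajectories reconstructed from real speech may cross a given plane more than once per cycle, so a practical version would likely need additional safeguards, such as accepting only crossings that fall near the reference point.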
In view of the above discussion, there remains a need for improved methods and apparatus for detecting periodicity and/or for determining the fundamental frequency of a signal, for example, a speech signal.