1. Field of the Invention
The present invention relates to the pitch detection of speech signals for various applications, and in particular, to a method and system providing pitch detection of speech signals for use in various audio effects, karaoke, scoring, voice recognition, etc.
2. Description of the Related Art
Pitch detection of speech signals finds applications in various audio effects, karaoke, scoring, voice recognition, etc. The pitch of a signal is the fundamental frequency of vibration of the source of the tone.
Speech signals can be divided into two types of segment: voiced speech and unvoiced speech. Voiced speech is produced using the vocal cords and is generally modeled as a filtered train of impulses within a frequency range. Unvoiced speech is generated by forcing air through a constriction in the vocal tract. Pitch detection involves the determination of the continuous pitch period during the voiced segments of speech.
The terms “speech” and “speech signal” are a broad reference to all forms of generated audio or sound. For example, “speech” and its associated “speech signal” can refer to talking, singing, attempted singing, whistling, humming, a recital, etc. The “speech” and “speech signal” can originate from an individual or a group, being human, animal or otherwise. The “speech” could also be artificially generated, for example by a computer or other electronic device.
There are presently known techniques for pitch detection (see W. Hess, “Pitch Determination of Speech Signals: Algorithms and Devices”, Springer-Verlag, 1983). A time-based pitch detector estimates the pitch period by determining the glottal closure instant (GCI) and measuring the time period between each “event”. Frequency domain pitch detection can then be used to determine the pitch. Thus, the speech signal is processed period-by-period.
Autocorrelation Techniques
Correlation is the measure of similarity of two input functions, and in the case of the autocorrelation function Γ(d), the input functions are the same signal x(n), as shown in Equation 1,
$$\Gamma(d) = \lim_{N \to \infty} \frac{1}{2N + 1} \sum_{n=-N}^{+N} x(n) \cdot x(n + d) \tag{1}$$
where d represents the lag or delay between the signal and a delayed segment, and N represents the number of samples of the input under consideration. If the signal is periodic or quasi-periodic, the similarities between x(n) and x(n+d) are higher. The correlation coefficients are also high if the lag is equal to a period or a multiple of a period.
As the autocorrelation function (ACF) is the Inverse Fourier Transform of the power spectrum of the input signal, the pitch is chosen as the frequency ƒs/d, where d is the non-zero lag at which the maximum of the ACF occurs and ƒs is the sampling frequency of the speech signal. Complications due to unknown phase relations and formant structures do not arise, as the technique is independent of these parameters.
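The lag search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any cited reference: the infinite limit of Equation 1 is replaced by a sum over one finite frame, the sum is normalised by the frame length, and the function names, the search range (`f_min`, `f_max`) and the test signal are all hypothetical.

```python
import math

def autocorrelation(x, d):
    """Finite-frame stand-in for Equation 1 (assumption: one frame, not a limit).

    The sum is divided by the full frame length, so longer lags, which have
    fewer overlapping samples, receive slightly smaller values.
    """
    return sum(x[n] * x[n + d] for n in range(len(x) - d)) / len(x)

def acf_pitch(x, fs, f_min=80.0, f_max=500.0):
    """Return fs / d for the lag d in the search range with the largest ACF."""
    d_min = int(fs / f_max)  # shortest period considered, in samples
    d_max = int(fs / f_min)  # longest period considered, in samples
    best_lag = max(range(d_min, d_max + 1), key=lambda d: autocorrelation(x, d))
    return fs / best_lag

# Hypothetical test frame: 200 Hz fundamental plus a weaker second harmonic.
fs = 8000.0
f0 = 200.0
frame = [
    math.sin(2 * math.pi * f0 * n / fs) + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
    for n in range(400)
]
estimate = acf_pitch(frame, fs)
```

Because the correlation is also high at multiples of the period, the lag of two periods competes with the lag of one period. Dividing by the frame length rather than by the number of overlapping samples is one simple way to favour the shorter lag in this sketch; it is a design choice of the example, not something the text prescribes.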
Average Magnitude Difference Function
Signals that are similar differ little from one another. Thus, periodicity can be detected by investigating the global deviation between a signal and a delayed copy of itself. The Average Magnitude Difference Function (AMDF) is defined as follows:
$$\mathrm{AMDF}(d) = \frac{1}{K} \sum_{n=q}^{q+K-1} \left| x(n) - x(n + d) \right| \tag{2}$$
where K is the number of samples in a frame and q is the initial sample of the frame. AMDF has a strong minimum when the lag d is equal to the period of the input x(n). This minimum is exactly zero if the input is exactly periodic and the frequency (ƒs/d) denotes the pitch of the signal. The algorithm is phase insensitive as the harmonics are removed without regard to their phase.
Component Frequency Ratios
An advantage of operating in the frequency domain in contrast to other domains is that the accuracy of the pitch estimate can be improved by interpolation techniques. Due to the Short Time Fourier Transformation principles used, the frequency resolution at the higher end of the spectrum is greater than at the lower end of the spectrum. Also, the fundamental might have a weak amplitude and hence it is usually computed as ratios of harmonic frequencies or the difference between adjacent spectral peaks caused by higher harmonics.
In cases where the fundamental is absent, it is sufficient to measure the distance between the adjacent or even non-adjacent peaks of the spectrum, representing the higher harmonics of the periodic or quasi-periodic signal. The ratios of the higher frequency harmonics are more accurate as the frequency resolution improves at higher frequencies. The greatest common factor is the pitch of the speech signal.
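As a rough illustration of recovering the pitch from higher harmonics when the fundamental is absent, the following Python sketch estimates the spacing between adjacent spectral peaks and then refines it using the implied harmonic numbers. It assumes the peak frequencies have already been measured and that they belong to adjacent harmonics; the function name and the sample peak values are hypothetical.

```python
from statistics import median

def pitch_from_harmonics(peaks_hz):
    """Estimate the common factor of a set of adjacent harmonic peaks.

    Step 1: take the median distance between neighbouring peaks as a first
            guess of the fundamental.
    Step 2: assign each peak its nearest harmonic number and refine the
            guess as (sum of peak frequencies) / (sum of harmonic numbers).
    """
    peaks = sorted(peaks_hz)
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    first_guess = median(spacings)
    harmonic_numbers = [round(f / first_guess) for f in peaks]
    return sum(peaks) / sum(harmonic_numbers)

# Hypothetical measured peaks: harmonics 2 to 5 of a 200 Hz tone, with small
# measurement errors and no peak at the fundamental itself.
measured_peaks = [402.0, 598.0, 801.0, 1000.0]
estimate = pitch_from_harmonics(measured_peaks)
```

If the detected peaks are not adjacent harmonics, the median spacing is no longer a reliable first guess, and a search over candidate common factors would be needed instead.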
Time Domain Techniques
Autocorrelation techniques are susceptible to frequency overlap problems, also referred to as pitch halving or pitch doubling. Also, an autocorrelation has to be computed over a wide range of lags to determine the optimum pitch. Though a rough idea of the pitch can be obtained from the number of zero-crossings, the number of operations required for accurate pitch detection can be computationally intensive.
The AMDF algorithm is susceptible to intensity variations, noise and low frequency spurious signals, which directly affect the magnitude of the principal minimum at the pitch period T0.
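The effect of an intensity variation on the AMDF minimum can be illustrated with a short Python sketch of Equation 2. The function names, the frame length K = 200, the restricted search range and the test signals are assumptions chosen for the example; in particular, the range is limited so that it contains only one period of the test tone.

```python
import math

def amdf(x, d, q=0, K=200):
    """Equation 2: mean of |x(n) - x(n + d)| for n = q .. q + K - 1."""
    return sum(abs(x[n] - x[n + d]) for n in range(q, q + K)) / K

def amdf_pitch(x, fs, f_min, f_max, K=200):
    """Return fs / d for the lag d in the search range with the smallest AMDF."""
    d_min = int(fs / f_max)
    d_max = int(fs / f_min)
    best_lag = min(range(d_min, d_max + 1), key=lambda d: amdf(x, d, 0, K))
    return fs / best_lag

# Hypothetical test frame: 200 Hz fundamental (period of 40 samples at 8 kHz)
# plus a weaker second harmonic.
fs = 8000.0
f0 = 200.0
clean = [
    math.sin(2 * math.pi * f0 * n / fs) + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
    for n in range(400)
]
# Same waveform with a slow linear decrease in intensity across the frame.
fading = [(1.0 - n / 800.0) * s for n, s in enumerate(clean)]

pitch = amdf_pitch(clean, fs, f_min=120.0, f_max=500.0)
clean_minimum = amdf(clean, 40)    # essentially zero for the periodic input
fading_minimum = amdf(fading, 40)  # raised above zero by the intensity change
```

For the exactly periodic frame the minimum at the true period is zero up to rounding error, whereas the fading frame leaves a clearly non-zero residual at the same lag, which makes the principal minimum shallower and harder to separate from its neighbours.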
Frequency Domain Techniques
Since it is impractical to handle large segments of the input signal, the discrete version of the Short Time Fourier Transform (STFT), as proposed by Portnoff (M. R. Portnoff, “Implementation of the Digital Phase Vocoder Using the Fast Fourier Transform”, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-24, pp. 243-248, June 1976), can be used in the signal analysis. Short time segments of the signal are “windowed” according to Fourier's theorem, which states that any periodic waveform can be modeled as a sum of sinusoids with varying amplitudes and frequencies.
A fundamental problem that arises with the STFT is “smearing” of the frequency response, which is illustrated in FIGS. 1(a)-1(d) (prior art). If the signal frequency coincides with one of the “bin” frequencies of the STFT, the original amplitude is retained after the STFT. However, if the signal frequency lies between two adjacent bin frequencies of the STFT, the energy is spread over the entire spectrum, as is comparatively illustrated in FIGS. 1(a) and 1(b), where the y-axis presents signal amplitude on a logarithmic scale. Also, in the latter case, as the peak frequency lies between two adjacent frequency bins, the detected amplitude is lower. This is comparatively illustrated in FIGS. 1(c) and 1(d), which plot the amplitude spectrum on a linear scale. If the amplitude of the pitch frequency is too small, it might not be identified as a potential candidate. Hence, it is critical to determine the true frequency of the signal.
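The smearing effect can be reproduced numerically with a small Python sketch. It is an illustration only and does not reproduce the figures: it assumes a single 64-sample frame with no tapering window, uses a direct discrete Fourier transform rather than an FFT, and the function name, frame length and test frequencies are hypothetical.

```python
import cmath
import math

def dft_magnitudes(x):
    """Amplitude spectrum of one frame for bins 0 .. N/2 - 1.

    Magnitudes are scaled by 2 / N so that a unit-amplitude sinusoid lying
    exactly on a bin reads as 1.0.
    """
    N = len(x)
    return [
        abs(sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))) * 2 / N
        for k in range(N // 2)
    ]

N = 64
# Unit-amplitude sinusoid with exactly 8 cycles per frame (on a bin).
on_bin = [math.sin(2 * math.pi * 8.0 * n / N) for n in range(N)]
# Unit-amplitude sinusoid with 8.5 cycles per frame (midway between bins).
off_bin = [math.sin(2 * math.pi * 8.5 * n / N) for n in range(N)]

on_spectrum = dft_magnitudes(on_bin)
off_spectrum = dft_magnitudes(off_bin)

peak_on = max(on_spectrum)    # full amplitude is retained
peak_off = max(off_spectrum)  # noticeably lower than the true amplitude

# Number of bins carrying more than 5% of the true amplitude.
spread_on = sum(1 for m in on_spectrum if m > 0.05)
spread_off = sum(1 for m in off_spectrum if m > 0.05)
```

In this sketch the on-bin tone occupies a single bin at its full amplitude, while the off-bin tone shows a reduced peak and energy in many neighbouring bins, which is the behaviour that motivates estimating the true frequency rather than relying on the largest bin alone.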
This identifies a need for pitch detection of speech signals which overcomes or at least ameliorates the problems inherent in the prior art.