Speech signals contain many harmonic parts. Once identified, the fundamental frequency of these harmonic parts can be used for various purposes. One application of the identified fundamental frequency is separation of sound sources. During recording, sounds from multiple sound sources may be recorded simultaneously. The sounds from multiple sound sources include different speech signals, noises (for example, noises from fans) or other similar signals. To further analyze the signals, it is first necessary to separate interfering signals. The identified fundamental frequency can also be used for speech recognition and acoustic scene analysis.
There are various conventional methods of determining the fundamental frequency of harmonic signals. One widely used approach relies on the autocorrelation function, as described, for example, in G. Hu and D. Wang, “Monaural speech segregation based on pitch tracking and amplitude,” IEEE Trans. on Neural Networks, 2004. In this approach, the signal is split into frequency bands by a set of band-pass filters. The autocorrelation is determined for each frequency band, and frequencies in a harmonic relation share peaks in the lag domain. Peaks also occur, however, at lags corresponding to integer multiples and fractions of the true lag. These additional peaks interfere with the main peak when determining the fundamental frequency.
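The peak structure of the autocorrelation approach can be illustrated with a minimal sketch. It is not the method of the cited reference: the band-pass filter bank is omitted, and the sample rate, the fundamental frequency, the harmonic amplitudes, and all function names are assumptions chosen for illustration. A synthetic three-harmonic signal shows peaks at the true lag and at its multiple, while a channel containing only the first harmonic shows additional peaks at fractions of the true lag.

```python
import math

SAMPLE_RATE = 8000   # Hz (assumed)
F0 = 200.0           # Hz (assumed); true lag = 8000 / 200 = 40 samples
MAX_LAG = 100        # largest lag evaluated, in samples
WINDOW = 800         # samples summed per lag (20 full periods)


def harmonic(order, length):
    """Sine wave at order * F0, sampled at SAMPLE_RATE."""
    return [math.sin(2 * math.pi * order * F0 * t / SAMPLE_RATE)
            for t in range(length)]


def autocorrelation(signal, max_lag, window):
    """Unnormalized autocorrelation for lags 0..max_lag over a fixed window."""
    return [sum(signal[i] * signal[i + lag] for i in range(window))
            for lag in range(max_lag + 1)]


def strong_peaks(acf, ratio=0.9):
    """Lags that are local maxima and reach `ratio` of the zero-lag value."""
    return [k for k in range(1, len(acf) - 1)
            if acf[k - 1] < acf[k] >= acf[k + 1] and acf[k] >= ratio * acf[0]]


length = WINDOW + MAX_LAG

# Fundamental plus two harmonics with decaying amplitudes (assumed values).
mixture = [a + b / 2 + c / 3
           for a, b, c in zip(harmonic(1, length),
                              harmonic(2, length),
                              harmonic(3, length))]
mixture_peaks = strong_peaks(autocorrelation(mixture, MAX_LAG, WINDOW))

# A channel that contains only the first harmonic (2 * F0).
channel = harmonic(2, length)
channel_peaks = strong_peaks(autocorrelation(channel, MAX_LAG, WINDOW))

print(mixture_peaks)   # lags of the strong peaks of the full signal
print(channel_peaks)   # lags of the strong peaks of the single channel
```

In this sketch the full signal has strong peaks at the true lag of 40 samples and at its multiple of 80 samples, whereas the single-harmonic channel also peaks at 20 and 60 samples.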
U.S. patent application Ser. No. 11/340,918 filed on Jan. 26, 2006, entitled “Determination of a common Fundamental Frequency of Harmonic Signals” by the same inventors describes a method of replacing the auto-correlation with the calculation of the distances between zero crossings of several orders in the individual frequency channels that also share peaks in the lag/distance domain. In other words, the fundamental frequency of the channels is estimated by calculating the zero crossing distances. If harmonics originate from the same fundamental frequency, the harmonics share zero crossing distances.
As described in U.S. patent application Ser. No. 11/340,918 and the article by Martin Heckmann and Frank Joublin, “Sound Source Separation for a Robot Based on Pitch,” International Conference on Intelligent Robots and Systems (IROS), Edmonton, Canada, pp. 203-208 (August 2005), the distance between two zero crossings in the channel of the fundamental frequency can be found again as the distance between three zero crossings in the first harmonic and the distance between four zero crossings in the second harmonic.
The distances between three or four zero crossings are also referred to as higher order zero crossing distances, namely second and third order, respectively. With this approach, however, spurious side peaks also emerge.
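The relation between the zero crossing distances of different orders can be sketched as follows. The sketch assumes that only rising zero crossings are counted, which the text above does not specify, and it uses ideal sinusoids in place of band-pass filtered channels. The sample rate, the fundamental frequency, the phase offset, and all function names are likewise assumptions.

```python
import math

SAMPLE_RATE = 8000   # Hz (assumed)
F0 = 200.0           # Hz (assumed); fundamental period = 40 samples
PHASE = 0.3          # radians (assumed); keeps crossings off the sample grid


def channel(order, length):
    """Idealized channel holding only the sinusoid at order * F0."""
    return [math.sin(2 * math.pi * order * F0 * t / SAMPLE_RATE + PHASE)
            for t in range(length)]


def rising_zero_crossings(signal):
    """Linearly interpolated positions where the signal turns non-negative."""
    crossings = []
    for i in range(1, len(signal)):
        a, b = signal[i - 1], signal[i]
        if a < 0 <= b:
            crossings.append(i - 1 + a / (a - b))
    return crossings


def crossing_distances(crossings, order):
    """Distance from each crossing to the crossing `order` positions later."""
    return [crossings[i + order] - crossings[i]
            for i in range(len(crossings) - order)]


length = 400
# First order distances in the channel of the fundamental frequency.
first = crossing_distances(rising_zero_crossings(channel(1, length)), 1)
# Second order distances (three crossings) in the first harmonic.
second = crossing_distances(rising_zero_crossings(channel(2, length)), 2)
# Third order distances (four crossings) in the second harmonic.
third = crossing_distances(rising_zero_crossings(channel(3, length)), 3)

print(first[0], second[0], third[0])
```

Under these assumptions all three lists contain the same distance, the fundamental period of 40 samples, which is the shared value used to identify the common fundamental frequency.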
An article by H. Duifhuis and R. Sluyter, “Measurement of pitch in speech: An implementation of Goldstein's theory of pitch perception,” J. Acoust. Soc. Am. pp. 1568-80, (1982) discloses a different approach. This article describes a comb filter, also called a ‘harmonic sieve,’ that is set up with teeth at the fundamental frequency and its harmonics. The energy at each tooth is summed up for different fundamental frequency hypotheses. When the hypothesis and the true fundamental frequency coincide, all the teeth in the comb have high energy, resulting in a maximum. As in the previous methods, side peaks again occur at the harmonics and sub-harmonics of the true fundamental frequency.
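The summation over the teeth of the comb can be sketched as follows. This is a simplified illustration and not the implementation of the cited article: the spectrum is reduced to a few components of equal energy, and the number of teeth, the relative tolerance of each tooth, the hypothesis grid, and the function name are assumptions.

```python
def sieve_score(components, f0, n_teeth=6, tolerance=0.03):
    """Sum the energy of all components that fall on a tooth of the comb.

    components: mapping from frequency in Hz to energy.
    Teeth sit at f0, 2 * f0, ..., n_teeth * f0, and each tooth accepts
    components within a relative `tolerance` of its center frequency.
    """
    score = 0.0
    for k in range(1, n_teeth + 1):
        tooth = k * f0
        score += sum(energy for freq, energy in components.items()
                     if abs(freq - tooth) <= tolerance * tooth)
    return score


# Synthetic harmonic signal: six partials of a 200 Hz fundamental (assumed).
components = {200.0 * h: 1.0 for h in range(1, 7)}

# Fundamental frequency hypotheses from 50 Hz to 500 Hz in 50 Hz steps.
hypotheses = range(50, 501, 50)
scores = {f0: sieve_score(components, f0) for f0 in hypotheses}
best = max(scores, key=scores.get)

for f0 in hypotheses:
    print(f0, scores[f0])
```

In this sketch the maximum is reached at the true fundamental frequency of 200 Hz, where all six teeth collect energy. The hypotheses at 100 Hz (a sub-harmonic) and 400 Hz (a harmonic) each collect energy at three teeth and thus form side peaks.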