The human voice is produced by the vibration of the vocal folds and the resonance of the phonatory organs. It is known that a human being produces various sounds and changes the loudness and pitch of the voice by controlling the vocal folds to change the frequency of their vibration or by changing the positions of phonatory organs such as the nose and tongue, namely by changing the shape of the vocal tract. It is also known that, when the sound of a voice is considered as an acoustic signal, the signal contains spectral envelope components, which change gradually with frequency, and spectral fine structure components, which change periodically in a short time (in the case of voiced vowels and the like) or aperiodically (in the case of consonants and unvoiced vowels). The former, the spectral envelope components, represent the resonance features of the phonatory organs and are used as features indicating the shapes of the human throat and mouth, for example, as features for speech recognition. The latter, the spectral fine structure components, represent the periodicity of the sound source and are used as features indicating the fundamental period of the vocal folds, namely the voice pitch. The spectrum of a speech signal is expressed by the product of these two elements. A signal which contains the latter component, and which clearly indicates the fundamental period and its harmonic components, particularly in a vowel part or the like, is also said to have a harmonic structure.
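The product relation stated above can be written compactly as follows. The symbols S, H and G are notation assumed here for illustration only and do not appear in the original text: S denotes the speech spectrum, H the spectral envelope component and G the spectral fine structure component. Taking logarithms turns the product into a sum, which is why log-spectral and cepstral analysis are commonly used to separate the two components.

```latex
|S(\omega)| = |H(\omega)| \, |G(\omega)|
\quad\Longrightarrow\quad
\log |S(\omega)| = \log |H(\omega)| + \log |G(\omega)|
```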
Conventionally, various methods for detecting a speech segment in an input acoustic signal have been suggested. They are roughly classified into the following: a method for identifying a speech segment using amplitude information, such as frequency band power and spectral envelope, indicating the rough shape of the spectrum of an input acoustic signal (hereinafter referred to as “method 1”); a method for detecting the opening and closing of a mouth in a video by analyzing it (“method 2”); a method for detecting a speech segment by comparing an acoustic model which represents speech and noise with the feature of an input acoustic signal (“method 3”); and a method for determining a speech segment by focusing attention on a speech spectral envelope shape determined by the shape of a vocal tract and a harmonic structure which is created by the vibration of vocal folds, which are both the features of articulatory organs (“method 4”).
However, method 1 has an inherent problem in that it is difficult to distinguish between speech and noise based on amplitude information alone. In method 1, therefore, a speech segment and a noise segment are assumed, and the speech segment is detected by relearning the threshold value used to distinguish between the two. Consequently, when the amplitude of the noise segment becomes large relative to the amplitude of the speech segment during the learning process (namely, when the speech signal-to-noise ratio (hereinafter referred to as “SNR”) becomes low), the accuracy of the assumed noise and speech segments itself affects the performance and reduces the accuracy of the threshold learning. As a result, the performance of speech segment detection is degraded.
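The kind of amplitude-based detection with a relearned threshold discussed above can be illustrated by the following minimal sketch. It is not taken from any cited method: the function names, the 6 dB margin, the adaptation rate and the assumption that the first frame is noise are all hypothetical choices made for illustration.

```python
import math

def frame_power_db(frame):
    """Mean power of one frame in decibels (a small floor avoids log(0))."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(power + 1e-12)

def detect_speech_frames(frames, margin_db=6.0, adapt=0.1):
    """Label each frame as speech (True) or noise (False).

    Assumption: the first frame is noise. The noise level is then relearned
    only from frames that were themselves labelled as noise, so any wrong
    label feeds back into the threshold.
    """
    noise_db = frame_power_db(frames[0])
    labels = []
    for frame in frames:
        level = frame_power_db(frame)
        is_speech = level > noise_db + margin_db
        if not is_speech:
            # Relearn the noise level (and thus the threshold) from this frame.
            noise_db = (1.0 - adapt) * noise_db + adapt * level
        labels.append(is_speech)
    return labels
```

With a quiet background the louder frames are separated cleanly, but when the noise amplitude approaches the speech amplitude the level difference falls below the margin, the speech frames are absorbed into the noise estimate, and detection fails, which is the degradation described above.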
In method 2, the detection and estimation accuracy for a speech segment can be kept constant regardless of the SNR if, for example, the opening of the mouth during the speech segment is detected using only an image and no sound input. However, image processing costs more than speech signal processing, and a speech segment cannot be detected if the mouth does not face the camera.
Method 3 ensures performance under the assumed environmental noise, but it is difficult to anticipate the noise itself, so this method is usable only in limited environments. Although a technique for learning the noise environment on site has been suggested for this method, the performance of that technique depends on the accuracy of the learning method and is degraded accordingly, as is the case with the method using amplitude information (i.e., method 1).
On the other hand, method 4 has been suggested, in which a speech segment is detected by focusing attention on the spectral envelope shape determined by the vocal tract shape as well as the harmonic structure created by the vibration of the vocal folds, both of which are features of the articulatory organs.
The methods using the spectral envelope shape include a method for evaluating the continuity of band power, for example, cepstra. In this method, the performance is degraded because it is hard to distinguish noise offset components when the SNR is low.
Pitch detection is one of the methods that focus attention on the harmonic structure, and various such methods have been suggested, such as a method for extracting an auto-correlation and a higher frequency part in the time domain and a method for creating an auto-correlation in the frequency domain. However, these methods have problems; for example, it is difficult to extract a speech segment if the signal does not have a single pitch (harmonic fundamental frequency), and extraction errors are likely to occur due to environmental noise.
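As an illustration of the time-domain auto-correlation approach mentioned above, the following is a minimal sketch that is not taken from any cited method. The function name, the 60 to 400 Hz search range and the 0.3 acceptance score are hypothetical values chosen for illustration.

```python
import math

def autocorr_pitch(frame, sample_rate, f_min=60.0, f_max=400.0, min_score=0.3):
    """Estimate one fundamental frequency in Hz, or return None.

    The lag with the largest normalized auto-correlation inside the search
    range is taken as the pitch period. Only a single pitch is returned.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    energy = sum(x * x for x in frame)
    if energy == 0.0 or lag_min >= lag_max:
        return None
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Auto-correlation at this lag, normalized by the frame energy.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        score /= energy
        if score > best_score:
            best_lag, best_score = lag, score
    if best_lag is None or best_score < min_score:
        return None
    return sample_rate / best_lag
```

Because the sketch picks one best lag per frame, it has no way to represent a signal containing several simultaneous pitches, and noise that lowers or shifts the correlation maximum directly produces an extraction error.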
Additionally, there is a well-known technique of accentuating, suppressing, or separating and extracting an acoustic signal having a harmonic structure, such as a human voice or a specific musical instrument, from an acoustic signal consisting of a mixture of several kinds of acoustic signals. For example, the following methods have been suggested: for speech signals, a noise reduction device which reduces only noise in an acoustic signal consisting of a mixture of noise signals and speech signals (See, for example, Japanese Laid-Open Patent Application No. 09-153769 Publication); and for music signals, a method for separating and removing a melody included in a played music signal (See, for example, Japanese Laid-Open Patent Application No. 11-143460 Publication).
However, according to the method described in Japanese Laid-Open Patent Application No. 09-153769 Publication, speech and non-speech are detected by observing a linear predictive residual signal in each frequency band of an input signal. Therefore, this method has a problem that the performance is degraded under non-stationary noise conditions with a low SNR, in which linear prediction does not work well.
The method described in Japanese Laid-Open Patent Application No. 11-143460 Publication uses a feature specific to melodies in music, namely that a sound of the same pitch continues for a predetermined period of time. Therefore, there is a problem that it is difficult to use this method, as it is, to separate speech from noise. In addition, the large amount of processing required for this method becomes a problem if one does not want to separate or remove acoustic components.
A method using the acoustic feature itself which represents a harmonic structure as an evaluation function has also been suggested (See, for example, Japanese Laid-Open Patent Application No. 2001-222289 Publication). FIG. 32 is a block diagram showing an outline structure of a speech segment determination device which uses the method suggested in Japanese Laid-Open Patent Application No. 2001-222289 Publication.
The speech segment detection device 10 shown in FIG. 32 is a device which determines a speech segment in an input signal, and includes a fast Fourier transform (FFT) unit 100, a harmonic structure evaluation unit 101, a harmonic structure peak detection unit 102, a pitch candidate detection unit 103, an inter-frame amplitude difference harmonic structure evaluation unit 104 and a speech segment determination unit 105.
The FFT unit 100 performs FFT processing on an input signal for each frame (for example, one frame is 10 msec) so as to perform frequency transform on the input signal, and carries out various analyses thereof. The harmonic structure evaluation unit 101 evaluates whether or not each frame has a harmonic structure based on the frequency analysis result obtained from the FFT unit 100. The harmonic structure peak detection unit 102 converts the harmonic structure extracted by the harmonic structure evaluation unit 101 into the local peak shape, and detects the local peak.
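The frame-wise frequency analysis and local peak detection described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the cited publication: a naive discrete Fourier transform stands in for the FFT, and the function names and the 10 percent magnitude floor are hypothetical.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. N/2 (naive O(N^2) stand-in for an FFT)."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def local_peaks(spectrum, floor_ratio=0.1):
    """Indices of bins larger than both neighbours and above a magnitude floor."""
    floor = floor_ratio * max(spectrum)
    return [
        k
        for k in range(1, len(spectrum) - 1)
        if spectrum[k] > spectrum[k - 1]
        and spectrum[k] > spectrum[k + 1]
        and spectrum[k] > floor
    ]
```

For a 10 msec frame sampled at 8 kHz (80 samples), bin k corresponds to k × 100 Hz, so a voiced frame with a 200 Hz fundamental would show peaks at bins 2, 4, 6 and so on.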
The pitch candidate detection unit 103 detects a pitch by tracking the local peaks detected by the harmonic structure peak detection unit 102 in the time axis direction (frame direction). A pitch denotes the fundamental frequency of a harmonic structure.
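Tracking local peaks in the frame direction can be illustrated with the greedy sketch below. The linking rule is an assumption made for illustration and is not the tracking method of the cited publication; the function name and the `max_jump` tolerance are hypothetical.

```python
def track_peaks(peaks_per_frame, max_jump=1):
    """Link per-frame peak bins into tracks of (frame_index, bin) pairs.

    A track is continued by the nearest unused peak within max_jump bins.
    A track with no such peak ends (disappearance), and every peak left
    unused starts a new track (appearance).
    """
    active, finished = [], []
    for frame_index, peaks in enumerate(peaks_per_frame):
        unused = sorted(set(peaks))
        still_active = []
        for track in active:
            last_bin = track[-1][1]
            candidates = [p for p in unused if abs(p - last_bin) <= max_jump]
            if candidates:
                nearest = min(candidates, key=lambda p: abs(p - last_bin))
                unused.remove(nearest)
                track.append((frame_index, nearest))
                still_active.append(track)
            else:
                finished.append(track)
        # Peaks that continued no existing track start new tracks.
        still_active.extend([(frame_index, p)] for p in unused)
        active = still_active
    return finished + active
```

For the input `[[2, 4], [2, 4], [3], [3, 6]]`, the single peak at bin 3 in the third frame lies within one bin of both existing tracks; the greedy rule assigns it to the first track and ends the second, which shows how the appearance and disappearance of peaks makes the tracking result depend on the linking rule.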
The inter-frame amplitude difference harmonic structure evaluation unit 104 calculates the value of the inter-frame difference of the amplitudes obtained as a result of the frequency analysis by the FFT unit 100, and evaluates whether or not the current frame has a harmonic structure based on the difference value.
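The inter-frame amplitude difference described above can be sketched as a per-bin subtraction of magnitude spectra. This is a minimal illustration under the assumption that the two frames are given as equal-length lists of magnitudes; the function name is hypothetical and the evaluation step that follows the subtraction is not shown.

```python
def inter_frame_difference(prev_spectrum, cur_spectrum):
    """Per-bin amplitude difference between two consecutive frames."""
    if len(prev_spectrum) != len(cur_spectrum):
        raise ValueError("both spectra must have the same number of bins")
    return [cur - prev for prev, cur in zip(prev_spectrum, cur_spectrum)]
```

Two properties follow directly from the subtraction: a harmonic spectrum that is unchanged between frames yields a difference of zero in every bin, and any component that appears abruptly in the current frame is carried into the difference unchanged.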
The speech segment determination unit 105 makes a comprehensive determination of the pitch detected by the pitch candidate detection unit 103 and the evaluation result by the inter-frame amplitude difference harmonic structure evaluation unit 104 so as to determine the speech segment.
According to the speech segment detection device 10 shown in FIG. 32, it becomes possible to determine a speech segment not only in an acoustic signal having a single pitch but also in an acoustic signal having a plurality of pitches.
However, when the pitch candidate detection unit 103 tracks local peaks, appearance and disappearance of such local peaks have to be considered, and it is difficult to detect the pitch with high accuracy considering such appearance and disappearance.
Moreover, because this method handles peaks, which are local maximum values, high robustness to noise cannot be expected. In addition, the inter-frame amplitude difference harmonic structure evaluation unit 104 evaluates whether or not the difference between frames has a harmonic structure in order to evaluate temporal fluctuations. However, since it just uses the difference of amplitudes, there is the problem that not only is the information of the harmonic structure lost, but also the acoustic feature of a sudden noise is itself evaluated as a difference value if such a sudden noise occurs.
Against this backdrop, the present invention has been conceived in order to solve the above-mentioned problems, and it is an object of the present invention to provide a harmonic structure acoustic signal detection method and device which allow highly accurate detection of a speech segment, not depending on the level fluctuations of an input signal.
It is another object thereof to provide a harmonic structure acoustic signal detection method and device with excellent real-time performance.