1. Field of the Invention
The present invention relates to a signal processing system and method for detecting an intended signal section and a noise signal section in a wave signal, such as light, a sound, an ultrasonic wave, or an electromagnetic wave, propagating through a medium. The term “medium” through which a wave signal propagates includes all the media, spaces, and locations through which a wave may propagate.
2. Description of the Related Art
An input signal obtained by receiving a wave signal from an intended wave source is likely to contain a noise signal in addition to the intended signal. When the noise level is high, the precision with which the intended signal is processed is degraded. Particularly in an application using speech recognition, when the noise level is high, a voice signal that is the intended signal cannot be recognized correctly. Therefore, in voice signal processing, it is important to detect an intended signal section and a noise signal section other than the intended signal section and to separate them from each other.
In the prior art, in order to separate an intended signal section from a noise signal section, separation processing based on a change in the power of an input voice signal has been widely used. The basic principle thereof is as follows. The power of an input voice signal is checked, and a section in which the power exceeds a threshold value is identified as an intended signal section and separated.
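The power-based principle described above can be illustrated by the following minimal sketch in Python. The frame length, the threshold value, the sample values, and the function names `frame_power` and `detect_sections` are assumptions made only for illustration; they are not taken from any particular prior-art apparatus.

```python
def frame_power(samples, frame_len):
    """Return the mean squared amplitude of each non-overlapping frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def detect_sections(samples, frame_len, threshold):
    """Label a frame True (intended signal) when its power exceeds the threshold."""
    return [p > threshold for p in frame_power(samples, frame_len)]


# Assumed test input: a quiet frame, a loud frame, and another quiet frame.
signal = [0.01, -0.01, 0.01, -0.01,
          0.5, -0.5, 0.5, -0.5,
          0.01, -0.01, 0.01, -0.01]
labels = detect_sections(signal, frame_len=4, threshold=0.01)
print(labels)
```

In this sketch only the middle frame is labeled as an intended signal section, because only its power exceeds the assumed threshold value.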
Another method of separating an intended signal section from a noise signal section is as follows. The direction of arrival of an input signal is detected. When the arrival direction of the input signal matches the direction in which a wave source transmitting an intended signal is assumed to be present, the input signal is regarded as an intended signal section and separated. Input signals from directions other than the direction in which the wave source is assumed to be present are regarded as noise signals. In the prior art, delay time detection processing using a correlation function and the like are known as methods for detecting the arrival direction of an input signal.
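Delay time detection using a correlation function may be sketched as follows. The sketch assumes two sensors whose outputs are `mic_a` and `mic_b`, and assumes that an arrival direction corresponds to an expected delay expressed in samples; the function names, the tolerance, and the pulse values are hypothetical and serve only to illustrate the principle.

```python
def cross_correlation(a, b, lag):
    """Sum of a[n] * b[n + lag] over the samples where both signals are defined."""
    total = 0.0
    for n in range(len(a)):
        m = n + lag
        if 0 <= m < len(b):
            total += a[n] * b[m]
    return total


def estimate_delay(a, b, max_lag):
    """Return the lag in samples that maximizes the correlation.

    A positive result means that b is delayed relative to a.
    """
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(a, b, lag))


def matches_expected_direction(a, b, expected_lag, tolerance, max_lag):
    """True when the estimated delay agrees with the delay assumed for the intended source."""
    return abs(estimate_delay(a, b, max_lag) - expected_lag) <= tolerance


# Assumed example: the same pulse reaches the second sensor two samples later.
pulse = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic_a = pulse
mic_b = [0.0, 0.0] + pulse[:-2]

print(estimate_delay(mic_a, mic_b, max_lag=3))
```

A section whose estimated delay agrees with the delay assumed for the intended wave source would be treated as an intended signal section, and any other section would be treated as a noise signal section.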
In a telephone and a speech recognition apparatus, in order to enhance ease of listening and the speech recognition rate, noise suppression processing is often added to the above-mentioned processing of detecting an intended signal section and a noise signal section. As conventional noise suppression processing, spectrum subtraction processing is widely known. The spectrum subtraction processing is conducted as follows. An input signal is converted into a spectrum in the frequency domain by Fourier transform, and thereafter, a noise spectrum model is estimated in a noise signal section. The estimated noise spectrum is subtracted from the spectrum of the input signal in an intended signal section to remove a noise signal, and the resultant signal is returned to the time domain by inverse Fourier transform.
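The spectrum subtraction processing described above may be sketched for a single frame as follows. This is a simplified illustration, not a definitive implementation: it assumes magnitude-domain subtraction with the result floored at zero, reuse of the phase of the input spectrum, and a direct (non-fast) Fourier transform. The function names and the test tones are hypothetical.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform, returning the real part of each time-domain sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def estimate_noise_magnitude(noise_frames):
    """Average the magnitude spectra of frames taken from a noise signal section."""
    spectra = [[abs(c) for c in dft(frame)] for frame in noise_frames]
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def spectral_subtract(frame, noise_mag):
    """Subtract the noise magnitude per bin, keep the input phase, return to the time domain."""
    cleaned = []
    for c, nm in zip(dft(frame), noise_mag):
        mag = max(abs(c) - nm, 0.0)  # assumed flooring at zero
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    return idft(cleaned)


# Assumed example: a stationary noise tone in bin 1 and a "voice" tone in bin 2.
N = 8
noise = [0.2 * math.cos(2 * math.pi * 1 * t / N) for t in range(N)]
voice = [1.0 * math.cos(2 * math.pi * 2 * t / N) for t in range(N)]
noisy = [v + w for v, w in zip(voice, noise)]

noise_mag = estimate_noise_magnitude([noise])
cleaned = spectral_subtract(noisy, noise_mag)
```

In this idealized example the noise occupies a different frequency bin from the voice and is perfectly stationary, so the subtraction recovers the voice tone almost exactly. With real signals the estimate is only approximate.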
However, the above-mentioned conventional processing of detecting an intended signal section and a noise signal section has the following problems.
First, in the processing of detecting an intended signal section and a noise signal section based on a change in a power of an input voice signal, if the level of a noise signal is close to that of an intended signal, it is difficult to detect the intended signal and the noise signal correctly.
FIG. 13 illustrates a system for suppressing a noise by the conventional processing of detecting a signal section based on a power of an input signal and the conventional processing of suppressing a noise based on spectrum subtraction. In particular, the case where a signal to be dealt with is a voice signal will be described.
Reference numeral 510 denotes a microphone. Reference numeral 520 denotes a power-based signal section detecting part for conducting conventional detection processing by comparing the power of an input signal with a predetermined threshold value to separate an intended signal section from a noise signal section. Reference numeral 530 denotes a spectrum subtracting part for suppressing a noise signal by conventional spectrum subtraction.
It is assumed that a sound to be input to the microphone 510 contains a voice signal 501 of a speaker and a noise signal 502. It is also assumed that the noise signal 502 contains a non-stationary noise signal as well as a stationary noise signal. An input signal 503 to the microphone 510 contains the voice signal 501 with the noise signal 502 superimposed thereon, and is composed of signal sections (1), (4) and (6) (containing a stationary noise), signal sections (2) and (5) (containing a non-stationary noise and a stationary noise), and a signal section (3) (containing a voice and a stationary noise).
The power-based signal section detecting part 520 receives the above-mentioned input signal 503 to conduct the processing of detecting a signal section based on the power of the input signal, thereby obtaining a signal section detection result 504. The power-based signal section detecting part 520 determines the signal sections (1), (4) and (6), which have a power below the threshold value, to be noise signal sections, and determines the signal sections (2), (3) and (5), which have a power exceeding the threshold value, to be voice sections.
However, the signal sections (2) and (5) are in fact non-stationary noise signal sections, and hence the signal sections are not detected correctly.
As described above, according to the conventional processing of detecting a signal section based on a power of an input signal, a non-stationary noise signal section at a similar level to that of a voice signal may be erroneously determined to be a voice signal section, and a signal section may not be detected correctly. Furthermore, when a noise source is a voice of another person, even if a feature value other than a power such as a correlation function is used, the voice of another person that is a noise may be erroneously determined to be an intended voice.
Furthermore, according to the noise suppression result 505 obtained by the spectrum subtracting part 530, in the stationary noise signal sections (1), (4) and (6) and the voice signal section (3), a noise signal component is suppressed correctly and effectively due to the removal of a stationary noise. However, since the non-stationary noise signal sections (2) and (5) are erroneously determined to be voice signal sections in the signal section detection result 504, only a stationary noise signal component is removed in these sections, and most of the non-stationary noise signal components remain.
Thus, according to the conventional processing of detecting a signal section based on a power of an input signal, a non-stationary noise signal section may be erroneously detected as a voice signal section. Therefore, the processing of detecting a signal section cannot be conducted correctly. Furthermore, regarding the suppression of a noise signal, a non-stationary noise signal component cannot be suppressed.
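The erroneous detection described with reference to FIG. 13 can be reproduced with the following sketch. The amplitudes assigned to the signal sections (1) to (6), the threshold value, and all names are assumptions chosen only to illustrate the case in which a non-stationary noise has a level similar to that of a voice; for brevity, each section is labeled by its dominant component.

```python
# Hypothetical amplitudes for the six signal sections of FIG. 13.
SECTIONS = [
    (1, "stationary noise", 0.05),
    (2, "non-stationary noise", 0.60),
    (3, "voice", 0.70),
    (4, "stationary noise", 0.05),
    (5, "non-stationary noise", 0.55),
    (6, "stationary noise", 0.05),
]
THRESHOLD = 0.1  # assumed power threshold


def section_power(amplitude, length=8):
    """Mean squared amplitude of an alternating +/- test waveform."""
    samples = [amplitude if n % 2 == 0 else -amplitude for n in range(length)]
    return sum(x * x for x in samples) / length


# Sections that the power-based detector would report as voice sections.
detected_as_voice = [num for num, _, amp in SECTIONS
                     if section_power(amp) > THRESHOLD]
# Sections that actually contain the voice.
actual_voice = [num for num, kind, _ in SECTIONS if kind == "voice"]
# Noise sections erroneously reported as voice sections.
false_detections = [num for num in detected_as_voice if num not in actual_voice]

print(detected_as_voice, false_detections)
```

Under these assumed values, the sections (2) and (5) exceed the threshold together with the section (3), so a detector relying on power alone cannot distinguish them from the voice section.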
Second, in the conventional processing of separating an intended signal section from a noise signal section based on an arrival direction of an input signal, if a noise source is present in the same direction as that of a wave source transmitting an intended sound, it is difficult to separate an intended signal from a noise signal correctly. That is, there is a possibility that a signal section detected as an intended signal section may contain a noise signal section.
Furthermore, regarding a signal section detected as a noise signal section, it is impossible to determine if the signal section is a stationary noise signal section or a non-stationary noise signal section.
FIG. 14 illustrates a system for suppressing a noise by the conventional processing of detecting a signal section based on an arrival direction of an input signal and the conventional processing of suppressing a noise based on spectrum subtraction.
A microphone 510 and a spectrum subtracting part 530 are the same as those in FIG. 13.
Reference numeral 540 denotes an arrival direction detecting part for detecting an arrival direction of an input signal and separating an intended signal section from a noise signal section based on the arrival direction. It is assumed that the processing of detecting an arrival direction is conducted by detecting a delay time using a correlation function.
It is assumed that a sound input to the microphone 510 contains a voice signal 501 and a noise signal 502 in the same way as in FIG. 13. It is also assumed that the noise signal 502 contains a stationary noise mixed with a non-stationary noise. A speaker and a noise source are present in different directions as seen from the sensor. An input signal 503 to the microphone 510 contains the voice signal 501 with the noise signal 502 superimposed thereon, and is composed of signal sections (1), (4) and (6) (containing a stationary noise), signal sections (2) and (5) (containing a non-stationary noise and a stationary noise), and a signal section (3) (containing a voice and a stationary noise).
The arrival direction detecting part 540 receives the above-mentioned input signal 503 to conduct the processing of detecting a signal section based on an arrival direction of the input signal, and obtains a signal section detection result 506. The arrival direction detecting part 540 determines only the section (3), in which the previously set arrival direction (direction of a speaker) of an intended sound is matched with the arrival direction of an input signal, as a voice section, and determines the other sections (1), (2), (4), (5) and (6) as noise signal sections.
However, the arrival direction detecting part 540 alone cannot determine whether the noise signal sections (1), (2), (4), (5) and (6) are stationary noise signal sections or non-stationary noise signal sections.
In the noise suppression by the spectrum subtracting part 530, only a stationary noise is estimated and suppressed by spectrum subtraction. In the case of processing of detecting a signal section based on an arrival direction of an input signal, it cannot be determined whether a detected noise signal section is a stationary noise signal section or a non-stationary noise signal section. Therefore, a noise model is estimated based on all of the noise signal sections (1), (2), (4), (5) and (6). Because of this, a noise model is estimated even in the non-stationary noise signal section (2) immediately before the voice signal section (3). As a result, the estimated noise spectrum includes a non-stationary noise component that is not actually present in the voice signal section (3), and subtracting this noise spectrum from the input spectrum distorts the signal in the voice signal section (3).
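The distortion described above can be illustrated numerically with the following sketch. It assumes, as a simplification, that magnitude spectra add bin by bin and that the noise model is a plain average over the noise signal sections used. The four-bin spectra and all names are hypothetical values chosen for illustration.

```python
def average(spectra):
    """Per-bin average of a list of magnitude spectra (the assumed noise model)."""
    return [sum(col) / len(spectra) for col in zip(*spectra)]


def subtract(spectrum, noise_model):
    """Per-bin spectrum subtraction, floored at zero."""
    return [max(s - n, 0.0) for s, n in zip(spectrum, noise_model)]


# Hypothetical four-bin magnitude spectra.
stationary = [1.0, 1.0, 1.0, 1.0]   # stationary noise, present in every section
burst = [0.0, 3.0, 0.0, 0.0]        # non-stationary noise, present only in section (2)
voice = [0.0, 4.0, 2.0, 0.0]        # voice, present only in section (3)

section_1 = stationary
section_2 = [s + b for s, b in zip(stationary, burst)]
section_3 = [s + v for s, v in zip(stationary, voice)]

# Noise model built only from the stationary noise signal section (1).
clean_model = average([section_1])
# Noise model that also uses the non-stationary noise signal section (2).
contaminated_model = average([section_1, section_2])

clean_result = subtract(section_3, clean_model)
contaminated_result = subtract(section_3, contaminated_model)

print(clean_result)         # [0.0, 4.0, 2.0, 0.0]
print(contaminated_result)  # [0.0, 2.5, 2.0, 0.0]
```

With the model built only from the stationary section, the voice spectrum is recovered. With the model that also includes the section (2), the second bin of the voice is reduced from 4.0 to 2.5, which corresponds to the distortion in the voice signal section (3).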