1. Field of the Invention
The present invention relates to a signal extraction system for extracting a necessary signal component from an inputted signal including a plurality of signal components, and further relates to a speech restoration system and speech restoration method for restoring or reproducing a speech from a noise superimposed speech using the signal extraction system. This invention also relates to a learning method for a neural network model, a constructing method of a neural network model, and a signal processing system.
2. Description of the Prior Art
As such a kind of signal extraction system, there has been known a system using a spectral subtraction method (referred to hereinafter as an SS method). For example, a technique based on this SS method has been disclosed in the paper "Suppression of Acoustic Noise in Speech Using Spectral Subtraction" (referred to hereinafter as document 1), reported in IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-27, NO. 2, APRIL 1979. This technique accepts as an input signal a time-domain signal (taking time on the horizontal axis) produced by the introduction of noises into a speech, and extracts a speech signal from this input signal; it has frequently been employed as a preliminary treatment for noise countermeasures taken in speech recognition. A brief description will be made hereinbelow of the SS method used in this technique.
That is, this SS method involves processes conducted as follows.
(1) First of all, after the observation of a noise signal, a finite-length interval of this noise signal undergoes the Fourier transform to provide a Fourier spectrum N(w), where w represents a frequency. A memory stores and retains the amplitude value |N(w)| of the Fourier spectrum N(w).
(2) Secondly, a finite-length interval of a speech signal including noises experiences the Fourier transform to provide a Fourier spectrum I(w), where w signifies a frequency.
(3) Subsequently, the amplitude value |N(w)| of the Fourier spectrum N(w) of the noise signal is subtracted from the amplitude value |I(w)| of the Fourier spectrum I(w) of the noise-included speech signal according to the following equation to produce an amplitude value |I'(w)|. In this case, a portion where the subtraction result becomes negative is replaced with a small positive constant.

|I'(w)| = |I(w)| - |N(w)|
(4) Furthermore, the phase value of the Fourier spectrum I(w) is added to the produced amplitude value |I'(w)| to produce a Fourier spectrum I'(w) according to the following equation.

I'(w) = |I'(w)| · (I(w)/|I(w)|)
(5) Then, the inverse Fourier transform of the produced Fourier spectrum I'(w) is performed, and the result is outputted as a speech signal in which noises are suppressed in the corresponding interval.
(6) Finally, a speech signal (noise-suppressed speech signal) is extracted from the input signal comprising a speech and noises introduced thereinto in a manner that the aforesaid processes from (2) to (5) are repeatedly conducted along the time axis.
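The processes (1) to (6) above can be sketched in a short program. The following is a minimal illustration using NumPy's FFT routines; the frame length, the small positive constant used for negative portions, and all function names are assumptions for illustration, not details taken from document 1.

```python
import numpy as np

def spectral_subtraction(frame, noise_mag, floor=1e-3):
    """One frame of the SS method: subtract the stored noise amplitude
    spectrum |N(w)| from |I(w)| and keep the phase of the noisy frame."""
    I = np.fft.rfft(frame)                     # Fourier spectrum I(w) of the interval
    mag = np.abs(I) - noise_mag                # |I'(w)| = |I(w)| - |N(w)|
    mag = np.maximum(mag, floor)               # negative portions -> small positive constant
    phase = I / np.maximum(np.abs(I), 1e-12)   # phase term I(w)/|I(w)|
    return np.fft.irfft(phase * mag, n=len(frame))  # inverse transform, step (5)

def denoise(signal, noise_frame, frame_len=256):
    """Apply steps (2)-(5) repeatedly along the time axis (step (6)).
    noise_frame must have length frame_len so the spectra line up."""
    noise_mag = np.abs(np.fft.rfft(noise_frame))  # step (1): store |N(w)|
    out = np.zeros(len(signal))
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        out[start:start + frame_len] = spectral_subtraction(
            signal[start:start + frame_len], noise_mag)
    return out
```

A practical implementation would use overlapping windows; non-overlapping frames are used here only to keep the sketch short.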
There is a problem with the above-mentioned SS method, however, in that, because the speech signal is extracted by subtraction of the amplitude value of the noise Fourier spectrum, in cases where the noise Fourier spectrum greatly overlaps with the voice Fourier spectrum, much of the voice Fourier spectrum is removed as well, making it difficult to extract the speech signal. Besides, for the same reason, even if the speech signal is extracted, it may lack some of the original speech information.
In addition, although for the production of the Fourier spectrum I'(w) of the speech signal the phase value (I(w)/|I(w)|) of the Fourier spectrum I(w) is added to the amplitude value |I'(w)| resulting from the subtraction of the amplitude value of the noise Fourier spectrum from the amplitude value |I(w)| of the Fourier spectrum I(w), this phase value is the phase of a signal in which noises are superimposed on a speech, and hence the Fourier spectrum I'(w) of the speech signal includes the phases of the noises. In other words, the phase information of the original speech signal is difficult to restore.
Furthermore, when a speech is extracted from an inputted noise superimposed speech in accordance with the aforesaid SS method, a problem still remains in that difficulty is encountered in removing unsteady or transient noises. For the elimination of this problem, a noise removal system using a neural network model has been disclosed in Japanese Examined Patent Publication No. 5-19337, where a neural network estimates a speech included in an inputted noise superimposed speech and outputs a voice estimation value to be used for the restoration of the speech. In this system, a hierarchical or layered neural network is used as the neural network, and it estimates the speech through learning and outputs the voice estimation value.
An operation of this layered neural network will be described hereinbelow with reference to FIG. 36. As shown in FIG. 36, data is taken out by a length corresponding to a speech extraction interval T from a noise superimposed speech A1 and is given as input signals A2 (more specifically, input values I1, I2, . . . , Ip-1, Ip) to a learning-finished layered neural network 1. The layered neural network 1 then extracts the speech included in the input signals A2 and outputs it as output signals A3 (more specifically, output values S1, S2, . . . , Sp-1, Sp). Further, the layered neural network 1 repeatedly performs this operation to successively issue the output signals A3, thus finally outputting a speech (a voice estimation value) A4.
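The sliding-window operation described above — taking out successive intervals of length T from A1, applying the learned network, and concatenating its outputs into A4 — can be sketched as follows. The function name and the placeholder network passed in the usage are hypothetical, for illustration only.

```python
import numpy as np

def restore_speech(noisy_speech, network, extraction_interval):
    """Slide a window of length T along the noise superimposed speech A1,
    feed each window to the learning-finished network as input values
    I1..Ip (signals A2), and concatenate the resulting output values
    S1..Sp (signals A3) into the voice estimation value A4."""
    T = extraction_interval
    outputs = []
    for start in range(0, len(noisy_speech) - T + 1, T):
        window = noisy_speech[start:start + T]   # input signals A2
        outputs.append(network(window))          # output signals A3
    return np.concatenate(outputs)               # voice estimation value A4
```

For example, `restore_speech(x, net, 256)` with any callable `net` mapping a length-256 array to a length-256 array reproduces the repeat-and-concatenate behavior of FIG. 36.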
In addition, another example of a noise removal system has been disclosed in Japanese Unexamined Patent Publication No. 2-72398, the technique of which is such that a plurality of microphone signals are produced through a plurality of microphones and inputted into a hierarchical neural network which, in turn, issues a noise-removed voice estimation value as an output signal through learning.
There is a problem with such noise removal systems based on a neural network, however, in that high-frequency components are lacking in the outputted voice estimation value. Particularly, in the case of restoring a speech with many consonants, which constitute high-frequency components, this deficiency tends to occur remarkably. For this reason, the consonants are missing in the voice estimation value outputted from a noise removal system using a neural network, and hence the speech reproduced from the voice estimation value becomes unclear and hard to hear as compared with the original speech. An actual example of the lack of the high-frequency components will be described in detail with reference to FIGS. 37A and 37B.
FIG. 37A shows a waveform of the original speech developed when a male speaker says "Suichoku" (=vertical in English), while FIG. 37B illustrates a waveform of a voice estimation value outputted from a noise removal system using the neural network in the case that a noise superimposed speech produced by superimposing a noise on the original speech is inputted in the noise removal system. As obvious from FIGS. 37A and 37B, the consonants "s", "ch" and "k" are missing in the waveform of the voice estimation value, besides the high-frequency components of the voice portion "ui" are also lacking therein. Thus, a listener may take such a voice estimation value (see FIG. 37B) for "uiyoku".
Moreover, as described above, the SS method, being the noise suppression method taken in order to realize speech recognition free of the influence of environmental noises or speech communication in a noisy environment, encounters difficulty in removing unsteady noises; for the elimination of this problem there has been known a noise suppressing method (for example, Japanese Examined Patent Publication No. 5-19337 and Japanese Unexamined Patent Publication No. 2-72398) using a neural network model modeled on a human brain. In the noise removing system using a neural network model disclosed in Japanese Examined Patent Publication No. 5-19337, a layered neural network model learns to extract and output an aural signal from a noise superimposed speech and, after the learning, removes noises from an input signal. FIG. 38 shows a structure in a learning mode in Japanese Examined Patent Publication No. 5-19337. For the input to a layered neural network model 2000, a noise superimposed speech is taken out by a length corresponding to a speech extraction interval, and input signals I1, I2, I3, . . . , Ip produced by sampling the waveform within that interval at a sampling frequency are inputted to an input layer 2001. Further, teacher signals T1, T2, . . . , Tp to be compared with output signals S1, S2, . . . , Sp outputted from an output layer 2003 due to the input are signals attained in such a manner that the aural signal included in the input signals is sampled at a sampling frequency. The connection weights between the units (indicated by circles) constructing the layered neural network model 2000 are updated on the basis of the comparison between the output signals and the teacher signals so that the model 2000 learns. In fact, for the learning, the parameters of the multipliers are adjusted so as to sufficiently reduce the square error between the output signals and the teacher signals.
After the completion of the learning, a noise suppression mode, i.e., an execution mode, is entered by a switching operation, so that the actual noise superimposed speech is inputted to the layered neural network model 2000 and the output signals are D/A-converted to restore the aural signal. That is, the neural network model 2000 is required to output an output signal to external units at a determined sampling frequency. The sampling frequency necessary for the output signal will be referred to hereinafter as a requirements sampling frequency f0. In the above-mentioned prior art, the sampling frequencies of the teacher signal and the output signal are equal to the requirements sampling frequency f0. In general, in the case of a noise suppression neural network model directly receiving a noise superimposed speech waveform, the sampling frequency of the standard input signal is also made equal to the requirements sampling frequency f0. Also in the other prior art, such as Japanese Unexamined Patent Publication No. 2-72398, the sampling frequencies of the teacher signal and the output signal are set to be equal to the requirements sampling frequency f0. Thus, in the prior noise suppression methods using a neural network model, the teacher signal is a speech sampled at a sampling frequency equal to the requirements sampling frequency f0.
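The learning mode of FIG. 38 — comparing the output signals S1..Sp with the teacher signals T1..Tp and updating the connection weights to reduce the square error — can be sketched as follows. This is a minimal one-hidden-layer model trained by gradient descent; the layer sizes, learning rate, and tanh activation are assumptions for illustration, not details of the publication.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 32                   # number of sampled values per extraction interval (illustrative)
hidden = 16              # hidden-layer size (illustrative)
W1 = 0.1 * rng.standard_normal((hidden, p))  # input-to-hidden connection weights
W2 = 0.1 * rng.standard_normal((p, hidden))  # hidden-to-output connection weights
lr = 0.01                # learning rate (illustrative)

def forward(I):
    """Input values I1..Ip -> hidden units -> output signals S1..Sp."""
    h = np.tanh(W1 @ I)
    return W2 @ h, h

def learn_step(I, T):
    """One learning step: update the connection weights by the gradient of
    the square error 0.5*||S - T||^2 between outputs and teacher signals."""
    global W1, W2
    S, h = forward(I)
    err = S - T                                       # output/teacher comparison
    W2 -= lr * np.outer(err, h)                       # dLoss/dW2
    W1 -= lr * np.outer((W2.T @ err) * (1 - h**2), I) # dLoss/dW1 (tanh')
    return 0.5 * np.sum(err**2)
```

Repeatedly calling `learn_step` over noise-superimposed input intervals and their clean teacher intervals drives the square error down, which is the learning behavior the publication relies on.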
However, as shown in FIG. 39A, a neural network model 3000 needs to realize a map Ti=Si (i=1, 2, . . . , p) from the standard input signals I1, I2, I3, . . . , Ip comprising p sampled values to the teacher signals T1, T2, . . . , Tp comprising p sampled values. For this reason, the neural network model is required to estimate the desirable output waveform represented by the teacher signals T1, T2, . . . , Tp. In cases where, as shown in FIG. 39B, the desirable output waveform is chiefly composed of a low-frequency component and the output waveform varies slowly with respect to the sampling frequency, the estimation of the desirable output waveform is easy, and the output waveform estimated by the neural network model substantially coincides with the desirable output waveform.
On the other hand, in cases where, as shown in FIG. 39C, the desirable output waveform includes a large high-frequency component, in other words, where the waveform has a complicated configuration, the estimation of the desirable output waveform becomes difficult, which makes the learning of the neural network model difficult. To put it concretely, the high-frequency component included in the output waveform estimated by the neural network model cannot follow the high-frequency component of the desirable output waveform, with the result that the high-frequency component tends to be missing.
In the case of using a neural network, a number of neural networks whose learning has been completed correctly are prepared, and the one which exhibits the best performance is selected therefrom and put to use. However, in cases where, as in the above-mentioned example, the learning is difficult, such a procedure cannot be taken, which greatly hinders the application of the neural network.
For obtaining a neural network model which can easily conduct learning and which can estimate the desirable waveform, the sampling frequency may be heightened, that is, the number of samples may be increased. In this case, it is necessary to increase the number of units of at least the input layer 2001 and the output layer 2003. The increase in the number of units causes an increase in the memory of a system which finally employs the neural network model. In addition, the calculation amount corresponding to the connection weights between the large number of units increases exceedingly, which requires a high-speed processing circuit. For these reasons, the system incorporating the neural network model becomes extremely high in cost.