As a technique for removing noise components from a speech signal input through a microphone, signal processing using an adaptive microphone array, which combines a plurality of microphones with an adaptive filter, has heretofore been known.
The following documents are considered herein:
[Patent document 1] Japanese Unexamined Patent Publication No. 2003-280686
[Non-patent document 1] L. J. Griffiths and C. W. Jim, “An alternative approach to linearly constrained adaptive beamforming,” IEEE Trans. AP, vol. 30, no. 1, pp. 27-34, January 1982
[Non-patent document 2] Y. Kaneda and J. Ohga, “Adaptive microphone-array system for noise reduction,” IEEE Trans. ASSP, vol. 34, no. 6, pp. 1391-1400, December 1986
[Non-patent document 3] Nagata, Fujioka, and Abe, “Study of speaker-tracking two-channel microphone array using SS control based on speaker direction,” Collected papers for Autumn Conference of Acoustic Society of Japan, 1999, pp. 477-478
As major adaptive microphone arrays, a Griffiths-Jim array (refer to non-patent document 1), an adaptive microphone array for noise reduction (AMNOR; refer to non-patent document 2), and the like have heretofore been known. In each of these, a signal in a noise interval of an observed signal is used to design an adaptive filter. Further, a technique has also been known in which a Griffiths-Jim array is realized in the frequency domain and in which the accuracy of detecting speech intervals and noise intervals is improved (refer to non-patent document 3).
In such adaptive microphone array processing, noise reduction performance can generally be improved by increasing the number of microphones used. On the other hand, in information terminal devices and the like, including personal computers, the number of microphones that can be used for speech input is limited by constraints of cost and hardware. With the technique of the above-described non-patent document 3, noise-resistant adaptive microphone array processing can be realized by spectral subtraction using a two-channel microphone array.
FIG. 8 is a block diagram showing a conventional speech enhancement system using a two-channel beamformer. This system has two microphones 81a and 81b for converting acoustic signals into electric signals, an adder 82a for adding the input signals from the microphones 81a and 81b, an adder 82b for adding the input signal from the microphone 81b to the input signal from the microphone 81a after inverting the input signal from the microphone 81b, fast Fourier transformers 83a and 83b for performing fast Fourier transformation on the output signals from the adders 82a and 82b using a predetermined frame length and frame period, an adaptive filter 84 provided on the output side of the fast Fourier transformer 83b, and an adder 85 for adding the output signal from the adaptive filter 84 to the output signal of the fast Fourier transformer 83a after inverting the output signal from the adaptive filter 84.
In the case where a target speech source 1s emitting target speech to be enhanced is located equidistant from the microphones 81a and 81b in the front direction and where a noise source 1n is located in another direction, the respective input signals m1(t) and m2(t) from the microphones 81a and 81b at time t can be represented by Equation 1:

m1(t) = s(t) + n(t), m2(t) = s(t) + n(t−d)   [Equation 1]

where s(t) denotes a target speech signal which includes components based on the target speech, n(t) and n(t−d) denote noise signals which include components based on noise from the noise source 1n, and d denotes a delay time caused by the fact that the respective distances from the noise source 1n to the microphones 81a and 81b are different from each other.
At this time, adding the inverted input signal m2(t) to the input signal m1(t) using the adder 82b means that the input signals m1(t) and m2(t) are added together in opposite phase. Accordingly, the target speech signals s(t) cancel each other out, and there remain only components having a correlation with the noise from the noise source 1n. When these components are referred to as a reference input r(t), the reference input r(t) can be represented by the following equation:

r(t) = m1(t) − m2(t) = n(t) − n(t−d)   [Equation 2]
On the other hand, when a signal obtained by adding the input signals m1(t) and m2(t) together using the adder 82a, scaled by ½, is referred to as a main input p(t), the main input p(t) can be represented by the following equation:

p(t) = ½(m1(t) + m2(t)) = s(t) + ½(n(t) + n(t−d))   [Equation 3]
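The sum-and-difference step of Equations 1 to 3 can be illustrated with a minimal sketch. It assumes discrete-time signals held in Python lists and an integer sample delay d; the helper names `delay` and `sum_and_difference` and the toy sample values are hypothetical and do not appear in the cited documents.

```python
def delay(signal, d):
    """Return the signal delayed by d samples, zero-padded at the start."""
    return [0.0] * d + signal[:len(signal) - d]


def sum_and_difference(m1, m2):
    """Form the main input p (Equation 3) and the reference input r (Equation 2)."""
    p = [0.5 * (a + b) for a, b in zip(m1, m2)]
    r = [a - b for a, b in zip(m1, m2)]
    return p, r


# Hypothetical toy data: s is the target speech, n the noise, d the delay in samples.
s = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0]
n = [0.3, 0.1, -0.4, 0.2, -0.1, 0.6]
d = 1
n_delayed = delay(n, d)

# Equation 1: both microphones receive the same s(t); the noise differs by the delay d.
m1 = [a + b for a, b in zip(s, n)]
m2 = [a + b for a, b in zip(s, n_delayed)]

p, r = sum_and_difference(m1, m2)
```

In this sketch r contains only the noise difference n(t) − n(t−d), because the identical speech components cancel, while p retains s(t) together with the averaged noise.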
Accordingly, an output signal y in which the noise signals are reduced and in which the target speech signal is enhanced can be obtained in the frequency domain by applying the adaptive filter 84, with its filter coefficient adjusted, to the reference input and subtracting the filtered reference input from the main input using the adder 85. The output signal y(ω; n) at a frequency ω for a frame number n is given by the following equation:

y(ω; n) = p(ω; n) − w(ω) r(ω; n)   [Equation 4]
Here, w(ω) denotes the filter coefficient of the adaptive filter 84 at the frequency ω, and p(ω; n) denotes the main input at the frequency ω for the frame number n. The expression r(ω; n) denotes the reference input at the frequency ω for the frame number n, and the amplitude of r(ω; n) is adjusted using the filter coefficient w(ω).
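The per-frame, per-frequency subtraction of Equation 4 can be sketched as follows. This is only an illustration under stated assumptions: a naive O(N²) discrete Fourier transform stands in for the fast Fourier transformers 83a and 83b, no analysis window is applied, and the frame length and frame period are given in samples. The function names `frames`, `dft`, and `enhance` and the sample values are hypothetical.

```python
import cmath


def frames(signal, frame_length, frame_period):
    """Split a signal into frames of frame_length samples, advancing by frame_period."""
    last_start = len(signal) - frame_length
    return [signal[i:i + frame_length] for i in range(0, last_start + 1, frame_period)]


def dft(frame):
    """Naive discrete Fourier transform of one frame (stand-in for an FFT)."""
    size = len(frame)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / size) for t, x in enumerate(frame))
        for k in range(size)
    ]


def enhance(p, r, w, frame_length, frame_period):
    """Return y(omega; n) = p(omega; n) - w(omega) * r(omega; n) for every frame n."""
    output = []
    p_frames = frames(p, frame_length, frame_period)
    r_frames = frames(r, frame_length, frame_period)
    for p_frame, r_frame in zip(p_frames, r_frames):
        p_spec, r_spec = dft(p_frame), dft(r_frame)
        output.append([pk - wk * rk for pk, wk, rk in zip(p_spec, w, r_spec)])
    return output


# Hypothetical toy data: main input, reference input, and one coefficient per bin.
p = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
w = [1.0, 1.0, 1.0, 1.0]
y = enhance(p, r, w, frame_length=4, frame_period=2)
```

Here y[n][k] holds the output for frame n and frequency bin k; a single coefficient w[k] per bin is shared by all frames, matching the frame-independent w(ω) of Equation 4.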
The filter coefficient w(ω) is adjusted using the input signals m1(t) and m2(t) in a noise interval so that the square of an error e, represented by the following equation, is minimized:

e = p(ω; n) − w(ω) r(ω; n)   [Equation 5]

Here, the noise interval means a time interval in which the input signal consists only of noise, whereas a time interval in which the target speech signal s(t) is contained in the input signal is referred to as a speech occurrence interval.
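One way to minimize the squared error of Equation 5 is sketched below. The cited documents do not state the minimization procedure here, so this sketch assumes a batch, closed-form least-squares solution over a set of noise-interval frames, in which each frequency bin is solved independently as w(ω) = Σ p(ω; n) r*(ω; n) / Σ |r(ω; n)|². An iterative adaptive update could be used instead. The function name `estimate_filter`, the `eps` guard, and the sample spectra are hypothetical.

```python
def estimate_filter(p_frames, r_frames, eps=1e-12):
    """Per-bin least-squares w minimizing sum over frames of |p - w * r|^2.

    p_frames and r_frames are lists of spectra (one list of complex bins per
    frame), taken from noise intervals only.
    """
    n_bins = len(p_frames[0])
    w = []
    for k in range(n_bins):
        numerator = sum(p[k] * r[k].conjugate() for p, r in zip(p_frames, r_frames))
        denominator = sum(abs(r[k]) ** 2 for r in r_frames)
        # Leave the bin unfiltered when the reference carries no energy there.
        w.append(numerator / denominator if denominator > eps else 0j)
    return w


# Hypothetical noise-interval spectra: two frames, two frequency bins.
r_frames = [[1 + 1j, 2 + 0j], [0 - 1j, 1 + 1j]]
true_w = [0.5 - 0.25j, -1 + 2j]
p_frames = [[wk * rk for wk, rk in zip(true_w, r)] for r in r_frames]

w_est = estimate_filter(p_frames, r_frames)

# Residual error e of Equation 5 for every frame and bin.
residual = [
    [pk - wk * rk for pk, wk, rk in zip(p, w_est, r)]
    for p, r in zip(p_frames, r_frames)
]
```

Because the toy main input is constructed as an exact filtered copy of the reference, the estimate recovers true_w and the residual vanishes. If speech components were present in p_frames, they would enter the numerator and perturb the estimate, which is why only noise-interval frames are used.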
The reason for using input signals in the noise interval is that the learning of the filter coefficient is inhibited if components of the target speech signal are contained in the main input p(ω; n). Consequently, it is difficult to estimate the filter coefficient w(ω) for removing extemporaneous noise which continues only for a short time, which exists only in the speech occurrence interval, and which is completely superimposed on the target speech signal. As a result, in speech recognition for transcribing a lecture or a meeting, speech recognition in a car, or the like, extemporaneous noise, such as an impact sound, the sound of paper being handled when a page is turned, or the sound of a door closing, is one cause of deteriorated recognition accuracy.
On the other hand, as a speech recognition method for use in the presence of extemporaneous noise, a technique has been proposed in which a feature of input speech is matched against a composite model constituted by a phoneme hidden Markov model of speech data and a hidden Markov model of noise data and in which the input speech is recognized based on the result (refer to patent document 1). In this technique, the type of the target extemporaneous noise must be known in advance. However, in some cases, it may be difficult to forecast and model the types of noise which can occur, because various types of noise exist in an actual environment.
As described above, the Griffiths-Jim type is effective for adaptive microphone array processing using a two-channel microphone array. In this type, the adaptive filter is designed by determining the filter coefficient based on the input signal in the noise interval so as to minimize the power of the noise components. However, when this processing is actually applied to speech recognition, various extemporaneous noises interfere with the recognition. An extemporaneous noise may not occur in any noise interval. In other words, there may be a case where the extemporaneous noise components of the input signal are present only in the speech interval. In that case, the conventional Griffiths-Jim type array processing, in which the filter coefficient is determined based on the signal in the noise interval, cannot deal with the extemporaneous noise.
Meanwhile, according to the speech recognition technique of matching the feature of the input signal against the composite model of the hidden Markov models for speech and for noise, the type of extemporaneous noise which is likely to occur must be forecast and modeled in advance. Therefore, this technique cannot deal with unknown extemporaneous noises.