Field of the Invention
The present invention relates to a microphone array apparatus which has an array of microphones in order to detect the position of a sound source, emphasize a target sound and suppress noise.
The microphone array apparatus has an array of a plurality of omnidirectional microphones and equivalently defines a directivity by emphasizing a target sound and suppressing noise. Further, the microphone array apparatus is capable of detecting the position of a sound source on the basis of a relationship among the phases of output signals of the microphones. Hence, the microphone array apparatus can be applied to a video conference system in which a video camera is automatically oriented towards a speaker and a speech signal and a video signal can concurrently be transmitted. In addition, the speech of the speaker can be clarified by suppressing ambient noise. The speech of the speaker can be emphasized by aligning the phases of the speech components and adding them. It is now required that the microphone array apparatus operate stably.
If the microphone array apparatus is directed to suppressing noise, filters are connected to respective microphones and filter coefficients are adaptively or fixedly set so as to minimize noise components (see, for example, Japanese Laid-Open Patent Application No. 5-111090). If the microphone array apparatus is directed to detecting the position of a sound source, the relationship among the phases of the output signals of the microphones is detected, and the distance to the sound source is detected (see, for example, Japanese Laid-Open Patent Application Nos. 63-177087 and 4-236385).
An echo canceller is known as a device which utilizes the noise suppressing technique. For example, as shown in FIG. 1, a transmit/receive interface 202 of a telephone set is connected to a network 203. An echo canceller 201 is connected between a microphone 204 and a speaker 205. The speech of a speaker (a person talking) is input to the microphone 204. The speech of a speaker on the other (remote) side is reproduced through the speaker 205 (a loudspeaker). Hence, two-way communication can take place.
Speech transferred from the speaker 205 to the microphone 204, as indicated by a dotted line shown in FIG. 1, forms an echo (noise) to the other-side telephone set. Hence, the echo canceller 201 is provided, which includes a subtracter 206, an echo component generator 207 and a coefficient calculator 208. Generally, the echo component generator 207 has a filter structure which produces an echo component from the signal which drives the speaker 205. The subtracter 206 subtracts the echo component from the signal from the microphone 204. The coefficient calculator 208 controls the echo component generator 207 to update the filter coefficients so that the residual signal from the subtracter 206 is minimized.
The filter coefficients c1, c2, ..., cr of the echo component generator 207 having the filter structure can be updated by the known steepest descent method. For example, the following evaluation function J is defined based on an output signal e (the residual signal from which the echo component has been subtracted) of the subtracter 206:

$$J = e^{2} \tag{1}$$
According to the above evaluation function, the filter coefficients c1, c2, . . . , cr are updated as follows: ##EQU1##
where $0.0 < \alpha < 0.5$ and

$$f_{norm} = \left( f(1)^{2} + f(2)^{2} + \cdots + f(r)^{2} \right)^{1/2} \tag{3}$$
In the above expressions, the symbol "*" denotes multiplication, and "r" denotes the filter order. Further, f(1), ..., f(r) respectively denote the values of the memory (delay units) of the filter (in other words, the output signals of the delay units, each of which delays its input signal by one sample). The symbol $f_{norm}$ is defined by equation (3), and $\alpha$ is a constant which determines the speed and precision of convergence of the filter coefficients towards the optimal values.
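The coefficient update described above can be sketched in a few lines. Because the exact form of equation (2) is not reproduced in this text, the sketch below substitutes the textbook normalized LMS update, which divides by the squared norm of the delay-line contents; the patent's own normalization may differ. The function name, the three-tap echo path and the test signal are hypothetical and chosen only for illustration.

```python
import random

def nlms_echo_canceller(far_end, mic, order, alpha=0.25, eps=1e-12):
    """Adapt coefficients c1..cr so that the residual e is driven towards zero.

    Assumption: textbook normalized LMS, not necessarily equation (2) of the source.
    """
    c = [0.0] * order   # filter coefficients c1..cr
    f = [0.0] * order   # delay-line contents f(1)..f(r)
    residual = []
    for x, d in zip(far_end, mic):
        f = [x] + f[:-1]                                   # shift in the newest loudspeaker sample
        echo_estimate = sum(ci * fi for ci, fi in zip(c, f))
        e = d - echo_estimate                              # output of the subtracter
        norm_sq = sum(fi * fi for fi in f) + eps           # squared f_norm, eps avoids division by zero
        c = [ci + alpha * e * fi / norm_sq for ci, fi in zip(c, f)]
        residual.append(e)
    return c, residual

random.seed(0)
echo_path = [0.5, -0.3, 0.1]   # hypothetical loudspeaker-to-microphone response
far_end = [random.uniform(-1.0, 1.0) for _ in range(2000)]
mic = [sum(h * far_end[k - t] for t, h in enumerate(echo_path) if k - t >= 0)
       for k in range(len(far_end))]
coeffs, residual = nlms_echo_canceller(far_end, mic, order=len(echo_path))
```

In this noise-free setting with no near-end speech, the coefficients settle on the hypothetical echo path and the residual decays towards zero. The step size alpha is kept inside the range 0.0 to 0.5 stated above.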
The echo canceller 201 requires a filter order as high as 100. Hence, another echo canceller using a microphone array as shown in FIG. 2 is known. There are provided an echo canceller 211, a transmit/receive interface 212, microphones 214-1 to 214-n forming a microphone array, a speaker 215, a subtracter 216, filters 217-1 to 217-n, and a filter coefficient calculator 218.
In the structure shown in FIG. 2, acoustic components from the speaker 215 to the microphones 214-1 to 214-n are propagated along routes indicated by broken lines and serve as echoes. Hence, the speaker 215 is a noise source. The updating control of the filter coefficients c11, c12, ..., c1r, ..., cn1, cn2, ..., cnr in the case where the speaker (the person talking) does not make any speech is expressed by using the evaluation function (1) as follows: ##EQU2##
where p=2, 3, . . . , n
The equation (4) relates to a case where one of the microphones 214-1 to 214-n, for example, the microphone 214-1, is defined as a reference microphone, and indicates the filter coefficients c11, c12, ..., c1r of the filter 217-1 which receives the output signal of the reference microphone 214-1. The equation (5) relates to the microphones 214-2 to 214-n other than the reference microphone, and indicates the filter coefficients c21, c22, ..., c2r, ..., cn1, cn2, ..., cnr. The subtracter 216 subtracts the output signals of the filters 217-2 to 217-n, which receive the signals of the microphones 214-2 to 214-n, from the output signal of the filter 217-1, which receives the signal of the reference microphone 214-1.
FIG. 3 is a block diagram for explaining a conventional process of detecting the position of a sound source and emphasizing a target sound. The structure shown in FIG. 3 includes a target sound emphasizing unit 221, a sound source position detecting unit 222, delay units 223 and 224, a number-of-delayed-samples calculator 225, an adder 226, a crosscorrelation coefficient calculator 227, a position detection processing unit 228 and microphones 229-1 and 229-2.
The target sound emphasizing unit 221 includes the delay units 223 and 224 of $Z^{-da}$ and $Z^{-db}$, the number-of-delayed-samples calculator 225 and the adder 226. The sound source position detecting unit 222 includes the crosscorrelation coefficient calculator 227 and the position detection processing unit 228. The number-of-delayed-samples calculator 225 is controlled as follows. The crosscorrelation coefficient calculator 227 of the sound source position detecting unit 222 obtains a crosscorrelation coefficient r(i) of output signals a(j) and b(j) of the microphones 229-1 and 229-2. The position detection processing unit 228 obtains the sound source position by referring to the value of i, denoted imax, at which the crosscorrelation coefficient r(i) takes its maximum.
The crosscorrelation coefficient r(i) is expressed as follows:

$$r(i) = \sum_{j=1}^{n} a(j) * b(j+i) \tag{6}$$
where the summation runs from j = 1 to j = n, and i satisfies $-m \le i \le m$. The symbol "m" is a value dependent on the distance between the microphones 229-1 and 229-2 and the sampling frequency, and is written as follows:

$$m = \frac{(\text{sampling frequency}) * (\text{intermicrophone distance})}{(\text{speed of sound})} \tag{7}$$
where n is the number of samples for a convolutional operation.
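Equations (6) and (7) can be illustrated with a small sketch. The function names, the 340 m/s speed of sound, the 8 kHz sampling frequency, the 0.2 m microphone spacing and the test signal are assumptions made for this example and do not come from the source. Indices are zero-based here, and terms whose index j + i falls outside the recorded samples are skipped.

```python
import math

def max_lag(sampling_hz, mic_distance_m, speed_of_sound=340.0):
    """Equation (7), rounded down to a whole number of samples (assumed rounding)."""
    return int(math.floor(sampling_hz * mic_distance_m / speed_of_sound))

def cross_correlation(a, b, m):
    """Equation (6) for every lag i in -m..m, returned as a dict {i: r(i)}."""
    return {
        i: sum(a[j] * b[j + i] for j in range(len(a)) if 0 <= j + i < len(b))
        for i in range(-m, m + 1)
    }

def peak_lag(r):
    """The lag imax at which r(i) takes its maximum."""
    return max(r, key=r.get)

m = max_lag(8000, 0.2)                      # hypothetical 8 kHz sampling, 0.2 m spacing
a = [0.0, 0.0, 0.0, 0.0, 1.0, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
b = [0.0] * 3 + a[:-3]                      # same sound reaching the second microphone 3 samples later
r = cross_correlation(a, b, m)
imax = peak_lag(r)
```

With these inputs the largest usable lag is 4 samples, and the peak of r(i) lies at the 3-sample delay that was built into the second signal.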
The number of delayed samples da of the $Z^{-da}$ delay unit 223 and the number of delayed samples db of the $Z^{-db}$ delay unit 224 can be obtained as follows from the value i = imax at which the crosscorrelation coefficient r(i) takes its maximum:
where i ≥ 0, da = i, db = 0;
where i < 0, da = 0, db = -i.
Hence, the phases of the target sound components in the two microphone signals are made to coincide with each other, and the components are added by the adder 226. The target sound can thus be emphasized.
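The delay selection and the addition performed by the adder 226 can be sketched as follows. The helper names and the sample values are hypothetical. The sketch assumes that the second case of the delay rule applies when imax is negative, since db must be a non-negative number of samples.

```python
def delays_from_peak(imax):
    """Return (da, db): delay the channel that leads so both channels line up."""
    return (imax, 0) if imax >= 0 else (0, -imax)

def delay(signal, d):
    """Delay a signal by d samples, padding with zeros and keeping the length."""
    return [0.0] * d + list(signal[:len(signal) - d])

def emphasize(a, b, imax):
    """Align the two microphone signals and add them, as the adder 226 does."""
    da, db = delays_from_peak(imax)
    return [x + y for x, y in zip(delay(a, da), delay(b, db))]

a = [0.0, 1.0, 2.0, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 1.0, 2.0, 0.0]   # the same sound arriving 2 samples later, so imax = 2
emphasized = emphasize(a, b, 2)
```

After alignment the two copies of the sound coincide sample by sample, so their sum has twice the amplitude of either input.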
However, the above-mentioned conventional microphone array apparatus has the following disadvantages.
In the conventional structure directed to suppressing noise, when the speaker (the person who is the target sound source) does not speak, the echo components from the speaker 215 to the microphone array can be canceled by the echo canceller. However, when the speech of that person and the sound reproduced by the speaker 215 are concurrently input to the microphone array, the updating of the filter coefficients for canceling the echo components (noise components) does not converge. That is, the residual signal e in the equations (4) and (5) corresponds to the sum of the components which cannot be suppressed by the subtracter 216 and the speech of the speaker. Hence, if the filter coefficients are updated so that the residual signal e is minimized, the speech of the speaker, which is the target sound, is suppressed along with the echo components (noise). Hence, the noise cannot be suppressed without also suppressing the target sound.
In the conventional structure directed to detecting the sound source position and emphasizing the target sound, the output signals a(j) and b(j) of the microphones 229-1 and 229-2 shown in FIG. 3 generally exhibit autocorrelation between neighboring sampled values. If the sound source is white noise or pulse noise, the autocorrelation is small, while the autocorrelation for voice is large. The crosscorrelation coefficient r(i) defined in the equation (6) varies less as a function of i for a signal having comparatively large autocorrelation than for a signal having comparatively small autocorrelation. Hence, it is very difficult to obtain the correct maximum value and to detect the position of the sound source precisely and rapidly.
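The effect of autocorrelation on the peak of r(i) can be seen in a small numerical sketch. The signals are hypothetical stand-ins: a single pulse represents a weakly autocorrelated source, and a slow sine tone (period 50 samples) represents a strongly autocorrelated one such as voiced speech. The sharpness measure, defined here as the relative gap between the largest and second-largest values of r(i), is also an assumption made for illustration.

```python
import math

def cross_correlation(a, b, m):
    """Equation (6) for lags -m..m, skipping terms that fall outside the record."""
    return [sum(a[j] * b[j + i] for j in range(len(a)) if 0 <= j + i < len(b))
            for i in range(-m, m + 1)]

def peak_sharpness(r):
    """Relative gap between the largest and the second-largest value of r(i)."""
    top, runner_up = sorted(r, reverse=True)[:2]
    return (top - runner_up) / top

n, true_delay, m = 100, 2, 4
# Pulse: essentially no correlation between neighboring samples.
pulse_a = [1.0 if j == 50 else 0.0 for j in range(n)]
pulse_b = [1.0 if j == 50 + true_delay else 0.0 for j in range(n)]
# Slow tone: neighboring samples are almost identical.
tone_a = [math.sin(2 * math.pi * j / 50) for j in range(n)]
tone_b = [math.sin(2 * math.pi * (j - true_delay) / 50) for j in range(n)]

sharp_pulse = peak_sharpness(cross_correlation(pulse_a, pulse_b, m))
sharp_tone = peak_sharpness(cross_correlation(tone_a, tone_b, m))
```

For the pulse, r(i) is nonzero only at the true delay, so the peak is unmistakable. For the tone, the neighboring lags are within a few percent of the maximum, so a small disturbance could move the detected peak to a wrong lag.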
In the conventional structure directed to emphasizing the target sound so that the phases of the target sounds are synchronized, the degree of emphasis depends on the number of microphones forming the microphone array. If there is a small crosscorrelation between the target sound and noise, the use of N microphones emphasizes the target sound so that the power ratio is as large as N times. If there is a large correlation between the target sound and noise, the power ratio is small. Hence, in order to emphasize the target sound which has a large crosscorrelation to the noise, it is required to use a large number of microphones. This leads to an increase in the size of the microphone array. It is also very difficult to identify, in a noisy environment, the position of the sound source by utilizing the crosscorrelation coefficient value of the equation (6).
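The factor of N quoted above can be reproduced with a short derivation. This is a sketch under assumptions not stated in the text: after phase alignment each of the N microphone signals is $x_k = s + n_k$, the noise terms $n_k$ have equal power $P_n$, and they are uncorrelated with the target sound $s$ (power $P_s$) and with each other.

$$
y = \sum_{k=1}^{N} x_k = N s + \sum_{k=1}^{N} n_k, \qquad
\frac{P_{s,\mathrm{out}}}{P_{n,\mathrm{out}}} = \frac{N^{2} P_s}{N P_n} = N \, \frac{P_s}{P_n}
$$

If instead the noise is fully coherent across the aligned channels, it also adds in amplitude, its output power becomes $N^{2} P_n$, and the ratio stays at $P_s / P_n$. This is consistent with the small improvement described for strongly correlated noise.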