Field of the Invention
The present invention relates to a microphone array apparatus which has an array of microphones in order to detect the position of a sound source, emphasize a target sound and suppress noise.
The microphone array apparatus has an array of a plurality of omnidirectional microphones and equivalently defines a directivity by emphasizing a target sound and suppressing noise. Further, the microphone array apparatus is capable of detecting the position of a sound source on the basis of a relationship among the phases of output signals of the microphones. Hence, the microphone array apparatus can be applied to a video conference system in which a video camera is automatically oriented towards a speaker and a speech signal and a video signal can concurrently be transmitted. In addition, the speech of the speaker can be clarified by suppressing ambient noise, and can be emphasized by aligning the phases of the speech components and adding them. The microphone array apparatus is now required to operate stably.
If the microphone array apparatus is directed to suppressing noise, filters are connected to respective microphones and filter coefficients are adaptively or fixedly set so as to minimize noise components (see, for example, Japanese Laid-Open Patent Application No. 5-111090). If the microphone array apparatus is directed to detecting the position of a sound source, the relationship among the phases of the output signals of the microphones is detected, and the distance to the sound source is detected (see, for example, Japanese Laid-Open Patent Application Nos. 63-177087 and 4-236385).
An echo canceller is known as a device which utilizes the noise suppressing technique. For example, as shown in FIG. 1, a transmit/receive interface 202 of a telephone set is connected to a network 203. An echo canceller 201 is connected between a microphone 204 and a speaker 205. The speech of a speaking person is input to the microphone 204. The speech of a speaking person on the other (remote) side is reproduced through the speaker 205. Hence, two-way communication can take place.
Speech transferred from the speaker 205 to the microphone 204, as indicated by a dotted line shown in FIG. 1, forms an echo (noise) to the other-side telephone set. Hence, the echo canceller 201 is provided, which includes a subtracter 206, an echo component generator 207 and a coefficient calculator 208. Generally, the echo component generator 207 has a filter structure which produces an echo component from the signal which drives the speaker 205. The subtracter 206 subtracts the echo component from the signal from the microphone 204. The coefficient calculator 208 controls the echo component generator 207 to update the filter coefficients so that the residual signal from the subtracter 206 is minimized.
The filter coefficients c1, c2, . . . , cr of the echo component generator 207 having the filter structure can be updated by the known steepest descent method. For example, the following evaluation function J is defined based on an output signal e (the residual signal from which the echo component has been subtracted) of the subtracter 206:
$$J = e^2 \tag{1}$$
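According to the above evaluation function, the filter coefficients c1, c2, . . . , cr are updated as follows:
$$\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_r \end{bmatrix} = \begin{bmatrix} c_{1,\mathrm{old}} \\ c_{2,\mathrm{old}} \\ \vdots \\ c_{r,\mathrm{old}} \end{bmatrix} + \alpha * (e / f_{\mathrm{norm}}) * \begin{bmatrix} f(1) \\ f(2) \\ \vdots \\ f(r) \end{bmatrix} \tag{2}$$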
where 0.0 < α < 0.5
$$f_{\mathrm{norm}} = \left( f(1)^2 + f(2)^2 + \cdots + f(r)^2 \right)^{1/2} \tag{3}$$
In the above expressions, the symbol "*" denotes multiplication, and "r" denotes the filter order. Further, f(1), . . . , f(r) respectively denote the values of a memory (delay unit) of the filter (in other words, the output signals of delay units each of which delays the respective input signal by a sample unit). The symbol "fnorm" is defined by equation (3), and the symbol "α" is a constant which determines the speed and precision of convergence of the filter coefficients towards the optimal values.
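The update of equations (1) to (3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the circuit of FIG. 1: the function names, the three-tap echo path, the uniformly distributed drive signal and the choice α = 0.1 are all assumptions introduced here, and the case fnorm = 0 (which the equations leave undefined) is handled by skipping the update.

```python
import math
import random


def update_coefficients(c, f, e, alpha):
    """One coefficient update following equations (2) and (3).

    c: current coefficients c1..cr, f: memory values f(1)..f(r),
    e: residual from the subtracter, alpha: constant with 0.0 < alpha < 0.5.
    """
    f_norm = math.sqrt(sum(v * v for v in f))  # equation (3)
    if f_norm == 0.0:
        return list(c)  # assumption: leave coefficients unchanged
    scale = alpha * e / f_norm
    return [ci + scale * fi for ci, fi in zip(c, f)]  # equation (2)


def simulate_echo_canceller(echo_path, n_steps=5000, alpha=0.1, seed=0):
    """Hypothetical test bench: the echo is the drive signal filtered by echo_path."""
    rng = random.Random(seed)
    r = len(echo_path)
    c = [0.0] * r
    f = [0.0] * r  # delay-line memory, newest sample first
    for _ in range(n_steps):
        x = rng.uniform(-1.0, 1.0)          # signal driving the speaker
        f = [x] + f[:-1]                    # shift the delay units by one sample
        mic = sum(h * v for h, v in zip(echo_path, f))     # echo at the microphone
        estimate = sum(ci * v for ci, v in zip(c, f))      # generated echo component
        e = mic - estimate                  # residual signal from the subtracter
        c = update_coefficients(c, f, e, alpha)
    return c


if __name__ == "__main__":
    print(simulate_echo_canceller([0.5, -0.3, 0.2]))
```

In this noise-free test bench the coefficients approach the assumed echo path because the residual contains only the echo; the later discussion of concurrent speech describes what happens when that no longer holds.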
The echo canceller 201 requires a filter order as high as 100. Hence, another echo canceller using a microphone array, as shown in FIG. 2, is known. The structure shown in FIG. 2 includes an echo canceller 211, a transmit/receive interface 212, microphones 214-1 to 214-n forming a microphone array, a speaker 215, a subtracter 216, filters 217-1 to 217-n, and a filter coefficient calculator 218.
In the structure shown in FIG. 2, acoustic components from the speaker 215 to the microphones 214-1 to 214-n are propagated along routes indicated by broken lines and serve as echoes. Hence, the speaker 215 is a noise source. The updating control of the filter coefficients c11, c12, . . . , c1r, . . . , cn1, cn2, . . . , cnr in the case where the speaker does not make any speech is expressed by using the evaluation function (1) as follows:
$$\begin{bmatrix} c_{11} \\ c_{12} \\ \vdots \\ c_{1r} \end{bmatrix} = \begin{bmatrix} c_{11,\mathrm{old}} \\ c_{12,\mathrm{old}} \\ \vdots \\ c_{1r,\mathrm{old}} \end{bmatrix} - \alpha * (e / f_{1,\mathrm{norm}}) * \begin{bmatrix} f_1(1) \\ f_1(2) \\ \vdots \\ f_1(r) \end{bmatrix} \tag{4}$$
$$\begin{bmatrix} c_{p1} \\ c_{p2} \\ \vdots \\ c_{pr} \end{bmatrix} = \begin{bmatrix} c_{p1,\mathrm{old}} \\ c_{p2,\mathrm{old}} \\ \vdots \\ c_{pr,\mathrm{old}} \end{bmatrix} + \alpha * (e / f_{p,\mathrm{norm}}) * \begin{bmatrix} f_p(1) \\ f_p(2) \\ \vdots \\ f_p(r) \end{bmatrix}, \quad \text{where } p = 2, 3, \ldots, n \tag{5}$$
Equation (4) relates to the case where one of the microphones 214-1 to 214-n, for example the microphone 214-1, is defined as a reference microphone, and gives the filter coefficients c11, c12, . . . , c1r of the filter 217-1, which receives the output signal of the reference microphone 214-1. Equation (5) relates to the microphones 214-2 to 214-n other than the reference microphone, and gives the filter coefficients c21, c22, . . . , c2r, . . . , cn1, cn2, . . . , cnr. The subtracter 216 subtracts the output signals of the filters 217-2 to 217-n, which correspond to the microphones 214-2 to 214-n, from the output signal of the filter 217-1, which corresponds to the reference microphone 214-1.
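A single update step of equations (4) and (5) might be sketched as follows. The function names and the list-of-lists layout (index 0 for the reference microphone) are assumptions made for illustration; the sketch also assumes that the residual e is the reference filter output minus the sum of the other filter outputs, as the description of the subtracter 216 suggests.

```python
import math


def residual(coeffs, memories):
    """Residual e: reference filter output minus the other filter outputs."""
    outputs = [sum(c * f for c, f in zip(cs, fs))
               for cs, fs in zip(coeffs, memories)]
    return outputs[0] - sum(outputs[1:])


def update_array_filters(coeffs, memories, alpha):
    """Apply equation (4) to filter 0 (reference) and equation (5) to the rest.

    coeffs[p] and memories[p] hold the r coefficients and the r memory
    values of the filter connected to microphone p (0 = reference).
    """
    e = residual(coeffs, memories)
    updated = []
    for index, (cs, fs) in enumerate(zip(coeffs, memories)):
        norm = math.sqrt(sum(v * v for v in fs))
        if norm == 0.0:
            updated.append(list(cs))  # assumption: skip when memory is empty
            continue
        sign = -1.0 if index == 0 else 1.0  # minus in (4), plus in (5)
        step = sign * alpha * e / norm
        updated.append([c + step * f for c, f in zip(cs, fs)])
    return updated


if __name__ == "__main__":
    coeffs = [[1.0, 0.0], [0.0, 1.0]]
    memories = [[3.0, 4.0], [0.6, 0.8]]
    print(residual(coeffs, memories))
    print(residual(update_array_filters(coeffs, memories, 0.1), memories))
```

With the memory values held fixed, one step scales the residual by 1 − α * (f1norm + . . . + fnnorm), which is why the opposite signs in (4) and (5) both reduce the evaluation function (1).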
FIG. 3 is a block diagram for explaining a conventional process of detecting the position of a sound source and emphasizing a target sound. The structure shown in FIG. 3 includes a target sound emphasizing unit 221, a sound source position detecting unit 222, delay units 223 and 224, a number-of-delayed-samples calculator 225, an adder 226, a crosscorrelation coefficient calculator 227, a position detection processing unit 228 and microphones 229-1 and 229-2.
The target sound emphasizing unit 221 includes the $Z^{-da}$ delay unit 223 and the $Z^{-db}$ delay unit 224, the number-of-delayed-samples calculator 225 and the adder 226. The sound source position detecting unit 222 includes the crosscorrelation coefficient calculator 227 and the position detection processing unit 228. The number-of-delayed-samples calculator 225 is controlled as follows. The crosscorrelation coefficient calculator 227 of the sound source position detecting unit 222 obtains a crosscorrelation coefficient r(i) of the output signals a(j) and b(j) of the microphones 229-1 and 229-2. The position detection processing unit 228 obtains the sound source position by referring to the value of i, denoted imax, at which the crosscorrelation coefficient r(i) is maximized.
The crosscorrelation coefficient r(i) is expressed as follows:
$$r(i) = \sum_{j=1}^{n} a(j) * b(j+i) \tag{6}$$
where $\sum_{j=1}^{n}$ denotes a summation from j=1 to j=n, and i satisfies −m ≤ i ≤ m. The symbol "m" is a value dependent on the distance between the microphones 229-1 and 229-2 and the sampling frequency, and is written as follows:
$$m = [(\text{sampling frequency}) * (\text{intermicrophone distance})] / (\text{speed of sound}) \tag{7}$$
where n is the number of samples for a convolutional operation.
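As a worked instance of equation (7), the following sketch computes the lag limit m. The numerical values (8 kHz sampling, 0.1 m spacing, 340 m/s) and the function name are assumptions chosen only for illustration. The text does not say how a fractional result is treated, so rounding up here is also an assumption, made so that the search range covers the largest physically possible lag.

```python
import math


def max_lag(sampling_frequency_hz, mic_distance_m, speed_of_sound_m_s=340.0):
    """Equation (7), rounded up to a whole number of samples (assumed rounding)."""
    m = sampling_frequency_hz * mic_distance_m / speed_of_sound_m_s
    return math.ceil(m)


if __name__ == "__main__":
    # 8000 * 0.1 / 340 is about 2.35, so the search covers i = -3 ... 3
    m = max_lag(8000.0, 0.1)
    print(m, list(range(-m, m + 1)))
```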
The number of delayed samples da of the $Z^{-da}$ delay unit 223 and the number of delayed samples db of the $Z^{-db}$ delay unit 224 can be obtained as follows from the value imax at which the crosscorrelation coefficient r(i) is maximized:
where i ≥ 0, da = i, db = 0
where i < 0, da = 0, db = −i.
Hence, the phases of the target sound from the sound source are made to coincide with each other and the signals are added by the adder 226, so that the target sound is emphasized.
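The conventional process of FIG. 3 can be sketched as follows: compute r(i) of equation (6), pick imax, derive da and db, then delay and add. The function names are hypothetical, indices are zero-based rather than starting at 1, and samples of b that fall outside the recorded range are treated as zero, which is an assumption the source does not spell out.

```python
def crosscorrelation(a, b, i):
    """Equation (6) for lag i; out-of-range samples of b count as zero."""
    total = 0.0
    for j in range(len(a)):
        k = j + i
        if 0 <= k < len(b):
            total += a[j] * b[k]
    return total


def find_imax(a, b, m):
    """Lag in -m..m at which r(i) is largest."""
    return max(range(-m, m + 1), key=lambda i: crosscorrelation(a, b, i))


def delays_from_imax(imax):
    """Numbers of delayed samples (da, db) for the two delay units."""
    return (imax, 0) if imax >= 0 else (0, -imax)


def delay(signal, d):
    """Delay a signal by d samples, padding the start with zeros."""
    return [0.0] * d + list(signal[:len(signal) - d])


def emphasize(a, b, m):
    """Align the two microphone signals and add them (adder 226)."""
    imax = find_imax(a, b, m)
    da, db = delays_from_imax(imax)
    summed = [x + y for x, y in zip(delay(a, da), delay(b, db))]
    return imax, summed


if __name__ == "__main__":
    a = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
    b = [0, 0, 0, 0, 1, 2, 1, 0, 0, 0]  # same pulse, two samples later
    print(emphasize(a, b, 3))
```

In the usage example the pulse reaches the second microphone two samples later, so imax is 2, the first signal is delayed by da = 2, and the aligned pulses add in phase.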
However, the above-mentioned conventional microphone array apparatus has the following disadvantages.
In the conventional structure directed to suppressing noise, when the speaker of the target sound source does not speak, the echo components from the speaker to the microphone array can be canceled by the echo canceller. However, when speech of the speaker and the reproduced sound from the speaker are concurrently input to the microphone array, the updating of the filter coefficients for canceling the echo components (noise components) does not converge. That is, the residual signal e in the equations (4) and (5) corresponds to the sum of the components which cannot be suppressed by the subtracter 216 and the speech of the speaker. If the filter coefficients are updated so that the residual signal e is minimized, the speech of the speaker, which is the target sound, is suppressed along with the echo components (noise). Hence, the noise cannot be suppressed without also suppressing the target sound.
In the conventional structure directed to detecting the sound source position and emphasizing the target sound, the output signals a(j) and b(j) of the microphones 229-1 and 229-2 shown in FIG. 3 generally have an autocorrelation in the vicinity of the sampled values. If the sound source is white noise or pulse noise, the autocorrelation is small, while the autocorrelation for voice is large. The crosscorrelation function r(i) defined in the equation (6) varies less as a function of i for a signal having comparatively large autocorrelation than for a signal having comparatively small autocorrelation. Hence, it is very difficult to obtain the correct maximum value and to detect the position of the sound source precisely and rapidly.
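The effect can be illustrated with a small numerical sketch. When both microphones receive the same source with a relative delay, the crosscorrelation is essentially a shifted autocorrelation of the source, so the ratio of the autocorrelation at lag 1 to that at lag 0 indicates how flat the peak of r(i) is. The signals below are assumptions for illustration only: uniformly distributed white noise, and a first-order recursive smoothing of it (factor 0.95) standing in for a strongly autocorrelated signal such as voice.

```python
import random


def autocorrelation(x, lag):
    """Sum of x(j) * x(j + lag) over the available samples."""
    return sum(x[j] * x[j + lag] for j in range(len(x) - lag))


def peak_sharpness(x):
    """Ratio r(1) / r(0): near 0 means a sharp peak, near 1 a flat one."""
    return autocorrelation(x, 1) / autocorrelation(x, 0)


def make_signals(n=2000, seed=0, smoothing=0.95):
    """Return (white noise, smoothed noise with large autocorrelation)."""
    rng = random.Random(seed)
    white = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    smooth = []
    previous = 0.0
    for value in white:
        previous = smoothing * previous + value
        smooth.append(previous)
    return white, smooth


if __name__ == "__main__":
    white, smooth = make_signals()
    print(peak_sharpness(white), peak_sharpness(smooth))
```

For the smoothed signal the neighbouring lag is almost as large as the peak, so a small disturbance can move the detected maximum, which is the difficulty described above.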
In the conventional structure directed to emphasizing the target sound so that the phases of the target sounds are synchronized, the degree of emphasis depends on the number of microphones forming the microphone array. If there is a small crosscorrelation between the target sound and noise, the use of N microphones emphasizes the target sound so that the power ratio is as large as N times. If there is a large crosscorrelation between the target sound and noise, the power ratio is small. Hence, in order to emphasize the target sound which has a large crosscorrelation to the noise, it is required to use a large number of microphones. This leads to an increase in the size of the microphone array. It is also very difficult to identify, in a noisy environment, the position of the sound source by utilizing the crosscorrelation coefficient value of the equation (6).
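The N-fold figure can be checked with a short derivation. It rests on assumptions that the text does not state explicitly: after phase alignment every microphone k receives the same target component s(j) with power P_s, and a noise component n_k(j) of equal power σ² that is uncorrelated with s(j) and with the noise at the other microphones.

```latex
% Output of the synchronous addition over N aligned microphones
y(j) = \sum_{k=1}^{N} \bigl( s(j) + n_k(j) \bigr)
     = N\, s(j) + \sum_{k=1}^{N} n_k(j)

% Powers of the two parts (all cross terms vanish under the assumptions)
E\bigl[ (N s)^2 \bigr] = N^2 P_s ,
\qquad
E\Bigl[ \Bigl( \sum_{k=1}^{N} n_k \Bigr)^{2} \Bigr] = N \sigma^2

% Power ratio after addition, compared with P_s / \sigma^2 for one microphone
\frac{N^2 P_s}{N \sigma^2} = N \cdot \frac{P_s}{\sigma^2}
```

When the noise is correlated with the target sound, the cross terms no longer vanish and the gain falls below N, which is consistent with the small power ratio described above.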
It is a general object of the present invention to provide a microphone array apparatus in which the above disadvantages are eliminated.
A more specific object of the present invention is to provide a microphone array apparatus capable of stably and precisely suppressing noise, emphasizing a target sound and identifying the position of a sound source.
The above objects of the present invention are achieved by a microphone array apparatus comprising: a microphone array including microphones (which correspond to parts indicated by reference numbers 1-1 to 1-n in the following description), one of the microphones being a reference microphone (1-1); filters (2-1 to 2-n) receiving output signals of the microphones; and a filter coefficient calculator (4) which receives the output signals of the microphones, a noise and a residual signal obtained by subtracting filtered output signals of the microphones other than the reference microphone from a filtered output signal of the reference microphone and which obtains filter coefficients of the filters in accordance with an evaluation function based on the residual signal. With this structure, even when speech of a speaker corresponding to the sound source and the noise are concurrently applied to the microphones, the crosscorrelation function value is reduced so that the noise can be effectively suppressed and the filter coefficients can continuously be updated.
The above microphone array apparatus may be configured so that it further comprises: delay units (8-1 to 8-n) provided in front of the filters; and a delay calculator (9) which calculates amounts of delays of the delay units on the basis of a maximum value of a crosscorrelation function of the output signals of the microphones and the noise. Hence, the filter coefficients can easily be updated.
The microphone array apparatus may be configured so that the noise is a signal which drives a speaker. This structure is suitable for a system that has a speaker in addition to the microphones. A reproduced sound from the speaker may serve as noise. By handling the speaker as a noise source, the signal driving the speaker can be handled as the noise, and thus the filter coefficients can easily be updated.
The microphone array apparatus may further comprise a supplementary microphone (21) which outputs the noise. This structure is suitable for a system which has microphones but does not have a speaker. The output signal of the supplementary microphone can be used as the noise.
The microphone array apparatus may be configured so that the filter coefficient calculator includes a cyclic type low-pass filter (FIG. 10) which applies a comparatively small weight to memory values of a filter portion which executes a convolutional operation in an updating process of the filter coefficients.
The above objects of the present invention are also achieved by a microphone array apparatus comprising: a microphone array including microphones (51-1, 51-2); linear predictive filters (52-1, 52-2) receiving output signals of the microphones; linear predictive analysis units (53-1, 53-2) which receive the output signals of the microphones and update filter coefficients of the linear predictive filters in accordance with a linear predictive analysis; and a sound source position detector (54) which obtains a crosscorrelation coefficient value based on linear predictive residuals of the linear predictive filters and outputs information concerning the position of a sound source based on a value which maximizes the crosscorrelation coefficient. Hence, even when speech of a speaker corresponding to the sound source and the noise are concurrently applied to the microphones, autocorrelation function values of samples of the speech signal are reduced by the linear predictive analysis, so that the position of the target sound source can accurately be detected. Thus, speech from the target sound source can be emphasized and noise components other than the target sound can be suppressed.
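The idea of correlating linear predictive residuals instead of the raw microphone signals might be sketched as follows. This is a sketch under stated assumptions, not the disclosed circuit: the Levinson-Durbin recursion is one common way to carry out a linear predictive analysis and is chosen here only for illustration, the prediction order of 2 is arbitrary, and the test signal (smoothed noise, with the second microphone receiving a copy delayed by three samples) and all function names are hypothetical.

```python
import random


def correlation(a, b, lag):
    """Sum of a(j) * b(j + lag); samples outside b count as zero."""
    return sum(a[j] * b[j + lag] for j in range(len(a)) if 0 <= j + lag < len(b))


def lpc_coefficients(x, order):
    """Levinson-Durbin recursion (an assumed choice of analysis method).

    Returns a[1..order] so that x(k) is predicted by sum(a[i] * x(k - i)).
    """
    r = [correlation(x, x, lag) for lag in range(order + 1)]
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        previous = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = previous[j] - k * previous[i - j]
        error *= 1.0 - k * k
    return a[1:]


def lpc_residual(x, coeffs):
    """Prediction error (linear predictive residual) of the signal x."""
    residual = []
    for k in range(len(x)):
        prediction = sum(c * x[k - i]
                         for i, c in enumerate(coeffs, start=1) if k - i >= 0)
        residual.append(x[k] - prediction)
    return residual


def find_imax(a, b, m):
    """Lag in -m..m at which the crosscorrelation is largest."""
    return max(range(-m, m + 1), key=lambda i: correlation(a, b, i))


def demo(n=2000, true_delay=3, m=8, order=2, seed=0):
    """Compare peak flatness for raw signals and for their residuals."""
    rng = random.Random(seed)
    source, previous = [], 0.0
    for _ in range(n):
        previous = 0.95 * previous + rng.uniform(-1.0, 1.0)
        source.append(previous)
    mic_a = source
    mic_b = [0.0] * true_delay + source[:n - true_delay]
    res_a = lpc_residual(mic_a, lpc_coefficients(mic_a, order))
    res_b = lpc_residual(mic_b, lpc_coefficients(mic_b, order))
    imax = find_imax(res_a, res_b, m)
    raw_ratio = (correlation(mic_a, mic_b, true_delay + 1)
                 / correlation(mic_a, mic_b, true_delay))
    residual_ratio = (correlation(res_a, res_b, imax + 1)
                      / correlation(res_a, res_b, imax))
    return imax, raw_ratio, residual_ratio


if __name__ == "__main__":
    print(demo())
```

In this sketch the neighbouring-lag ratio is close to 1 for the raw signals and close to 0 for the residuals, so the maximum of the crosscorrelation of the residuals is far easier to locate.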
The microphone array apparatus may be configured so that: a target sound source is a speaker; and the linear predictive analysis unit updates the filter coefficients of the linear predictive filters by using a signal which drives the speaker. Hence, a single linear predictive analysis unit can be shared by the linear predictive filters corresponding to the microphones.
The above-mentioned objects of the present invention are also achieved by a microphone array apparatus comprising: a microphone array including microphones (61-1, 61-2); a signal estimator (62) which estimates positions of estimated microphones in accordance with intervals at which the microphones are arranged by using the output signals of the microphones and a velocity of sound and which outputs output signals of the estimated microphones together with the output signals of the microphones forming the microphone array; and a synchronous adder (63) which aligns the phases of the output signals of the microphones and the estimated microphones and then adds the output signals. Hence, even if a small number of microphones is used to form an array, the target sound can be emphasized and the position of the target sound source can precisely be detected as if a large number of microphones were used.
The microphone array apparatus may further comprise a reference microphone (71) located on an imaginary line connecting the microphones forming the microphone array and arranged at the intervals at which the microphones forming the microphone array are arranged, wherein the signal estimator corrects the estimated positions of the estimated microphones and the output signals thereof on the basis of the output signals of the microphones forming the microphone array.
The microphone array apparatus may further comprise an estimation coefficient decision unit (74) which weights an error signal, corresponding to a difference between the output signal of the reference microphone and the output signals of the signal estimator, in accordance with an acoustic sense characteristic so that the signal estimator performs a signal estimating operation with a comparatively high precision on a band having a comparatively high acoustic sense.
The microphone array apparatus may be configured so that: given angles are defined which indicate directions of a sound source with respect to the microphones forming the microphone array; the signal estimator includes parts which are respectively provided to the given angles; the synchronous adder includes parts which are respectively provided to the given angles; and the microphone array apparatus further comprises a sound source position detector which outputs information concerning the position of a sound source based on a maximum value among the output signals of the parts of the synchronous adder.
The above objects of the present invention are also achieved by a microphone array apparatus comprising: a microphone array including microphones (91-1, 91-2); a sound source position detector (92) which detects a position of a sound source on the basis of output signals of the microphones; a camera (90) generating an image of the sound source; a second detector (93) which detects the position of the sound source on the basis of the image from the camera; and a joint decision processing unit (94) which outputs information indicating the position of the sound source on the basis of the information from the sound source position detector and the information from the second detector. Hence, the position of the target sound source can be rapidly and precisely detected.