The present invention relates to an apparatus for detecting a position of an object, a method therefor, a voice collecting apparatus, a method therefor, a filter calculation apparatus and a method therefor.
Hitherto, the position of an object at a doorway of a house or indoors has been detected either by processing image information obtained with a video camera or by using a sensor to detect changes in applied radio waves or light. However, these methods cannot detect an object that is located in a shadowed position or is outside the visual field of the camera. Accordingly, Jpn. Pat. KOKAI Publication No. 7-146366 discloses a method that uses the diffraction effect of sound waves to detect an object located in a shadowed position. In this method, sound waves are radiated and their echo is detected to obtain an acoustic transmission characteristic, and the position of an object is detected from the difference in the transmission characteristic attributable to the presence of the object. Either one sound source and a plurality of sensors, or a plurality of sound sources transmitting the same signal and one sensor, are used to measure the impulse response, which is the time-domain representation of the acoustic transmission characteristic, so as to detect the position of the object.
For use in a voice recognition apparatus or a television conference system, noise suppression techniques that use a directional microphone or a microphone array and are capable of collecting voice of excellent quality have been suggested. To automatically obtain the voice and the image of a speaker from among a plurality of attendees of a conference using the television conference system, a method has been disclosed in, for example, Jpn. Pat. KOKAI Publication No. 5-227531 in which signals from a plurality of microphones are processed in accordance with the position of a mobile object obtained by processing an image picked up by a video camera.
However, the above-mentioned method processes the signals from the microphone array by a delay-and-sum method, which aligns the phases of the signals with respect to the voice arriving from the position of the required person, and therefore suffers from a problem in that the effect of suppressing noise arriving from other directions is unsatisfactory.
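By way of illustration only, the delay-and-sum method referred to above may be sketched as follows. The sketch assumes integer sample delays and a plain average over the channels; the function name delay_and_sum and the sample signals are hypothetical and are not taken from the cited publication.

```python
def delay_and_sum(channels, delays):
    """Align each channel by an integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   delays[i] is the assumed number of samples by which the
              wanted signal arrives later at microphone i than at the
              reference microphone.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = t + d  # advance the late channel so the phases line up
            if 0 <= idx < n:
                acc += ch[idx]
        out.append(acc / len(channels))
    return out


# Hypothetical data: a pulse reaches microphone 1 two samples after microphone 0.
mic0 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# Array steered to the source of the pulse (delays match the arrival).
steered_to_source = delay_and_sum([mic0, mic1], [0, 2])

# Array steered elsewhere (zero delays), so the pulse acts as noise.
steered_elsewhere = delay_and_sum([mic0, mic1], [0, 0])
```

When the delays match the arrival, the pulse is passed with unit gain. When the array is steered elsewhere, the same pulse is merely halved and spread over two samples rather than removed, which corresponds to the limited suppression of noise from other directions noted above.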
A technique for effectively suppressing noise by processing the outputs of a microphone array with an adaptive filter to control the directivity is known, as disclosed in, for example, the document "Acoustic System and Digital Processing", edited by Electronic Information Communication Society, pp. 171-218. Although the adaptive microphone array process does not need to detect the direction of arrival of the noise, the direction from which the required sound wave arrives is treated as known. Although this direction of arrival can be estimated by processing the signals from the microphone array, the estimation can be performed only while speech is being uttered. Therefore, the process has a problem in stability.
Another known method uses the position of a person, obtained by processing an image, as the direction of arrival of the object sound, as disclosed in, for example, the document ICASSP '95, "Knowing Who to Listen to in Speech Recognition: Visually Guided Beamforming", pp. 848-851. In this case, the process can be performed stably because the position can be estimated even when no speech is uttered.
The processing of signals obtained by an antenna array or a microphone array, formed of a plurality of antennas or microphones, mainly uses an adaptive filter in order to automatically eliminate noise arriving from unknown directions. In particular, an adaptive filter having a constraint condition is convenient because the adaptive process for eliminating noise from an unknown direction can be performed while the response of the array with respect to the object direction is maintained. Therefore, such an adaptive filter is widely employed.
As described in the document "Adaptive Filter Theory", Prentice Hall, written by Haykin, the adaptive filter having a constraint condition minimizes the output power of a tapped-delay-line filter under a constraint condition expressed by a linear equation so as to obtain an optimum filter coefficient. Since the constraint condition determines the response of the filter with respect to a certain direction or frequency and must generally be expressed with complex numbers, the filter coefficients are also expressed with complex numbers. However, there arises a problem in that a complex-valued filter requires a larger amount of calculation than a real-valued filter having the same number of taps.
When input signals X for plural channels are supplied to a filter W provided with a tapped delay line for each channel (corresponding to sensor 1, ..., sensor i, ..., sensor M) as shown in FIG. 47, the minimum variance filter having a constraint condition can be obtained by minimizing the expected value of the following output power of the filter under the condition that the response with respect to the object direction is kept constant:
$$E[y^{2}] = E[W^{H} X X^{H} W] = W^{H} R W \qquad (1'')$$
where E[ ] denotes the expected value.
Assuming that the filter coefficient at the j-th tap of the i-th channel is $w_{ij}$, the filter W is expressed as follows:
$$W = (w_{11}, w_{12}, \ldots, w_{i,j-1}, w_{ij}, w_{i,j+1}, \ldots, w_{ML})^{T}$$
Assuming that the signal supplied to the j-th tap of the i-th channel is $x_{ij}$, the input signal X is expressed as follows:
$$X = (x_{11}, x_{12}, \ldots, x_{i,j-1}, x_{ij}, x_{i,j+1}, \ldots, x_{ML})^{T}$$
where $R = E[X X^{H}]$ is the autocorrelation matrix of X, M is the number of channels and L is the number of taps.
The constraint condition is expressed as follows:
$$A^{H} W = G \qquad (2'')$$
where G is a constant column vector whose dimension equals the number K of constraint conditions, for example $G = [1, 1, \ldots, 1]^{T}$, and A is a matrix whose columns are the steering vectors $a_{k}$, each corresponding to a different frequency, and is expressed as follows:
$$A = [a_{1}, \ldots, a_{K}] \qquad (3'')$$
Each vector $a_{k}$ (k = 1, ..., K) is expressed as follows:
$$a_{k} = (1, e^{-j\omega_{k}\tau_{2}}, \ldots, e^{-j\omega_{k}\tau_{M}})^{T} \qquad (4'')$$
where $\tau_{2}, \ldots, \tau_{M}$ are the differences in the propagation time of the signals supplied to the respective channels, with the first channel taken as the reference, and $\omega_{k}$ is an angular frequency. The difference in the propagation time is determined in accordance with the position of the antenna or sensor on which the signal is incident and the spatial angle of the incident signal.
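For reference, the steering vector of Equation (4") may be computed as in the following minimal sketch. The helper linear_array_delays assumes, purely for illustration, a uniform linear array receiving a far-field plane wave; the sensor spacing, incidence angle, propagation speed and frequency used below are hypothetical values and do not appear in the cited documents.

```python
import cmath
import math


def steering_vector(omega, taus):
    """Return a_k = (1, e^{-j*omega*tau_2}, ..., e^{-j*omega*tau_M}).

    omega: angular frequency in rad/s.
    taus:  propagation-time differences tau_2..tau_M in seconds,
           relative to the first (reference) channel.
    """
    return [1 + 0j] + [cmath.exp(-1j * omega * tau) for tau in taus]


def linear_array_delays(num_sensors, spacing, angle_rad, speed):
    """Assumed geometry: uniform linear array, far-field plane wave.

    Returns tau_2..tau_M for sensors placed `spacing` metres apart and a
    wave incident at `angle_rad` from broadside, travelling at `speed` m/s.
    """
    return [i * spacing * math.sin(angle_rad) / speed
            for i in range(1, num_sensors)]


omega = 2 * math.pi * 1000.0  # hypothetical 1 kHz component

# Oblique incidence: the entries are complex numbers of unit magnitude.
a = steering_vector(
    omega, linear_array_delays(4, 0.05, math.radians(30.0), 340.0))

# Broadside incidence: no time difference, so every entry equals 1.
broadside = steering_vector(
    omega, linear_array_delays(4, 0.05, 0.0, 340.0))
```

Every entry has unit magnitude, and the entries become real only when the propagation-time differences vanish, as in the broadside case.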
Although the minimization problem expressed by Equations (1") and (2") may be solved directly by the method of Lagrange multipliers, the solution is usually obtained iteratively by using, for example, a Least Mean Square (LMS) adaptive filter in order to process signals that are supplied sequentially. In this case, the filter coefficient $W_{n}$ after n iterations is given by the following equation in accordance with the projection-type LMS algorithm described in, for example, O. L. Frost, III, "An Algorithm for Linearly Constrained Adaptive Array Processing", Proceedings of the IEEE, Vol. 60, No. 8, pp. 926-935 (1972):
$$W_{n} = P[W_{n-1} - \mu y_{n} X] + F \qquad (5'')$$
where $W_{n}$ is the filter coefficient updated n times, P is the projection matrix onto the subspace determined by the constraint condition, F is the translation vector from that subspace to the space satisfying the constraint condition, and $\mu$ is the step size. P and F are calculated as follows:
$$P = I - A (A^{H} A)^{-1} A^{H} \qquad (6'')$$
$$F = A (A^{H} A)^{-1} G \qquad (7'')$$
If the steering vectors of Equation (4") are expressed with complex numbers, the foregoing calculations must be performed in complex arithmetic.
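The projection-type update of Equations (5") to (7") may be sketched as follows for the special case of a single constraint (K = 1), in which $A^{H}A$ reduces to a scalar. This is a minimal sketch under stated assumptions and not the implementation of the cited reference: the names hdot and frost_step, the two-sensor steering vectors, the step size and the synthetic input are all hypothetical, and the conjugate applied to y is the form commonly used for complex data (for real data it coincides with Equation (5")).

```python
import cmath


def hdot(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def frost_step(w, x, a, g, mu):
    """One projection-type LMS update under the single constraint a^H w = g."""
    y = hdot(w, x)  # filter output y = W^H X
    # Unconstrained gradient step (conjugate on y: assumption for complex data).
    v = [wi - mu * y.conjugate() * xi for wi, xi in zip(w, x)]
    norm = hdot(a, a).real  # a^H a, a scalar because K = 1
    proj = hdot(a, v) / norm  # component of v along a
    # P v + F with P = I - a a^H / (a^H a) and F = a g / (a^H a)
    return [vi - ai * proj + ai * g / norm for vi, ai in zip(v, a)]


# Hypothetical two-sensor example with complex steering vectors.
a = [1 + 0j, cmath.exp(-0.5j)]  # object direction (constrained)
b = [1 + 0j, cmath.exp(-2.0j)]  # interference direction (to be suppressed)
g, mu = 1 + 0j, 0.1

w = [ai * g / hdot(a, a).real for ai in a]  # start from W_0 = F
initial_leak = abs(hdot(w, b))  # response to the interference before adaptation

for k in range(200):
    n = cmath.exp(1j * 0.7 * k)  # unit-amplitude interference sample
    x = [n * bi for bi in b]  # snapshot containing interference only
    w = frost_step(w, x, a, g, mu)

final_leak = abs(hdot(w, b))  # response to the interference after adaptation
constraint_error = abs(hdot(a, w) - g)  # deviation from a^H w = g
```

In this synthetic case the response toward the interference direction decays at each iteration while the constraint toward the object direction is re-imposed by the projection and translation at every step. Every product in the loop involves complex operands, which illustrates the computational burden described above.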
However, in the foregoing methods, a time delay is first applied so that the signals arriving from the object direction are in phase across the channels, and the constraint condition for the object direction is set afterwards. It can therefore be assumed that no time difference exists among the input channels, so that the constraint condition can be expressed with real numbers. Under the real-valued constraint condition, the optimum filter is calculated by using real numbers.
However, the above-mentioned method of detecting the position of an object extracts information on only the one object nearest the measuring point and is therefore able to detect only one object. Thus, there arises a problem in that the method cannot be employed when a plurality of objects must be detected simultaneously.
Since the above-mentioned voice collecting apparatus cannot be used when image processing detects the positions of a plurality of persons, an adaptive process has been performed to remove the speech of any person other than the object person when such speech occurs. However, if interfering sound is mixed in before the adaptation process is completed, or if a plurality of speakers speak simultaneously, there arises a problem in that the voices of persons other than the person of interest cannot be input clearly.
The above-mentioned method of calculating a filter can be adapted to a case where a plurality of object directions exist by setting constraint conditions for the plural directions. Although the constraint condition with respect to one direction can be expressed with real numbers by delaying the input signals, the constraint conditions with respect to the other directions must be expressed with complex numbers in order to express the time differences between the channels of the input signals. Therefore, the calculation for obtaining the filter coefficients must also be performed with complex numbers. In this case, there arises a problem in that the quantity of calculation cannot be reduced.
In document A (K. Takao et al., "An adaptive antenna array under directional constraint", IEEE Trans. Antennas Propagat., vol. AP-24, pp. 662-669, September 1976), a method has been disclosed in which the constraint condition is determined for each frequency and the calculations are performed by using real numbers. However, there arises a problem in that the number of constraint conditions must be made sufficiently large to prevent ripple in the frequency characteristic in the object direction.
In document B (K. M. Buckley, "Spatial/Spectral Filtering with Linearly Constrained Minimum Variance Beamformers", IEEE Trans. Acoustics, Speech, and Signal Processing, Vol. ASSP-35, No. 3, March 1987), the constraint condition is determined in accordance with eigenvalue decomposition of a correlation matrix of an input signal. However, the eigenvalue decomposition requires a large quantity of calculation. Thus, there arises a problem when the object direction is changed frequently.