1. Field of the Invention
The present invention relates to speaker localization, and more particularly, to a method and apparatus for noise-robust speaker localization using spectral subtraction between the pseudo-power spectrum in a speech section of an input signal and the pseudo-power spectrum in a non-speech section of the input signal and an automatic camera steering system employing the same.
2. Description of Related Art
Recently, mobile robots operating in indoor environments have attracted attention due to needs in health care, safety, home networking, entertainment, and so on. Human-robot interaction (HRI) is essential in such a mobile robot. Typically, such a robot has microphones, a vision system, ultrasonic sensors, infrared sensors, laser sensors, and the like, and by using these devices should recognize human beings and surrounding situations. In particular, the location of a person talking near the robot should be identified and the person's speech should be understood so that HRI can be implemented efficiently.
In a mobile robot, a voice and sound input system is an essential element not only for HRI but also for autonomous navigation. Important issues arising in a voice input system in an indoor environment include noise, reverberation, and distance. In an indoor environment, there are reverberations caused by a variety of noise sources, walls, and other objects. Also, the low-frequency component of voice has the characteristic that it is attenuated more with distance than the high-frequency component. Accordingly, in a noisy indoor environment, a voice input system for HRI should enable a mobile robot to navigate autonomously, receive the voice of a user at a distance of several meters, and identify the location of the user, so that the voice can be used directly for speech recognition after speech enhancement and noise removal.
Generally, methods of estimating sound source direction are broken down into beamformer-based methods, time-delay-of-arrival (TDOA)-based methods, and spectrum-estimation-based methods. Beamformer-based methods have two shortcomings. First, the frequency components of the sound source, in addition to those of the noise, must be known in advance. Second, the objective function to be minimized frequently has a plurality of local minima rather than a single global minimum. Accordingly, beamformer-based methods are not appropriate for sound source direction estimation.
Meanwhile, TDOA-based methods usually use two microphones: the time difference between the signals arriving at the two microphones from a sound source is obtained, and from it the direction of the sound source is estimated. Generalized cross-correlation (GCC) is a leading example. This approach has the drawback that its performance degrades rapidly in the presence of reverberation and is strongly affected by the characteristics of the background noise. In addition, it is restricted in that usually only two microphones are used and the method applies only in free space. Accordingly, if a plurality of microphones are arranged around the circumference of the body of a robot in order to cover 360° and there is no direct path from the sound source to each microphone, an inaccurate time difference is obtained. Therefore, TDOA-based methods are also not appropriate for sound source direction estimation.
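The two-microphone TDOA estimation described above can be sketched as follows. This is a minimal illustration using the PHAT-weighted variant of GCC on two synthetic microphone signals at an assumed 16 kHz sampling rate; the function name and signal setup are illustrative, not part of the related art being described.

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay of sig relative to ref via GCC with PHAT weighting."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep only phase information
    cc = np.fft.irfft(R, n=n)              # generalized cross-correlation
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                      # delay in seconds

# Two microphones receive the same random signal, one copy delayed by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
s = rng.standard_normal(1024)
delay = 5
mic1 = np.concatenate((s, np.zeros(delay)))
mic2 = np.concatenate((np.zeros(delay), s))  # delayed copy of mic1
tau = gcc_phat(mic2, mic1, fs)
print(round(tau * fs))                        # estimated delay in samples
```

Given the inter-microphone delay τ, the direction angle would then follow from the known microphone spacing and the speed of sound, under the free-space assumption noted above.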
Meanwhile, spectrum-estimation-based methods find the direction of a sound source by estimating and analyzing the frequency components of the signals incident on a microphone array. They include the autoregressive method, the minimum variance method, and subspace methods. Among them, subspace methods have the advantage of being relatively free from the restriction that the estimation applies only in free space, and are therefore easy to apply to an indoor environment. Subspace methods include multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT). Of these, the MUSIC algorithm is known to be the most frequently used and to have the best performance. The MUSIC algorithm is disclosed in detail in R. O. Schmidt, "Multiple Emitter Location and Signal Parameter Estimation," IEEE Trans. Antennas Propag., vol. AP-34, pp. 276-280, March 1986, and the ESPRIT algorithm is disclosed in detail in R. Roy and T. Kailath, "Estimation of Signal Parameters via Rotational Invariance Techniques," IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-37, pp. 984-995, 1989.
According to the MUSIC algorithm, voice signals from the M microphones forming a microphone array are input and each signal is divided into sections of a specified length. Then, an M×M covariance matrix of the signals in each divided section is obtained, the basis vectors of the noise subspace are obtained from the covariance matrix by eigenvalue decomposition, and a pseudo-power spectrum is obtained by projecting steering vectors, computed in advance, onto the basis vectors of the noise subspace. Since the steering vector corresponding to the direction of a sound source yields a small value close to '0' when projected onto the basis vectors of the noise subspace, the pseudo-power spectrum has a very large value in that direction. If the peak values of the pseudo-power spectrum covering 360° are finally obtained, the direction angle corresponding to each peak is the direction of a sound source.
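The steps above (covariance matrix, eigenvalue decomposition, projection onto the noise subspace, pseudo-power spectrum) can be sketched as follows. This is a minimal simulation for a single narrowband source on an assumed four-element half-wavelength uniform linear array; the array geometry, element count, and all names are illustrative assumptions, not the microphone arrangement of the invention.

```python
import numpy as np

def music_spectrum(X, steering, num_sources):
    """MUSIC pseudo-power spectrum. X: M x N snapshots; steering: M x D candidate vectors."""
    M, N = X.shape
    R = X @ X.conj().T / N                  # M x M sample covariance matrix
    w, V = np.linalg.eigh(R)                # eigenvalues in ascending order
    En = V[:, :M - num_sources]             # basis vectors of the noise subspace
    proj = En.conj().T @ steering           # project each steering vector onto it
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)  # large where projection is near 0

# Simulation: one narrowband source at 30 degrees, M = 4 elements, N = 2000 snapshots.
M, N = 4, 2000
angles = np.arange(-90, 91)                 # candidate directions in degrees
def steer(theta_deg):
    # assumed model: half-wavelength spacing gives a phase of pi*sin(theta) per element
    return np.exp(-1j * np.pi * np.arange(M) * np.sin(np.deg2rad(theta_deg)))
A = np.stack([steer(t) for t in angles], axis=1)   # M x D steering matrix
rng = np.random.default_rng(1)
s = rng.standard_normal(N) + 1j * rng.standard_normal(N)
noise = 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
X = np.outer(steer(30), s) + noise
P = music_spectrum(X, A, num_sources=1)
print(angles[np.argmax(P)])                 # peak should lie near 30 degrees
```

A circular array covering 360°, as described for the robot body, would use the same projection step with steering vectors derived from the circular geometry instead of the linear one assumed here.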
Theoretically, the MUSIC algorithm can find the direction of each sound source whenever the number of sound sources is less than the number of microphones in the array. For convenience of explanation, assume there is one voice source (a speaker) and one noise source; usually the direction having the highest peak value is determined to be the direction of the speaker. In a noisy environment, however, although the directions of both the noise and the voice can be estimated, it is impossible to distinguish the desired speaker direction from the other direction. For example, if the power of the noise is greater than the power of the voice and the direction in which the pseudo-power spectrum has the largest amplitude is taken as the voice direction, there is a problem in that the direction of the noise can be mistaken for the direction of the voice.