1. Field
Embodiments relate to an apparatus and method for isolating a multi-channel sound source, which separate each sound source from a multi-channel sound signal received at a plurality of microphones on the basis of the stochastic independence of the sound sources, in an environment in which a plurality of sound sources is present.
2. Description of the Related Art
Demand is rapidly increasing for technology capable of removing a variety of peripheral noise and the voice signals of third parties from a sound signal generated when a user talks with another person in a video communication mode using a television (TV) at home or in an office, or talks with a robot.
In recent times, many developers and companies have been conducting intensive research into Blind Source Separation (BSS) techniques, such as Independent Component Analysis (ICA), which, in an environment including a plurality of sound sources, separate each sound source from a multi-channel signal received at a plurality of microphones on the basis of the stochastic independence of the sound sources.
BSS is a technology capable of separating each sound source signal from a sound signal in which several sound sources are mixed. The term “blind” indicates the absence of information about either the original sound source signals or the mixing environment.
In the case of a linear mixture, in which each signal is multiplied by a weight and the weighted signals are summed, each sound source can be separated using ICA alone. In the case of a convolutive mixture, in which each signal is transmitted from the corresponding sound source to a microphone through a medium such as air, ICA alone is not sufficient to isolate the sound sources. In more detail, sound waves propagated from the sound sources interfere with one another in space as they are transmitted through the medium, so that specific frequency components are amplified or attenuated. In addition, the frequency components of the original sound are greatly distorted by reverb (echo) that is reflected from a wall or floor before arriving at a microphone, so that it is very difficult to recognize which frequency component present in the same time period corresponds to which sound source. As a result, it is impossible to separate the sound sources using ICA alone.
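The difference between the two mixing models can be illustrated with a minimal NumPy sketch. The source signals, the mixing weights, and the impulse-response taps below are hypothetical values chosen only for illustration, and the exact inverse of the mixing matrix stands in for the separation matrix that ICA would have to estimate. The sketch shows that a single weight matrix recovers the sources from a linear mixture but leaves the delayed reflections of a convolutive mixture in place.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Two independent, non-Gaussian sources (hypothetical test signals).
s = rng.laplace(size=(2, n))

# Linear mixture: each microphone signal is a weighted sum of the sources.
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
x_lin = A @ s

# Convolutive mixture: each source reaches each microphone through its own
# impulse response h[mic, source, tap] (direct path plus reflections).
h = np.zeros((2, 2, 8))
h[:, :, 0] = A                             # direct path
h[:, :, 3] = [[0.4, -0.3], [0.2, 0.5]]     # an early reflection (assumed)
h[:, :, 7] = [[-0.2, 0.1], [0.3, -0.2]]    # a later reflection (assumed)
x_conv = np.zeros((2, n))
for m in range(2):
    for j in range(2):
        x_conv[m] += np.convolve(s[j], h[m, j])[:n]

# Idealized single-matrix separation: the exact inverse of A.
W = np.linalg.inv(A)
err_lin = np.max(np.abs(W @ x_lin - s))    # essentially zero
err_conv = np.max(np.abs(W @ x_conv - s))  # reflections remain

print(f"linear mixture residual:      {err_lin:.2e}")
print(f"convolutive mixture residual: {err_conv:.2e}")
```

Because the reflections arrive with different delays, no single matrix applied sample by sample can cancel them, which is why a convolutive mixture calls for a separation filter rather than a separation matrix.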
In order to obviate the above-mentioned problem, a first thesis (J.-M. Valin, J. Rouat, and F. Michaud, “Enhanced robot audition based on microphone array source separation with post-filter”, IEEE International Conference on Intelligent Robots and Systems (IROS), Vol. 3, pp. 2123-2128, 2004) and a second thesis (Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, and K. Shikano, “Blind Spatial Subtraction Array for Speech Enhancement in Noisy Environment,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 17, No. 4, pp. 650-664, 2009) have been proposed. Referring to the second thesis, beamforming for amplifying only sound from a specific direction is applied to search for the position of the corresponding sound source, and a separation filter created through ICA is initialized accordingly so that separation performance can be maximized.
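As a rough illustration of beamforming that amplifies only sound from a specific direction, the following sketch computes delay-and-sum weights for one frequency bin. The array geometry (a uniform linear array of four microphones with 5 cm spacing), the far-field model, the source angles, and the function names are assumptions made for this example and are not taken from the cited theses. The last lines stack one beamformer per assumed source direction, which is one possible way of forming an initial separation matrix.

```python
import numpy as np

def steering_vector(theta_deg, freq_hz, n_mics=4, spacing_m=0.05, c=343.0):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    delays = np.arange(n_mics) * spacing_m * np.sin(np.deg2rad(theta_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def delay_and_sum_weights(theta_deg, freq_hz, **array_kwargs):
    """Weights that align and average the channels for one look direction."""
    d = steering_vector(theta_deg, freq_hz, **array_kwargs)
    return d / len(d)

freq = 2000.0                      # frequency bin in Hz (assumed)
w = delay_and_sum_weights(30.0, freq)

# Response w^H d: unity toward the look direction, attenuated elsewhere.
gain_look = abs(np.vdot(w, steering_vector(30.0, freq)))
gain_off = abs(np.vdot(w, steering_vector(-40.0, freq)))

# One beamformer per assumed source direction, stacked as rows, gives a
# candidate initial separation matrix for this frequency bin.
W_init = np.stack([delay_and_sum_weights(t, freq).conj() for t in (30.0, -40.0)])

print(f"gain toward 30 degrees:  {gain_look:.3f}")
print(f"gain toward -40 degrees: {gain_off:.3f}")
```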
According to the first thesis, additional signal processing based on the voice estimation technologies described in the following third to fifth theses is applied to a signal separated by beamforming and geometric source separation (GSS). The third thesis is I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise environments,” Signal Processing, Vol. 81, No. 11, pp. 2403-2418, 2001. The fourth thesis is Y. Ephraim and D. Malah, “Speech enhancement using minimum mean-square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-32, No. 6, pp. 1109-1121, 1984. The fifth thesis is Y. Ephraim and D. Malah, “Speech enhancement using minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-33, No. 2, pp. 443-445, 1985. As such, the first thesis proposes a higher-performance speech recognition pre-processing technology in which separation performance is improved and, at the same time, reverb (echo) is removed, so that the clarity of a speaker's voice signal is increased as compared to the conventional art.
ICA is broadly classified into Second Order ICA (SO-ICA) and Higher Order ICA (HO-ICA). In the GSS proposed in the first thesis, SO-ICA is applied, and a separation filter is initialized using filter coefficients beamformed to the position of each sound source so that separation performance can be optimized.
Specifically, according to the first thesis, the probability of speaker presence (called the speech presence probability) is applied to a sound source signal separated by GSS so as to perform noise estimation. The probability of speaker presence is then re-estimated from the estimated noise so as to calculate a gain. Finally, the calculated gain is applied to the signal separated by GSS, so that a clear speaker voice can be separated from a microphone signal in which other interference, peripheral noise and reverb are mixed.
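The role of the speech presence probability in noise estimation and gain calculation can be sketched as follows. This is a simplified illustration under stated assumptions and not the procedure of the first thesis: the probability is taken as given rather than estimated, a Wiener-style gain stands in for the spectral amplitude estimators of the fourth and fifth theses, and the function names, the smoothing factor `alpha` and the gain floor `g_min` are hypothetical.

```python
import numpy as np

def update_noise(noise_psd, frame_psd, p_speech, alpha=0.9):
    """Recursive noise update that is frozen where speech is likely present."""
    alpha_eff = alpha + (1.0 - alpha) * p_speech
    return alpha_eff * noise_psd + (1.0 - alpha_eff) * frame_psd

def spectral_gain(frame_psd, noise_psd, p_speech, g_min=0.1):
    """Blend a Wiener-style gain (speech present) with a floor (speech absent)."""
    # Crude SNR estimate from the ratio of frame power to noise power.
    snr = np.maximum(frame_psd / np.maximum(noise_psd, 1e-12) - 1.0, 0.0)
    g_speech = np.maximum(snr / (1.0 + snr), g_min)
    # Geometric weighting of the two gains by the speech presence probability.
    return g_speech ** p_speech * g_min ** (1.0 - p_speech)

# Three frequency bins of one frame (illustrative numbers only).
noise_psd = np.array([1.0, 1.0, 1.0])
frame_psd = np.array([1.0, 10.0, 1.2])
p_speech = np.array([0.0, 1.0, 0.0])   # assumed known for this sketch

noise_new = update_noise(noise_psd, frame_psd, p_speech)
gain = spectral_gain(frame_psd, noise_psd, p_speech)
enhanced_psd = gain ** 2 * frame_psd

print("updated noise PSD:", noise_new)
print("gain per bin:     ", gain)
```

In the bin where speech is assumed present, the noise estimate is left unchanged and the gain stays close to one; in the other bins the noise estimate tracks the frame power and the gain falls to the floor.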
However, according to the sound source separation technology proposed in the first thesis, when a speaker's voice is separated from peripheral noise and reverb in a multi-channel sound signal, the same speech presence probability value is used to perform both noise estimation and gain calculation, and the speech presence probability is additionally calculated during noise estimation and gain calculation. As a result, a large number of calculations and serious sound quality distortion unavoidably occur.