1. Field of the Invention
The present invention relates to a beamforming apparatus and a beamforming method, and more particularly to an apparatus and a method for performing beamforming on an input signal in consideration of the characteristics of an actual noise environment.
2. Description of the Related Art
In general, a microphone is a transducer that converts acoustic signals conveyed through air vibration into electrical signals. With the recent development of robot control technologies, microphones have come to be used as a robot audio interface, i.e. a means for free communication between a robot and a user. The robot converts speech signals, which are input through a microphone used as a robot audio interface, into electrical signals and analyzes the converted data, thereby recognizing the user's speech. In addition to robots, speech recognition apparatuses that provide a speech recognition service through a built-in microphone have been increasingly developed.
When such a speech recognition apparatus receives specific speech signals, it can keep out noise occurring in the surrounding environment if its microphone is directed toward the direction from which the speech signals are input. A single microphone having a high directivity can provide such directivity. However, when a microphone array is formed by arranging a number of microphones instead of one microphone, the directivity characteristic can be shaped freely to suit the user's purpose. Therefore, it is common for a speech recognition apparatus to be equipped with a microphone array as its audio interface.
Meanwhile, when a software process is performed to eliminate noise from speech signals input through a microphone array, that process forms beams from the microphone array toward a specific direction. Beamforming technology is used to form such beams and thereby give the microphone array a high directivity in a desired direction.
If a high directivity is formed toward the direction from which a user's speech is input through the above-described beamforming, speech signals input from outside the beams are automatically attenuated. Therefore, it is possible to selectively acquire speech signals input from the direction of interest. The microphone array can suppress surrounding noise, such as noise from an indoor computer fan or television sounds, as well as part of the reverberation reflected from objects such as furniture and walls. That is, by using the beamforming technology, the microphone array can acquire a higher Signal-to-Noise Ratio (SNR) for speech signals arriving from the direction of interest. The beamforming thus points beams at a sound source and plays an important role in spatial filtering, which suppresses signals input from all other directions.
A beamformer that performs beamforming on such input signals is effective when its performance is consistent across the entire frequency range. A beamformer using a Minimum Variance Distortionless Response (MVDR) algorithm is generally used in a noise environment having stationary characteristics.
A construction by which a beamformer using an MVDR algorithm performs a beamforming operation and outputs a noise-eliminated signal will be described with reference to FIG. 1.
First, when time-domain speech signals input through the microphone array 100 are transformed into frequency-domain signals and the resultant signals are input to the beamforming unit 110, the beamforming unit 110 can derive output values using Equation (1) below.
$$Y(\omega) = \sum_{i=0}^{N-1} X_i(\omega)\, W_i(\omega) \qquad (1)$$
In Equation (1), N denotes the number of microphones constituting the microphone array 100, and X_i(ω) represents the frequency-domain input signal from the i-th of the N microphones. The filter factor W_i(ω) of Equation (1) is determined by the model used to define the noise environment.
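As an illustration of Equation (1), the following minimal Python sketch forms the beamformer output for a single frequency bin. The function name `beamformer_output`, the sample spectra, and the equal weights are assumptions made for this example only; they are not values from the apparatus of FIG. 1.

```python
def beamformer_output(x_bins, w_bins):
    """Equation (1) for one frequency bin: Y = sum_i X_i * W_i."""
    if len(x_bins) != len(w_bins):
        raise ValueError("need one filter factor per microphone")
    return sum(x * w for x, w in zip(x_bins, w_bins))


# Hypothetical single-bin spectra from N = 4 microphones (made-up values).
X = [1 + 1j, 1 - 1j, -1 + 0.5j, 0.5 + 0j]
# Equal weights 1/N reduce Equation (1) to a plain average of the channels.
W = [0.25] * 4
Y = beamformer_output(X, W)
```

In a full system the same sum would be evaluated for every frequency bin of every frame, with a separate set of filter factors per bin.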
The MVDR algorithm based on a minimum variance solution is widely used as an algorithm for performing beamforming so as to suppress noise from all directions except for a desired direction of input signals in the microphone array 100.
A filter factor value ‘W’ for performing beamforming through such an MVDR algorithm is defined by Equation (2) below.
$$W = \frac{\Gamma^{-1} d}{d^{H}\, \Gamma^{-1} d} \qquad (2)$$
In Equation (2), d is the steering vector, which determines the direction so that the microphone array 100 is oriented toward a sound source. In a Uniform Linear Array (ULA) of microphones arranged with the same distance between adjacent microphones, d can be expressed as defined by Equation (3) below.
$$d = [\, d_1 \;\; d_2 \;\; \cdots \;\; d_N \,]^{T} \qquad (3)$$
In Equations (2) and (3),
$$d_n = \exp\left(-j\, \frac{\omega d}{c}\, (n-1) \cos\theta\right),$$
c represents the speed of sound, n represents the serial number of the corresponding microphone, d inside the exponent represents the distance between adjacent microphones, and θ represents the angle of incidence of the speech signals with respect to the array. Γ represents a coherence matrix, which can be expressed by Equation (4) below.
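$$\Gamma = \begin{pmatrix} 1 & \Gamma_{X_0 X_1} & \cdots & \Gamma_{X_0 X_{N-1}} \\ \Gamma_{X_1 X_0} & 1 & \cdots & \Gamma_{X_1 X_{N-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \Gamma_{X_{N-1} X_0} & \Gamma_{X_{N-1} X_1} & \cdots & 1 \end{pmatrix} \qquad (4)$$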
In Equation (4), each off-diagonal component of the coherence matrix is the coherence between a pair of input signals; for the pair X0 and X1, for example, it can be defined by Equation (5) below. Herein, Φ represents the Power Spectral Density (PSD) between two input noise signals.
$$\Gamma_{X_0 X_1}(\omega) = \frac{\Phi_{X_0 X_1}(\omega)}{\sqrt{\Phi_{X_0 X_0}(\omega)\, \Phi_{X_1 X_1}(\omega)}} \qquad (5)$$
That is, the performance of the beamforming unit 110 is determined only by the spatial characteristics of the input signals. Therefore, if the coherence of a noise environment is well defined, it is possible to effectively improve the performance of the beamforming unit 110.
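Equations (2) and (3) can be sketched for a single frequency bin as follows. This is a minimal illustration rather than the implementation of the beamforming unit 110: the helper names (`steering_vector`, `solve`, `mvdr_weights`), the speed of sound of 343 m/s, and the 4-microphone, 5 cm, 1 kHz setup are all assumptions made for the example. The coherence matrix is passed in as a plain list of rows, and an identity matrix (spatially uncorrelated noise) stands in for a real one.

```python
import cmath
import math


def steering_vector(num_mics, spacing_m, omega, theta_rad, c=343.0):
    """Equation (3): d_n = exp(-j * omega * d / c * (n - 1) * cos(theta))."""
    return [cmath.exp(-1j * omega * spacing_m / c * n * math.cos(theta_rad))
            for n in range(num_mics)]  # n here equals (n - 1) in the text


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def mvdr_weights(gamma, d):
    """Equation (2): W = Gamma^-1 d / (d^H Gamma^-1 d)."""
    g_inv_d = solve(gamma, d)
    denom = sum(di.conjugate() * gi for di, gi in zip(d, g_inv_d))
    return [gi / denom for gi in g_inv_d]


# Hypothetical setup: 4 microphones, 5 cm spacing, 1 kHz, broadside source.
N = 4
d = steering_vector(N, 0.05, 2 * math.pi * 1000.0, math.pi / 2)
identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
W = mvdr_weights(identity, d)
```

With an identity coherence matrix, Equation (2) reduces to W = d / N. For any invertible Γ the weights satisfy d^H W = 1, which follows directly from the form of Equation (2).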
Generally, in an indoor noise environment, signals are reflected and diffused by obstacles such as walls and furniture. Therefore, noise signals input to the microphone from all directions are regarded as having constant power; such an environment is called a diffuse environment. If dij represents the distance between a microphone i and a microphone j, the coherence in an ideal diffuse environment can be defined by using a sinc function as shown in Equation (6) below. A beamformer to which coherences calculated with this sinc function are applied is called a super-directive beamformer.
$$\Gamma_{X_i X_j}(\omega) = \operatorname{sinc}\left(\frac{\omega\, d_{ij}}{c}\right) \qquad (6)$$
As such, a conventional beamformer calculates coherences from the above-described Equation (6), whose sinc function is fixed regardless of data reflecting the actual noise. The beamformer is then constructed with the calculated coherences and applied to noise filtering.
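The fixed coherence model of Equation (6) can be sketched as below. The sketch assumes the unnormalized convention sinc(x) = sin(x)/x, which some numerical libraries do not use by default, and a speed of sound of 343 m/s. The function names and the array geometry are hypothetical.

```python
import math


def sinc(x):
    """Unnormalized sinc, sin(x) / x, with the limit value 1 at x = 0."""
    return 1.0 if x == 0.0 else math.sin(x) / x


def diffuse_coherence_matrix(positions_m, omega, c=343.0):
    """Equation (6): Gamma_ij = sinc(omega * d_ij / c), d_ij = |p_i - p_j|."""
    return [[sinc(omega * abs(p_i - p_j) / c) for p_j in positions_m]
            for p_i in positions_m]


# Hypothetical uniform linear array: 4 microphones spaced 5 cm apart.
positions = [0.0, 0.05, 0.10, 0.15]
gamma_low = diffuse_coherence_matrix(positions, 2 * math.pi * 100.0)
gamma_high = diffuse_coherence_matrix(positions, 2 * math.pi * 4000.0)
```

The matrix depends only on the array geometry and the frequency, never on recorded data. At 100 Hz the entries for adjacent microphones are close to 1, while at 4 kHz they are much smaller in magnitude.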
As described above, since an indoor environment, such as a house or an office, is reverberant, the environment can be assumed to be a diffuse environment. However, the actual coherence changes significantly according to the noise environment, as shown in FIG. 2, so that there is a large difference between the actual coherence and a fixed sinc function. Referring to FIG. 2, an error corresponding to the hatched area occurs between the sinc function and an actual coherence measured by a microphone.
If a speech recognition apparatus is placed in an ideal diffuse environment and speech signals are input to the speech recognition apparatus in such an environment, the coherence between two input signals in the low-frequency range should be approximately 1. In practice, however, the coherence takes different values depending on the positions and spacing at which the microphones are arranged. Even if the same kind of microphone is used, each microphone has a different gain. In addition, an actually measured coherence often deviates from the ideal value because the microphone itself generates noise.
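To make the effect of microphone self-noise concrete, the following sketch estimates the coherence of Equation (5) at one frequency bin by averaging over frames. The frame-averaging estimator, the function names, and all sample values are assumptions chosen for illustration; they are not measurements.

```python
import math


def cross_psd(x_frames, y_frames):
    """Average X_k * conj(Y_k) over frames: a simple estimate of Phi_XY."""
    total = sum(x * y.conjugate() for x, y in zip(x_frames, y_frames))
    return total / len(x_frames)


def coherence(x_frames, y_frames):
    """Equation (5): Phi_XY / sqrt(Phi_XX * Phi_YY) for one frequency bin."""
    p_xy = cross_psd(x_frames, y_frames)
    p_xx = cross_psd(x_frames, x_frames).real
    p_yy = cross_psd(y_frames, y_frames).real
    return p_xy / math.sqrt(p_xx * p_yy)


# Made-up values of one frequency bin over four noise-only frames.
mic0 = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
# Hypothetical self-noise added by the second microphone.
self_noise = [1, -1, 1, -1]
mic1_noisy = [x + n for x, n in zip(mic0, self_noise)]

ideal = coherence(mic0, mic0)        # identical signals
noisy = coherence(mic0, mic1_noisy)  # same field plus self-noise
```

In this toy data set the identical signals give a coherence of 1, while the added self-noise, which is uncorrelated with the common signal, lowers the magnitude to about 0.71.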
However, the coherence used in a current beamformer is calculated by using only a fixed sinc function, regardless of the actual noise environment, as shown in Equation (6). Therefore, as shown in FIG. 2, an error corresponding to the hatched area occurs between the coherence given by the sinc function and the coherence that reflects the actual noise environment. Accordingly, if the beamforming unit 110 is implemented by simply applying only a sinc function, it is difficult to acquire optimal performance.