Speech-based command interfaces can be used in vehicles. Applications include automatic dialog systems for hands-free phone calls, as well as more advanced features, such as navigation systems.
However, interference, such as speech from the co-driver and rear-seat passengers, and noise, e.g., music or radio, engine noise, and wind noise, can significantly degrade the performance of the automatic speech recognition (ASR) system on which those applications depend. This issue can be addressed with adaptive interference cancellation techniques, such as generalized sidelobe cancellation (GSC).
Beamformers based on GSC are well known. Typically, the beamformer includes a presteering front end, a fixed beamformer (FBF), a blocking matrix (BM), and an adaptive canceller. The presteering front end is composed of various time delays that allow the main lobe of the beamformer to be steered to a selected direction. The FBF is used to enhance a target signal from the selected direction. The BM, composed of adaptive blocking filters (ABF), instead rejects the target signal, so that the output of the BM contains only interference and noise. The adaptive canceller, composed of adaptive canceling filters (ACF), adjusts its weights so that the interference and noise can be subtracted from the FBF output.
However, the conventional adaptive beamformer for GSC, like the simple Griffiths-Jim beamformer (GJBF), see U.S. Patent Applications 20100004929, 20070244698, 20060195324 and 20050049864, suffers from target signal cancellation due to steering-vector errors. The steering-vector errors are caused by errors in microphone positions and microphone gains, and by real-world recording conditions, e.g., reverberation, noise, and a moving target. Indeed, the beamformer is constrained to produce a dominant response toward the selected location of the source of the speech, while minimizing the response in all other directions.
However, in reverberant environments a single direction of arrival cannot be determined because the desired signal and its reflections impinge on the microphone array from several directions. Thus, complete rejection of the target signal is almost impossible in the BM, and a considerable portion of the desired speech is subject to interference cancellation, which results in target signal cancellation.
In addition, the original formulation of the GJBF was based on the general use of beamforming, where the far-field assumption is often valid, such as in radio astronomy or geology.
However, in a vehicle, the microphone array can span about one meter, meaning that the far-field assumption is no longer valid. This change in the physics of the system also causes leakage in the conventional Griffiths-Jim BM because the target signal is no longer received at each microphone with equal amplitude.
Applying the GSC uniformly to an entire utterance, without considering the observed data, is not efficient. It is not necessary to process noise-only and single-speaker segments using the GSC if they can be accurately labeled.
In particular, non-overlapping speech and non-speech occur more often than overlapping speech, and each case needs to be handled differently.
GSC
FIG. 1 shows a conventional GSC, which is a simplification of the well known Frost Algorithm. It is assumed that all input channels 101 have already been appropriately steered toward a point of interest. The GSC includes an upper branch 110, often called the Fixed Beamformer (FBF), and a lower branch 120 including a Blocking Matrix (BM) outputting to normalized least mean square modules 140, whose outputs are also summed 150.
The conventional delay-and-sum beamformer (DSB) used for the FBF sums 130 the observed signals $x_m$ from the microphone array as
$$y^{\mathrm{FBF}}(t) = \frac{1}{M} \sum_{m=1}^{M} x_m(t - \tau_m), \tag{1}$$
where $\tau_m$ is the delay for the m-th microphone, for a given steering direction.
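As an illustration of equation (1), the following is a minimal sketch in Python, assuming the delays are non-negative whole numbers of samples and that samples before the start of a recording are zero. The function name `delay_and_sum` and the two-microphone test signal are hypothetical and are not taken from the cited beamformers.

```python
def delay_and_sum(channels, delays):
    """Equation (1) with integer sample delays: average of x_m(t - tau_m).

    channels: list of M equal-length lists of samples.
    delays:   list of M non-negative integer delays in samples (an assumption;
              fractional delays would need interpolation).
    Samples before the start of a recording are treated as zero.
    """
    num_mics = len(channels)
    num_samples = len(channels[0])
    output = []
    for t in range(num_samples):
        total = 0.0
        for x_m, tau_m in zip(channels, delays):
            if t - tau_m >= 0:
                total += x_m[t - tau_m]
        output.append(total / num_mics)
    return output


# Hypothetical data: the same pulse reaches microphone 0 two samples
# later than microphone 1.
mic0 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
mic1 = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# Delaying microphone 1 by two samples aligns the pulses before averaging.
aligned = delay_and_sum([mic0, mic1], delays=[0, 2])

# Without steering delays the pulses stay apart and each is halved.
unaligned = delay_and_sum([mic0, mic1], delays=[0, 0])
```

With the steering delays applied, the pulse adds coherently and keeps its full amplitude, whereas without them the averaged output contains two half-amplitude pulses.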
The lower branch applies an unconstrained adaptive process to a set of tracks that have passed through the BM, which consists of some process intended to eliminate the target signal from the incoming data in order to form a reference of the noise. The particular Griffiths-Jim BM takes pairwise differences of signals, which can be expressed for a four-microphone instance as
$$W_b = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}. \tag{2}$$
For this $W_b$, the BM output tracks are determined as the matrix product of the BM and the matrix of current input data
$$\mathbf{y}^{\mathrm{BM}}(t) = W_b X(t), \tag{3}$$
where $X(t) = [x_1(t), x_2(t), \ldots, x_M(t)]$. The overall beamformer output $y(t)$ 102 is determined as the DSB signal minus 160 the sum 150 of the adaptively filtered BM tracks
$$y(t) = y^{\mathrm{FBF}}(t) - \sum_{m=1}^{M-1} \sum_{i=-K_L}^{K_R} g_{m,i}(t)\, y_m^{\mathrm{BM}}(t - i). \tag{4}$$
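The blocking step of equations (2) and (3) can be sketched as follows, for a single time instant. The function names and the sample values are hypothetical; the unequal amplitudes in the second frame are an assumed stand-in for a near-field target, not measured data.

```python
def griffiths_jim_blocking_matrix(num_mics):
    """Build the (M-1) x M pairwise-difference matrix W_b of equation (2)."""
    rows = []
    for m in range(num_mics - 1):
        row = [0] * num_mics
        row[m] = 1
        row[m + 1] = -1
        rows.append(row)
    return rows


def apply_blocking_matrix(w_b, frame):
    """Equation (3) for one time instant: y_BM(t) = W_b X(t)."""
    return [sum(w * x for w, x in zip(row, frame)) for row in w_b]


w_b = griffiths_jim_blocking_matrix(4)

# Steered far-field target: the same value on every channel, so every
# pairwise difference is zero and the target is blocked.
blocked = apply_blocking_matrix(w_b, [0.5, 0.5, 0.5, 0.5])

# Hypothetical near-field target: unequal amplitudes across the array,
# so part of the target leaks into the BM output tracks.
leaked = apply_blocking_matrix(w_b, [0.5, 0.4, 0.3, 0.2])
```

The second frame shows why unequal amplitudes matter: the pairwise differences are no longer zero, so the noise reference fed to the adaptive filters contains target signal.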
Define, for $m = 1, \ldots, M-1$,
$$\mathbf{x}_m(t) = (x_m(t + K_L), \ldots, x_m(t), \ldots, x_m(t - K_R)),$$
and
$$\mathbf{g}_m(t) = (g_{m,-K_L}(t), \ldots, g_{m,0}(t), \ldots, g_{m,K_R}(t)),$$
then the adaptive normalized multichannel least mean square (LMS) solution is
$$\mathbf{g}_m(t+1) = \mathbf{g}_m(t) + \frac{\mu}{p_{\mathrm{est}}(t)}\, \mathbf{x}_m(t)\, y(t); \quad m = 1, \ldots, M-1, \tag{5}$$
where
$$p_{\mathrm{est}}(t) = \sum_{m=1}^{M} \left\| \mathbf{x}_m(t) \right\|_2^2. \tag{6}$$
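The adaptive canceller of equations (4) to (6) can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the cited beamformers: the tap vectors are assumed to hold samples of the BM output tracks, as the filtering in equation (4) suggests; the power estimate is summed over the available BM tracks; samples outside the recording are treated as zero; and a small constant `eps` is added to avoid division by zero. The function name, parameter values, and the synthetic two-tone noise are hypothetical.

```python
import math


def nlms_canceller(y_fbf, bm_tracks, k_left=1, k_right=1, mu=0.5, eps=1e-8):
    """Subtract adaptively filtered BM tracks from the FBF output.

    y_fbf:     list of FBF output samples.
    bm_tracks: list of BM output tracks, each as long as y_fbf.
    Returns the beamformer output y(t) and the final filter weights.
    """
    num_samples = len(y_fbf)
    num_taps = k_left + k_right + 1
    weights = [[0.0] * num_taps for _ in bm_tracks]
    output = []

    def taps(track, t):
        # (track(t + K_L), ..., track(t), ..., track(t - K_R)),
        # with zeros outside the recording (an assumption).
        return [track[t - i] if 0 <= t - i < num_samples else 0.0
                for i in range(-k_left, k_right + 1)]

    for t in range(num_samples):
        vectors = [taps(track, t) for track in bm_tracks]
        # Equation (4): FBF output minus the filtered BM tracks.
        estimate = sum(g * v
                       for g_m, v_m in zip(weights, vectors)
                       for g, v in zip(g_m, v_m))
        y_t = y_fbf[t] - estimate
        # Equation (6): total power of the tap vectors (eps is an assumption).
        p_est = sum(v * v for v_m in vectors for v in v_m) + eps
        # Equation (5): normalized LMS weight update.
        for g_m, v_m in zip(weights, vectors):
            for k in range(num_taps):
                g_m[k] += (mu / p_est) * v_m[k] * y_t
        output.append(y_t)
    return output, weights


# Hypothetical noise-only segment: the BM track is a two-tone noise
# reference and the FBF output contains a scaled copy of the same noise.
noise = [math.sin(0.9 * t) + 0.5 * math.sin(2.3 * t) for t in range(2000)]
y_fbf = [0.8 * n for n in noise]
cleaned, weights = nlms_canceller(y_fbf, [noise])
```

In this noise-only toy case the filter should learn a center tap close to the assumed leakage gain of 0.8, so the residual noise in `cleaned` becomes small once the weights have adapted. When target speech leaks into the BM tracks, the same update also removes part of the target, which is the target signal cancellation described earlier.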