The present invention relates to audio signal processing and, in particular, to an apparatus and method for an informed multichannel speech presence probability estimation.
Audio signal processing is becoming increasingly important. In particular, hands-free capture of speech is necessitated in many human-machine interfaces and communication systems. Built-in acoustic sensors usually receive a mixture of desired sounds (e.g., speech) and undesired sounds (e.g., ambient noise, interfering talkers, reverberation, and sensor noise). As the undesired sounds degrade the quality and intelligibility of the desired sounds, the acoustic sensor signals may be processed (e.g., filtered and summed) in order to extract the desired source signal or, stated differently, to reduce the undesired sound signals. To compute such filters, an accurate estimate of the noise power spectral density (PSD) matrix is usually necessitated. In practice, the noise signal is unobservable and its PSD matrix needs to be estimated from the noisy acoustic sensor signals.
Single-channel speech presence probability (SPP) estimators have been used to estimate the noise PSD (see, e.g. [1-5]) and to control the tradeoff between noise reduction and speech distortion (see, e.g. [6, 7]). Multichannel a posteriori SPP has recently been employed to estimate the noise PSD matrix (see, e.g. [8]). In addition, the SPP estimate may be used to mitigate the power consumption of a device.
In the following, the well-established signal model in multichannel speech processing will be considered, where each acoustic sensor of an M-element array captures an additive mixture of a desired signal and undesired signal. The signal received at the m-th acoustic sensor can be described in the time-frequency domain as follows:
$$Y_m(k,n) = X_m(k,n) + V_m(k,n), \tag{1}$$
where $X_m(k, n)$ and $V_m(k, n)$ denote the complex spectral coefficients of the desired source signal and of the noise component at the m-th acoustic sensor, respectively, and n and k are the time and frequency indices, respectively.
The desired signal may, e.g., be spatially coherent across the microphones and the spatial coherence of the noise may, e.g., follow the spatial coherence of an ideal spherically isotropic sound field, see [24].
In other words, e.g., $X_m(k, n)$ may denote the complex spectral coefficients of the desired source signal at the m-th acoustic sensor, $V_m(k, n)$ may denote the complex spectral coefficients of the noise component at the m-th acoustic sensor, n may denote the time index and k may denote the frequency index.
The observed noisy acoustic sensor signals can be written in vector notation as
$$y(k,n) = [\,Y_1(k,n) \;\ldots\; Y_M(k,n)\,]^T \tag{2}$$
and the power spectral density (PSD) matrix of $y(k, n)$ is defined as
$$\Phi_{yy}(k,n) = E\{y(k,n)\,y^H(k,n)\}, \tag{3}$$
where the superscript H denotes the conjugate transpose of a matrix. The vectors $x(k, n)$ and $v(k, n)$ and the matrices $\Phi_{xx}(k, n)$ and $\Phi_{vv}(k, n)$ are defined similarly. The desired and the undesired signals are assumed uncorrelated and zero mean, such that formula (3) can be written as
$$\Phi_{yy}(k,n) = \Phi_{xx}(k,n) + \Phi_{vv}(k,n). \tag{4}$$
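By way of a non-limiting illustration, formulas (3) and (4) may be checked numerically. The following Python sketch assumes that the expectation in formula (3) is replaced by a sample average over synthetic snapshots. The helper names (complex_gaussian, sample_psd), the array size, the propagation vector and the variances are hypothetical and are not taken from the cited references.

```python
import numpy as np


def complex_gaussian(rng, shape, variance):
    """Zero-mean circular complex Gaussian samples with the given variance."""
    scale = np.sqrt(variance / 2.0)
    return scale * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))


def sample_psd(snapshots):
    """Sample average of z z^H over snapshots stored row-wise, shape (N, M)."""
    return snapshots.T @ snapshots.conj() / snapshots.shape[0]


rng = np.random.default_rng(0)
N, M = 20000, 3  # number of snapshots and of acoustic sensors (assumed values)

# Hypothetical propagation vector of a spatially coherent desired source.
gamma = np.exp(-1j * np.pi * 0.3 * np.arange(M))

s = complex_gaussian(rng, N, variance=2.0)       # desired source coefficients
x = s[:, None] * gamma[None, :]                  # desired signal at the sensors
v = complex_gaussian(rng, (N, M), variance=0.5)  # uncorrelated noise
y = x + v                                        # formula (1) for all sensors

phi_xx = sample_psd(x)
phi_vv = sample_psd(v)
phi_yy = sample_psd(y)

# Relative deviation between the two sides of formula (4).
rel_err = np.linalg.norm(phi_yy - (phi_xx + phi_vv)) / np.linalg.norm(phi_yy)
print(rel_err)
```

Because the cross terms between the desired and the undesired signal only average out over many snapshots, formula (4) holds approximately for a finite sample and exactly only in expectation.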
The following standard hypotheses are introduced regarding the presence of a desired signal (e.g., a speech signal) in a given time-frequency bin:
$H_0(k, n)$: $y(k, n) = v(k, n)$, indicating speech absence, and
$H_1(k, n)$: $y(k, n) = x(k, n) + v(k, n)$, indicating speech presence.
It may, e.g., be desirable to estimate the conditional a posteriori SPP, i.e., $p[H_1(k, n) \mid y(k, n)]$.
Assuming that one takes the i-th microphone of the array as a reference, it may, e.g., be desirable to estimate the desired signal $X_i(k, n)$.
Under the assumption that the desired and undesired components can be modelled as complex multivariate Gaussian random variables, the multichannel SPP estimate is given by (see [9]):
$$p[H_1(k,n) \mid y(k,n)] = \left\{ 1 + \frac{q(k,n)}{1-q(k,n)} \left[ 1+\xi(k,n) \right] e^{-\frac{\beta(k,n)}{1+\xi(k,n)}} \right\}^{-1} \tag{5}$$
where $q(k, n) = p[H_1(k, n)]$ denotes the a priori speech presence probability (SPP), and
$$\xi(k,n) = \mathrm{tr}\{\Phi_{vv}^{-1}(k,n)\,\Phi_{xx}(k,n)\}, \tag{6}$$
$$\beta(k,n) = y^H(k,n)\,\Phi_{vv}^{-1}(k,n)\,\Phi_{xx}(k,n)\,\Phi_{vv}^{-1}(k,n)\,y(k,n), \tag{7}$$
where tr{•} denotes the trace operator. Alternative estimators assuming another type of distribution (e.g., a Laplacian distribution) may also be derived and used.
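As a minimal sketch, formulas (5) to (7) may be evaluated for a single time-frequency bin as follows, assuming that the PSD matrices of the desired signal and of the noise are already known and that the noise PSD matrix is Hermitian and invertible. The function name multichannel_spp and the argument prior_ratio are hypothetical. The argument prior_ratio stands for the factor q(k, n)/[1−q(k, n)] appearing in formula (5) and is passed as a single number.

```python
import numpy as np


def multichannel_spp(y, phi_xx, phi_vv, prior_ratio):
    """Evaluate formulas (6), (7) and (5) for one time-frequency bin.

    y           : observation vector, shape (M,)
    phi_xx      : PSD matrix of the desired signal, shape (M, M)
    phi_vv      : PSD matrix of the noise, shape (M, M), Hermitian, invertible
    prior_ratio : the factor q / (1 - q) of formula (5), given as one number
    """
    # Formula (6): xi = tr{ phi_vv^{-1} phi_xx }, using a linear solve
    # instead of an explicit matrix inverse.
    xi = np.trace(np.linalg.solve(phi_vv, phi_xx)).real

    # Formula (7): beta = y^H phi_vv^{-1} phi_xx phi_vv^{-1} y.
    # With u = phi_vv^{-1} y and Hermitian phi_vv, this equals u^H phi_xx u.
    u = np.linalg.solve(phi_vv, y)
    beta = (u.conj() @ phi_xx @ u).real

    # Formula (5).
    spp = 1.0 / (1.0 + prior_ratio * (1.0 + xi) * np.exp(-beta / (1.0 + xi)))
    return xi, beta, spp


# Hypothetical two-sensor example with spatially white noise.
phi_vv = np.eye(2, dtype=complex)
gamma = np.array([1.0, 1.0], dtype=complex)
phi_xx = 1.0 * np.outer(gamma, gamma.conj())

y_aligned = np.array([1.0, 1.0], dtype=complex)      # along gamma
y_orthogonal = np.array([1.0, -1.0], dtype=complex)  # orthogonal to gamma

xi_a, beta_a, spp_a = multichannel_spp(y_aligned, phi_xx, phi_vv, 1.0)
xi_o, beta_o, spp_o = multichannel_spp(y_orthogonal, phi_xx, phi_vv, 1.0)
print(spp_a, spp_o)
```

In this example, the observation that is aligned with the assumed propagation vector yields a larger value of formula (5) than the observation that is orthogonal to it, for which β vanishes.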
Only under the assumption that the desired signal PSD matrix is of rank one [e.g., $\Phi_{xx}(k, n) = \phi_{x_i x_i}(k, n)\,\gamma_i(k, n)\,\gamma_i^H(k, n)$, where $\phi_{x_i x_i}(k, n) = E\{|X_i(k, n)|^2\}$ and $\gamma_i(k, n)$ denotes a column vector of length M], the multichannel SPP can be obtained by applying a single-channel SPP estimator to the output of a minimum variance distortionless response (MVDR) beamformer.
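The relation to the MVDR beamformer may be illustrated numerically. The sketch below assumes a rank-one PSD matrix of the desired signal, takes the first microphone as the reference (so that the first element of the vector $\gamma_i$ equals one), and uses randomly drawn, hypothetical values for all quantities. It compares ξ and β from formulas (6) and (7) with the a priori and a posteriori SNR measured at the output of an MVDR beamformer with weights $w = \Phi_{vv}^{-1}\gamma_i / (\gamma_i^H \Phi_{vv}^{-1} \gamma_i)$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4  # number of acoustic sensors (assumed)

# Hypothetical Hermitian, positive definite noise PSD matrix.
a = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
phi_vv = a @ a.conj().T + M * np.eye(M)

# Hypothetical vector gamma_i, normalized to the reference microphone.
gamma = rng.standard_normal(M) + 1j * rng.standard_normal(M)
gamma = gamma / gamma[0]

phi_s = 2.0                                      # PSD of X_i (assumed value)
phi_xx = phi_s * np.outer(gamma, gamma.conj())   # rank-one desired PSD matrix
y = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # one observation

# Multichannel quantities, formulas (6) and (7).
xi = np.trace(np.linalg.solve(phi_vv, phi_xx)).real
u = np.linalg.solve(phi_vv, y)
beta = (u.conj() @ phi_xx @ u).real

# MVDR beamformer and the single-channel quantities at its output.
g = np.linalg.solve(phi_vv, gamma)
w = g / (gamma.conj() @ g).real
z = w.conj() @ y                           # beamformer output w^H y
phi_n = (w.conj() @ phi_vv @ w).real       # residual noise PSD at the output
xi_single = phi_s / phi_n                  # single-channel a priori SNR
post_snr = abs(z) ** 2 / phi_n             # single-channel a posteriori SNR

print(xi, xi_single)
print(beta, xi_single * post_snr)
```

In this example, ξ coincides with the single-channel a priori SNR, and β equals ξ multiplied by the a posteriori SNR at the beamformer output, so that formula (5) depends on the observation only through the beamformer output.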
State-of-the-art approaches either use a fixed a priori SPP [4, 9] or a value that depends on the single-channel or multichannel a priori signal-to-noise ratio (SNR) (see [2, 8, 10]). Cohen et al. [10] use three parameters $P_{\text{local}}(k, n)$, $P_{\text{global}}(k, n)$, and $P_{\text{frame}}(n)$, which are based on the time-frequency distribution of the estimated single-channel a priori SNR, to compute the a priori SPP given by
$$q(k,n) = P_{\text{local}}(k,n)\,P_{\text{global}}(k,n)\,P_{\text{frame}}(n). \tag{8}$$
These parameters exploit the strong correlation of speech presence in neighboring frequency bins of consecutive time frames. In other approaches of the state of the art (see [11]), the parameters are computed in the log energy domain. In further approaches of the state of the art (see [8]), the multichannel a priori SNR was used instead to compute $P_{\text{local}}(k, n)$, $P_{\text{global}}(k, n)$, and $P_{\text{frame}}(n)$.
A major shortcoming of state-of-the-art SPP estimators is that they cannot distinguish between desired and undesired sounds.