Model-based speech enhancement methods, such as vector-Taylor series (VTS)-based methods, use statistical models of both speech and noise to produce estimates of enhanced speech from a noisy signal. In model-based methods, the enhanced speech is typically estimated directly by determining its expected value according to the model, given the noise.
Direct Vector-Taylor Series-Based Methods
In high-resolution noise compensation techniques, the mixed speech and noise signals are modeled by Gaussian distributions or Gaussian mixture models in the short-time log-spectral domain, rather than in a feature domain having a reduced spectral resolution, such as the mel spectrum typically used for speech recognition. This is done, along with using the appropriate complementary analysis and synthesis windows, for the sake of perfect reconstruction of the signal from the spectrum, which is impossible in a reduced feature set.
Here, the short-time speech log spectrum $x_t$ at frame $t$ is conditioned on a discrete state $s_t$. The noise is quasi-stationary, hence only a single Gaussian distribution is used for the noise log spectrum $n_t$:
$$p(x_t, s_t) = p(s_t)\,\mathcal{N}(x_t \mid \mu_{x|s_t}, \Sigma_{x|s_t}), \qquad p(n_t) = \mathcal{N}(n_t \mid \mu_n, \Sigma_n), \tag{1}$$
where $\mathcal{N}(\cdot \mid \mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and variance $\Sigma$.
The log-sum approximation uses the logarithm of the expected value, with respect to the phase, in the power domain to define an interaction distribution over the observed noisy spectrum $y_{f,t}$ in frequency $f$ and frame $t$:
$$p(y_{f,t} \mid x_{f,t}, n_{f,t}) \overset{\text{def}}{=} \mathcal{N}\!\left(y_{f,t} \mid \log\!\left(e^{x_{f,t}} + e^{n_{f,t}}\right), \psi_f\right), \tag{2}$$
where $\Psi = (\psi_f)_f$ is a variance intended to handle the effects of phase.
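As a rough illustration of Eqn. (2), the log-sum interaction can be evaluated without overflow using NumPy's `logaddexp`. The function names `interaction` and `log_interaction_density`, and the treatment of the per-bin variances as a vector `psi`, are assumptions made for this sketch rather than part of the described method.

```python
import numpy as np

def interaction(x, n):
    """Element-wise log(e^x + e^n), computed in an overflow-safe way."""
    return np.logaddexp(x, n)

def log_interaction_density(y, x, n, psi):
    """Log of Eqn. (2), summed over frequency bins.

    y, x, n: log spectra of one frame, shape (F,)
    psi:     per-bin variances, shape (F,)
    """
    resid = y - interaction(x, n)
    return -0.5 * np.sum(np.log(2.0 * np.pi * psi) + resid ** 2 / psi)

# Illustrative values only: three frequency bins.
x = np.array([0.0, 5.0, -3.0])   # speech log spectrum
n = np.array([0.0, -5.0, 2.0])   # noise log spectrum
y = interaction(x, n)            # noisy log spectrum predicted by the log-sum model
```

In each bin the predicted noisy log spectrum is at least as large as the larger of the speech and noise values, and is close to that value when one source dominates.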
Performing inference in this model requires evaluating the following likelihood and posterior integrals:
$$p(y_t \mid s_t) = \int p(y_t \mid x_t, n_t)\, p(n_t)\, p(x_t \mid s_t)\, dx_t\, dn_t, \tag{3}$$
$$E(x_t \mid y_t, s_t) = \int x_t\, p(x_t, n_t \mid y_t, s_t)\, dx_t\, dn_t \tag{4}$$
$$= \int x_t\, \frac{p(y_t \mid x_t, n_t)\, p(n_t)\, p(x_t \mid s_t)}{p(y_t \mid s_t)}\, dx_t\, dn_t. \tag{5}$$
These integrals are intractable due to the nonlinear interaction function in Eqn. (2). In iterative VTS, this limitation is overcome by linearizing the interaction function at the current posterior mean, and then iteratively refining the posterior distribution.
In the following, the variable t is omitted for clarity. To simplify the notation, x and n can be concatenated to form a joint vector z=[x;n], where “;” indicates a vertical concatenation. The prior probability is defined as
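$$p(z \mid s) = \mathcal{N}(z \mid \mu_{z|s}, \Sigma_{z|s}), \quad \text{where} \quad \mu_{z|s} = \begin{bmatrix} \mu_{x|s} \\ \mu_n \end{bmatrix}, \quad \Sigma_{z|s} = \begin{bmatrix} \Sigma_{x|s} & 0 \\ 0 & \Sigma_n \end{bmatrix}. \tag{6}$$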
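The joint prior of Eqn. (6) can be assembled by stacking the means and placing the two covariances on a block diagonal. The following is a minimal sketch; the helper name `joint_prior` and the numerical values are hypothetical.

```python
import numpy as np

def joint_prior(mu_x_s, Sigma_x_s, mu_n, Sigma_n):
    """Build the mean and covariance of z = [x; n] for one state s."""
    F = mu_x_s.shape[0]
    mu_z = np.concatenate([mu_x_s, mu_n])
    zeros = np.zeros((F, F))   # zero off-diagonal blocks, as in Eqn. (6)
    Sigma_z = np.block([[Sigma_x_s, zeros],
                        [zeros, Sigma_n]])
    return mu_z, Sigma_z

# Illustrative two-bin example (values are made up).
mu_z, Sigma_z = joint_prior(np.array([1.0, 2.0]), np.diag([0.5, 0.5]),
                            np.array([-1.0, 0.0]), np.diag([0.1, 0.2]))
```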
The interaction function is defined as $g(z) = \log(e^{x} + e^{n})$, where the log and exponents operate element-wise on $x$ and $n$.
The interaction function is linearized at $\tilde{z}_s$, for each state $s$, yielding:
$$p_{\text{linear}}(y \mid z; \tilde{z}_s) = \mathcal{N}\!\left(y;\, g(\tilde{z}_s) + J_g(\tilde{z}_s)(z - \tilde{z}_s),\, \Psi\right), \tag{7}$$
where $J_g(\tilde{z}_s)$ is the Jacobian matrix of $g$, evaluated at $\tilde{z}_s$:
$$J_g(\tilde{z}_s) = \left.\frac{\partial g}{\partial z}\right|_{\tilde{z}_s} = \left[\operatorname{diag}\!\left(\frac{1}{1 + e^{\tilde{n}_s - \tilde{x}_s}}\right) \;\; \operatorname{diag}\!\left(\frac{1}{1 + e^{\tilde{x}_s - \tilde{n}_s}}\right)\right]. \tag{8}$$
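The closed-form Jacobian of Eqn. (8) can be sanity-checked against central finite differences. This is only a sketch: the function names and the test point are assumptions, and `z` is taken to hold the speech values in its first half and the noise values in its second half.

```python
import numpy as np

def g(z):
    """Interaction function g(z) = log(e^x + e^n) for z = [x; n]."""
    F = z.size // 2
    return np.logaddexp(z[:F], z[F:])

def jacobian_g(z):
    """Closed-form Jacobian of Eqn. (8), shape (F, 2F)."""
    F = z.size // 2
    x, n = z[:F], z[F:]
    dg_dx = 1.0 / (1.0 + np.exp(n - x))
    dg_dn = 1.0 / (1.0 + np.exp(x - n))
    return np.hstack([np.diag(dg_dx), np.diag(dg_dn)])

def numerical_jacobian(f, z, eps=1e-6):
    """Central-difference approximation of the Jacobian of f at z."""
    cols = []
    for i in range(z.size):
        d = np.zeros_like(z)
        d[i] = eps
        cols.append((f(z + d) - f(z - d)) / (2.0 * eps))
    return np.stack(cols, axis=1)

z0 = np.array([0.3, -1.2, 2.0, 0.5, 0.1, -0.7])   # arbitrary expansion point
max_abs_error = np.max(np.abs(jacobian_g(z0) - numerical_jacobian(g, z0)))
```

The two diagonal blocks sum to the identity, since the partial derivatives with respect to speech and noise in each bin are complementary weights.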
The likelihood is
$$p(y \mid s; \tilde{z}_s) = \mathcal{N}\!\left(y;\, \mu_{y|s;\tilde{z}_s},\, \Sigma_{y|s;\tilde{z}_s}\right), \quad \text{where} \tag{9}$$
$$\mu_{y|s;\tilde{z}_s} = g(\tilde{z}_s) + J_g(\tilde{z}_s)\left(\mu_{z|s} - \tilde{z}_s\right), \qquad \Sigma_{y|s;\tilde{z}_s} = \Psi + J_g(\tilde{z}_s)\, \Sigma_{z|s}\, J_g(\tilde{z}_s)^{\top}. \tag{10}$$
The posterior state probabilities are
$$p\!\left(s \mid y; (\tilde{z}_{s'})_{s'}\right) = \frac{p(y \mid s; \tilde{z}_s)}{\sum_{s'} p(y \mid s'; \tilde{z}_{s'})}. \tag{11}$$
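In practice the per-state likelihoods in Eqn. (11) can be very small, so one plausible approach is to normalize them in the log domain. The sketch below assumes the per-state log-likelihoods have already been computed; `gaussian_logpdf` and `state_posteriors` are hypothetical helper names.

```python
import numpy as np

def gaussian_logpdf(y, mean, cov):
    """Log density of a multivariate Gaussian with full covariance."""
    d = y - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = d @ np.linalg.solve(cov, d)
    return -0.5 * (y.size * np.log(2.0 * np.pi) + logdet + maha)

def state_posteriors(log_liks):
    """Normalize per-state log-likelihoods as in Eqn. (11)."""
    log_liks = np.asarray(log_liks, dtype=float)
    shifted = log_liks - log_liks.max()   # largest term becomes exp(0) = 1
    weights = np.exp(shifted)
    return weights / weights.sum()

# Illustrative log-likelihoods for three states (values are made up).
post = state_posteriors([-1000.0, -1001.0, -1005.0])
```

Subtracting the maximum before exponentiating leaves the ratios in Eqn. (11) unchanged while avoiding underflow to zero in every term.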
The posterior mean and covariance of the speech and noise are
$$\mu_{z|y,s;\tilde{z}_s} = \mu_{z|s} + \Sigma_{z|s}\, J_g(\tilde{z}_s)^{\top}\, \Sigma_{y|s;\tilde{z}_s}^{-1} \left(y - g(\tilde{z}_s) - J_g(\tilde{z}_s)(\mu_{z|s} - \tilde{z}_s)\right),$$
$$\Sigma_{z|y,s;\tilde{z}_s} = \left[\Sigma_{z|s}^{-1} + J_g(\tilde{z}_s)^{\top}\, \Psi^{-1}\, J_g(\tilde{z}_s)\right]^{-1}. \tag{12}$$
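The posterior update of Eqn. (12) for a single state can be sketched as follows, taking the linearization quantities as inputs. The function names and argument layout are assumptions for illustration; `g_tilde` stands for $g(\tilde{z}_s)$ and `J` for $J_g(\tilde{z}_s)$. The second helper computes the covariance in the gain form, which should agree with the inverse form of Eqn. (12) by the matrix inversion lemma.

```python
import numpy as np

def vts_posterior(y, mu_z, Sigma_z, z_tilde, g_tilde, J, Psi):
    """Posterior mean and covariance of z for one state, per Eqns. (10) and (12)."""
    Sigma_y = Psi + J @ Sigma_z @ J.T                    # Eqn. (10)
    gain = Sigma_z @ J.T @ np.linalg.inv(Sigma_y)
    resid = y - g_tilde - J @ (mu_z - z_tilde)
    mu_post = mu_z + gain @ resid                        # Eqn. (12), mean
    Sigma_post = np.linalg.inv(
        np.linalg.inv(Sigma_z) + J.T @ np.linalg.inv(Psi) @ J
    )                                                    # Eqn. (12), covariance
    return mu_post, Sigma_post, gain

def gain_form_covariance(Sigma_z, J, gain):
    """Equivalent covariance via the matrix inversion lemma."""
    return Sigma_z - gain @ J @ Sigma_z
```

The gain form inverts only the F-by-F matrix $\Sigma_{y|s;\tilde{z}_s}$, rather than 2F-by-2F matrices, which may be preferable when many states are evaluated per frame.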
Iterative VTS updates the expansion point $\tilde{z}_{s,k}$ in each iteration $k$ as follows.
The expansion point is initialized to the prior mean, $\tilde{z}_{s,1} = \mu_{z|s}$, and is subsequently updated to the posterior mean of the previous iteration, $\tilde{z}_{s,k} = \mu_{z|y,s;\tilde{z}_{s,k-1}}$.
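The update rule above can be sketched for a single state as a short loop. This is a minimal illustration under stated assumptions: the function names, the fixed iteration count `num_iters`, and the absence of a convergence test are choices made here, not details given in the text.

```python
import numpy as np

def g(z):
    """Interaction function g(z) = log(e^x + e^n) for z = [x; n]."""
    F = z.size // 2
    return np.logaddexp(z[:F], z[F:])

def jacobian_g(z):
    """Jacobian of g at z, shape (F, 2F), as in Eqn. (8)."""
    F = z.size // 2
    w = 1.0 / (1.0 + np.exp(z[F:] - z[:F]))   # derivative with respect to x
    return np.hstack([np.diag(w), np.diag(1.0 - w)])

def iterative_vts(y, mu_z, Sigma_z, Psi, num_iters=5):
    """Return the posterior mean of z after num_iters relinearizations."""
    z_tilde = mu_z.copy()                      # initialize at the prior mean
    for _ in range(num_iters):
        J = jacobian_g(z_tilde)
        Sigma_y = Psi + J @ Sigma_z @ J.T      # Eqn. (10)
        resid = y - g(z_tilde) - J @ (mu_z - z_tilde)
        # Posterior mean of Eqn. (12) becomes the next expansion point.
        z_tilde = mu_z + Sigma_z @ J.T @ np.linalg.solve(Sigma_y, resid)
    return z_tilde

# Illustrative two-bin example (values are made up).
mu_z = np.array([2.0, 1.0, 0.0, 0.5])
Sigma_z = np.diag([1.0, 1.0, 0.2, 0.2])
Psi = 0.05 * np.eye(2)
z_hat = iterative_vts(g(mu_z) + 1.0, mu_z, Sigma_z, Psi)
```

If the observation equals the value predicted at the prior mean, the residual is zero and the expansion point stays at the prior mean.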
Although $p(y \mid s; \tilde{z}_{s,k})$ is a Gaussian distribution for a given expansion point, the value of $\tilde{z}_{s,k}$ is the result of iterating and depends on $y$ nonlinearly, so that the overall likelihood is non-Gaussian as a function of $y$. The posterior means of the speech and noise components are sub-vectors of $\mu_{z|y,s;\tilde{z}_s} = [\mu_{x|y,s;\tilde{z}_s}; \mu_{n|y,s;\tilde{z}_s}]$.
The conventional method uses the speech posterior expected value to form a minimum mean-squared error (MMSE) estimate of the log spectrum:
$$\hat{x} = \sum_{s} p\!\left(s \mid y; (\tilde{z}_{s'})_{s'}\right) \mu_{x|y,s;\tilde{z}_s}. \tag{13}$$
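Eqn. (13) is a posterior-weighted average of the per-state speech posterior means. A minimal sketch, assuming the state posteriors are stored as a vector of length S and the per-state means as an S-by-F array (the name `mmse_log_spectrum` and the numbers are hypothetical):

```python
import numpy as np

def mmse_log_spectrum(state_post, mu_x_post):
    """Weighted sum over states of the speech posterior means, Eqn. (13).

    state_post: posterior state probabilities, shape (S,)
    mu_x_post:  per-state speech posterior means, shape (S, F)
    """
    state_post = np.asarray(state_post, dtype=float)
    mu_x_post = np.asarray(mu_x_post, dtype=float)
    return state_post @ mu_x_post

# Two states, two frequency bins (illustrative values).
x_hat = mmse_log_spectrum([0.25, 0.75], [[0.0, 4.0], [4.0, 0.0]])
```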
For each frame $t$, the MMSE speech estimate is combined with the phase $\theta_t$ of the noisy spectrum to produce a complex spectral estimate,
$$\hat{X}_t = e^{\hat{x}_t + i\theta_t}, \tag{14}$$
called the VTS MMSE.
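The combination in Eqn. (14) can be sketched with complex exponentials in NumPy. The function name `vts_mmse_spectrum` and the sample values are assumptions; the exponent follows Eqn. (14) as written, without any additional scaling.

```python
import numpy as np

def vts_mmse_spectrum(x_hat, theta):
    """Complex spectral estimate exp(x_hat + i * theta), element-wise."""
    return np.exp(np.asarray(x_hat, dtype=float) + 1j * np.asarray(theta, dtype=float))

# Illustrative frame with two frequency bins.
x_hat = np.array([0.0, 1.0])          # MMSE log-spectrum estimate
theta = np.array([0.0, np.pi / 2])    # phase of the noisy spectrum
X_hat = vts_mmse_spectrum(x_hat, theta)
```

The magnitude of each bin is the exponential of the estimated log spectrum, and its phase is taken unchanged from the noisy observation.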