It is well known that hearing aid users face problems in understanding speech in noisy conditions.
With reference to FIG. 1, there is schematically illustrated an example of a prior art method for enhancing speech in noisy situations, where the method is based on time-frequency decomposition. The entire frequency range of interest is subdivided into a number of sub-bands, in this example into the sub-bands 2, 3, 4, 5, 6 and 7. In the example shown in FIG. 1, filter bandwidths are increasing with frequency, but this particular choice of bandwidths is irrelevant for the general idea. Prior art methods aim at estimating the highlighted noise-free spectrum 1 at a particular time instant m′ based on the noisy (observable) time-frequency coefficients.
At a given time instant m′, prior art methods aim at decomposing the power spectral density (psd) PX(k,m) of the noisy (observable) signal into a sum of the psd PS(k,m) of the clean signal and the psd PW(k,m) of the noise. Prior art methods use statistical models of the speech signals and of the noise signals. Specifically, each signal time frame is assumed to be a realization of a random vector. The probability density function (pdf) of this vector may be modelled via a statistical model, e.g. using the Generalized Method of Moments (GMM) for estimating parameters, or as exemplified in this disclosure as a dictionary of zero-mean, Gaussian pdfs, i.e., each dictionary element is a covariance matrix (since the mean vector is assumed zero). In practice, the covariance matrices of the clean signal and noise signal may be compactly represented, e.g. by using vectors of linear-prediction coefficients (under the additional assumption that the signals—in addition to being Gaussian—are outputs of an auto-regressive process). Eventually, the linear prediction coefficients may be thought of as a compact representation of the underlying psd of the signals in question. In other words, in this particular special case, the speech and noise models consist of dictionaries of typical speech and noise psd's.
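As an illustration of how a vector of linear-prediction coefficients can serve as a compact representation of a psd, the following minimal Python sketch evaluates the all-pole (auto-regressive) spectrum implied by such a vector. The function name ar_psd, the gain parameter, the sign convention A(z) = 1 + a_1 z^-1 + ... + a_P z^-P and the uniform frequency grid from 0 to pi are assumptions made for this example only; the disclosure does not prescribe them.

```python
import cmath
import math


def ar_psd(lpc, gain, n_bins):
    """Return an all-pole psd on n_bins frequencies evenly spaced in [0, pi].

    Assumed convention (hypothetical, not taken from the disclosure):
    A(z) = 1 + sum_p lpc[p-1] * z**(-p) and psd(w) = gain / |A(e^{jw})|**2.
    """
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    psd = []
    for k in range(n_bins):
        omega = math.pi * k / (n_bins - 1)
        # Evaluate the prediction-error filter A(z) on the unit circle.
        a_of_omega = 1.0 + sum(
            coeff * cmath.exp(-1j * omega * (p + 1))
            for p, coeff in enumerate(lpc)
        )
        psd.append(gain / abs(a_of_omega) ** 2)
    return psd
```

Under these assumptions, a dictionary of typical speech or noise psd's could be stored as one short coefficient vector plus a gain per entry, and expanded to K frequency bins only when a match against PX(k,m) is evaluated.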
The general idea behind these prior art methods is illustrated in FIG. 2 by means of a block diagram.
A noisy microphone signal x(n) picked up by microphone 8 is passed through an analysis filter bank 9 to obtain a time-frequency representation X(k,m), which is enhanced in an enhancement block 10 and transformed back to the time domain via a synthesis filter bank 11. The enhanced output signal ŝ(n) from the synthesis filter bank 11 is provided to a loudspeaker or hearing aid receiver 12. Enhancement is performed in the functional block 13 by finding the (positive) linear combination of the noise-free psd PS(k,m) (from the speech model 14) and the noise psd PW(k,m) (from the noise model 15) that best fits the observable noisy power spectral density PX(k,m), and by basing the enhancement on this linear combination.
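The text above does not state how the enhancement is derived from the fitted linear combination. Purely as an illustration, the sketch below assumes a Wiener-type gain, i.e. the ratio of the scaled speech psd to the fitted noisy psd, applied to the time-frequency coefficients of one frame. The function names wiener_gain and enhance_frame, and the choice of a Wiener-type gain itself, are assumptions for this example and are not taken from the disclosure.

```python
def wiener_gain(p_s, p_w, alpha_s, alpha_w):
    """Per-bin gain alpha_s*P_S / (alpha_s*P_S + alpha_w*P_W).

    Assumed (hypothetical) enhancement rule; bins where the fitted
    noisy psd is zero receive a gain of zero.
    """
    gains = []
    for s, w in zip(p_s, p_w):
        speech = alpha_s * s
        total = speech + alpha_w * w
        gains.append(speech / total if total > 0.0 else 0.0)
    return gains


def enhance_frame(x_frame, gains):
    """Scale the (possibly complex) coefficients X(k, m) of one frame."""
    return [g * x for g, x in zip(gains, x_frame)]
```

With such a rule, bins dominated by the fitted speech psd pass almost unchanged, while bins dominated by the fitted noise psd are attenuated before the synthesis filter bank.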
The statistical speech and noise models may consist of a dictionary of typical speech and noise psd's. However, in more advanced systems, Hidden Markov Models are used, which represent not only typical speech and noise psd's, but also their temporal evolution. The goal of prior art methods is, for a given psd PX(k,m) of the noisy (observable) signal, to find the combination of speech and noise psd's (i.e., elements of the speech and noise statistical models, respectively) that best corresponds to the noisy signal psd. The match between PX(k,m) and a given linear combination of elements of the speech and noise databases may be quantified in different ways, e.g., minimum mean-square error, maximum likelihood, or maximum a posteriori probability. For example, for a maximum likelihood criterion, the optimal speech and noise model psd's, P*S,i*(k,m) and P*W,j*(k,m), respectively, and their corresponding optimal scaling factors α*S and α*W, respectively, are found from the expression:
\[
P^{*}_{W,j^{*}}(k,m),\; P^{*}_{S,i^{*}}(k,m),\; \alpha_{S}^{*},\; \alpha_{W}^{*}
= \underset{P_{W,j}(k,m),\; P_{S,i}(k,m),\; \alpha_{S},\, \alpha_{W} \geq 0}{\arg\max}\; L(\cdot)
\]
where j, i are indices in the noise and speech dictionaries, respectively, and where L(·) denotes the likelihood function. Maximizing the likelihood function could e.g. be achieved by exhaustively searching the speech and noise models, i.e., for each and every combination of entries, PS,i(k,m), PW,j(k,m), k=0, . . . , K−1, of the two models, finding maximum-likelihood estimates of the scaling factors αS, αW, and, finally, for instance selecting the entry combination that leads to the largest likelihood.
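The exhaustive search described above can be sketched in Python as follows. The likelihood function is not specified in the text, so the sketch assumes a Whittle-type log-likelihood for zero-mean Gaussian signals, and it replaces the maximum-likelihood estimation of the scaling factors by a coarse search over a fixed grid of candidate values. The names whittle_log_likelihood, exhaustive_search and alpha_grid are hypothetical, and the dictionary entries are assumed to be strictly positive.

```python
import math


def whittle_log_likelihood(p_x, p_model):
    """Assumed likelihood: -sum_k [log P_model(k) + P_X(k) / P_model(k)]."""
    return -sum(math.log(pm) + px / pm for px, pm in zip(p_x, p_model))


def exhaustive_search(p_x, speech_dict, noise_dict, alpha_grid):
    """Try every (speech entry, noise entry, alpha_S, alpha_W) combination.

    Returns (log_likelihood, i, j, alpha_s, alpha_w) for the best candidate.
    The grid over the scaling factors is a stand-in for their
    maximum-likelihood estimates, not the method of the disclosure.
    """
    best = None
    for i, p_s in enumerate(speech_dict):
        for j, p_w in enumerate(noise_dict):
            for a_s in alpha_grid:
                for a_w in alpha_grid:
                    if a_s == 0 and a_w == 0:
                        continue  # an all-zero model psd has no likelihood
                    model = [a_s * s + a_w * w for s, w in zip(p_s, p_w)]
                    ll = whittle_log_likelihood(p_x, model)
                    if best is None or ll > best[0]:
                        best = (ll, i, j, a_s, a_w)
    return best
```

The cost of this search grows with the product of the two dictionary sizes and the square of the grid size, which is one reason the memory and complexity demands of large noise databases matter in a hearing aid.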
The prior art methods briefly illustrated above may be efficient when the statistical speech and noise models accurately reflect the actual signals observed by the microphones of the system in real-life situations. However, this condition may be difficult to fulfil in practice. In particular, the main drawbacks of these prior art methods include:
D1: Mis-matched statistical signal models: The speech and noise signals used to train the speech and noise statistical models, respectively, must reflect the speech and noise signals recorded by the microphones in real life. However, these measured signals may be distorted, e.g., in terms of spectral tilt, by microphone mis-matches between real-life and off-line training situations, by head-shadowing effects (which are unavoidable in hearing aid applications) that make the measured psd's a function of sound source angle with respect to the hearing aid user, and by other non-additive noise distortions, e.g., due to variable room impulse responses.
D2: They require a relatively elaborate statistical noise model, i.e., the acoustical noise situation, e.g. a car cabin situation, must be well known in advance. This requirement is generally difficult to satisfy in a hearing aid situation. It is, of course, possible to generalize the system such that it consists of a specific noise database for any possible noise situation. This, however, requires an online noise classification algorithm (which is generally error-prone), and a large increase in memory complexity and capacity.
Therefore, there is a need to provide a method and corresponding systems or devices that eliminate or at least reduce the above-mentioned disadvantages.