Recognition of speech has been dealt with in a number of setups and for a number of different purposes using a variety of approaches and methods. The present application relates to the concept of time-frequency masking, which has been used to separate speech from noise in a mixed auditory environment. A review of this field and its potential for hearing aids is provided by [Wang, 2008].
US 2008/0183471 A1 describes a method of recognizing speech comprising providing a training database of a plurality of stored phonemes and transforming each phoneme into an orthogonal form based on singular value decomposition. A received audio speech signal is divided into individual phonemes and transformed into an orthogonal form based on singular value decomposition. The received transformed phonemes are compared to the stored transformed phonemes to determine which of the stored phonemes most closely correspond to the received phonemes.
[Srinivasan et al., 2005] describes a model for phonemic restoration. The input to the model is masked utterances with words containing masked phonemes, the maskers used being e.g. broadband sound sources. The masked phonemes are converted to a spectrogram and a binary mask of the spectrogram to identify reliable (i.e. the time-frequency unit containing predominantly speech energy) and unreliable (otherwise) parts is generated. The binary mask is used to partition the spectrogram into its clean and noisy parts. The recognition is based on word-level templates and Hidden Markov model (HMM) calculations.