Speaker and speech recognition systems are becoming increasingly ubiquitous, with applications ranging from access control to automated inquiry systems. The characteristics of a speaker or of speech are represented by feature vectors derived from a speech utterance. Models are then trained on these feature vectors to serve as the template of a speaker or a speech unit. During the recognition phase, feature vectors derived from a test utterance are matched against the speaker or speech unit models, and a match score is computed. The speech utterances used for training a model constitute the set of training speech. Often, the only training speech available is corrupted with additive noise. Such training speech gives rise to noisy and spurious models, which degrade the performance of speaker and speech recognition systems.
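The train-then-match flow described above can be illustrated with a minimal sketch. It is not the method of any particular system: it assumes, purely for illustration, that each model is a single diagonal-covariance Gaussian and that the match score is the average per-frame log-likelihood. The function names `train_model` and `match_score` and the toy two-dimensional vectors are hypothetical stand-ins for real speech features.

```python
import math

def train_model(vectors):
    """Fit one diagonal-covariance Gaussian: per-dimension mean and variance."""
    n = len(vectors)
    dim = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    # Floor the variance so the log-likelihood stays finite (assumed floor value).
    variances = [
        max(sum((v[d] - means[d]) ** 2 for v in vectors) / n, 1e-6)
        for d in range(dim)
    ]
    return means, variances

def match_score(vectors, model):
    """Average per-frame log-likelihood of the test vectors under the model."""
    means, variances = model
    total = 0.0
    for v in vectors:
        for x, m, s2 in zip(v, means, variances):
            total += -0.5 * (math.log(2 * math.pi * s2) + (x - m) ** 2 / s2)
    return total / len(vectors)

# Toy 2-D "feature vectors" standing in for real speech features (assumed data).
speaker_a_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.1, 2.1]]
speaker_b_train = [[4.0, -1.0], [4.2, -0.8], [3.8, -1.2], [4.1, -0.9]]
test_utterance = [[1.05, 1.95], [0.95, 2.05]]

# Training phase: one model per speaker.
model_a = train_model(speaker_a_train)
model_b = train_model(speaker_b_train)

# Recognition phase: score the test utterance against each model.
score_a = match_score(test_utterance, model_a)
score_b = match_score(test_utterance, model_b)
best = "A" if score_a > score_b else "B"
print(best)
```

In this sketch, additive noise in the training vectors would shift the estimated means and inflate the variances, which is one way a noisy model can lower the match score for a genuine test utterance.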
The prior art describes estimating clean speech vectors and corruption models, approaches that are slow. Estimating clean model parameters is faster than estimating clean speech vectors, because the model parameters are fewer in number than the vectors. Accordingly, estimating clean speech parameters from noisy speech parameters, with less computational complexity than estimating clean speech vectors, for more effective speaker or speech recognition is still considered one of the biggest challenges in the technical field.