1. Technical Field
The present invention relates to audio classification and more particularly to systems and methods for addressing mismatch in utterances due to equipment or transmission media differences.
2. Description of the Related Art
Speaker recognition and verification are important parts of many current systems for security and other applications. However, performance degrades severely under mismatched channel conditions. For example, when a person enrolls for a service using an electret handset but later attempts to access their account and be verified using a cell phone, there is significant mismatch between the two audio environments.
Solutions to date include Speaker Model Synthesis (SMS), Feature Mapping (FM), Intersession Variation Modeling (ISV) and channel-specific score normalization. Each has a drawback. SMS and FM perform a model or feature transformation based on a criterion that is unrelated to the core likelihood ratio criterion used to score the result. ISV does not assume discrete channel classes, and score normalization does not directly account for channel mismatch.
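For reference, the likelihood ratio criterion mentioned above is commonly written as an average log-likelihood ratio between a speaker model and a background model. The form below is an illustrative assumption based on common practice, not a definition taken from this description; the symbols X = {x_1, ..., x_T} (the test feature vectors), \lambda_{spk} (the claimed speaker's model) and \lambda_{bkg} (the background model) are introduced here only for illustration.

```latex
\Lambda(X) = \frac{1}{T} \sum_{t=1}^{T}
  \left[ \log p(x_t \mid \lambda_{spk}) - \log p(x_t \mid \lambda_{bkg}) \right]
```

Under this assumed form, a claim is typically accepted when \Lambda(X) exceeds a threshold. A transformation estimated under some other criterion is not guaranteed to improve this score.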
Previous work addressing the channel mismatch problem is similar in that either the features or the model parameters are transformed according to some criterion. For example, SMS is a model transformation technique. It transforms speaker models according to the parameter differences between maximum a posteriori (MAP) adapted speaker background models of different handset types.
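The model transformation described above can be illustrated with a minimal sketch. It assumes that each model is a Gaussian mixture with aligned components and that only the component means are shifted by the difference between two channel-specific background models. The function name, the data layout and the toy numbers are hypothetical, and a fuller treatment may also adjust mixture weights and variances.

```python
def synthesize_speaker_means(speaker_means, enroll_channel_means, target_channel_means):
    """Shift each speaker mean by the offset between two channel background models.

    All three arguments are lists of mean vectors with aligned components
    (an assumption of this sketch): component k of every model is taken to
    describe the same acoustic region.
    """
    synthesized = []
    for spk, src, dst in zip(speaker_means, enroll_channel_means, target_channel_means):
        # New mean = speaker mean + (target channel mean - enrollment channel mean).
        synthesized.append([s + (d - e) for s, e, d in zip(spk, src, dst)])
    return synthesized


# Toy two-component, two-dimensional models (made-up numbers for illustration).
electret_means = [[0.0, 1.0], [2.0, 3.0]]
cellular_means = [[0.5, 1.5], [1.0, 3.0]]
speaker_means_electret = [[0.25, 1.25], [2.5, 2.5]]

speaker_means_cellular = synthesize_speaker_means(
    speaker_means_electret, electret_means, cellular_means
)
```

In this sketch, a speaker model trained on electret data is moved toward the cellular domain without any cellular data from that speaker. The shift depends only on the two background models, which is why the transformation is unrelated to the scoring criterion.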
Some work in the area of speech recognition, although not directly addressing the channel mismatch problem, is also worthy of mention. This work examined constrained discriminative model training and transformations to robustly estimate model parameters. Using such constraints, speaker models could be adapted to new environments. Another approach, termed factor analysis, models the speaker and channel variability in a model parameter subspace. Follow-up work showed that modeling intersession variation alone provided significant gains in speaker verification performance.
Several schemes address channel mismatch through feature transformation. One study used a neural network to perform feature mapping on an incoming acoustic feature stream to minimize the effect of channel influences. No explicit channel-specific mappings were applied in that study. Another technique performed feature mapping by detecting the channel type and mapping the features to a neutral channel domain. This technique mapped features in a manner similar to the way SMS transforms model parameters. For speech recognition, a piecewise Feature space Maximum Likelihood Linear Regression (fMLLR) transformation has been applied to adapt to channel conditions. No explicit channel information is exploited in that approach.
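The channel-dependent feature mapping described above can be sketched as follows. The sketch assumes that the channel type has already been detected, that the channel-dependent and neutral models are diagonal-covariance Gaussian mixtures with aligned components, and that each feature vector is mapped through its top-scoring component. The function names and the dictionary layout are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def map_feature(x, channel_gmm, neutral_gmm):
    """Map one feature vector from a detected channel domain to a neutral domain.

    Each GMM is a dict with 'weights', 'means' and 'vars' (hypothetical layout).
    Component k of both models is assumed to describe the same acoustic region.
    """
    # Pick the channel-dependent component that best explains x.
    scores = [
        math.log(w) + log_gaussian(x, m, v)
        for w, m, v in zip(
            channel_gmm["weights"], channel_gmm["means"], channel_gmm["vars"]
        )
    ]
    k = scores.index(max(scores))
    # Normalize by that component, then rescale to the matching neutral component.
    return [
        (xi - mc) * math.sqrt(vn / vc) + mn
        for xi, mc, vc, mn, vn in zip(
            x,
            channel_gmm["means"][k],
            channel_gmm["vars"][k],
            neutral_gmm["means"][k],
            neutral_gmm["vars"][k],
        )
    ]


# Toy one-dimensional models (made-up numbers for illustration).
cellular = {"weights": [0.5, 0.5], "means": [[0.0], [10.0]], "vars": [[1.0], [4.0]]}
neutral = {"weights": [0.5, 0.5], "means": [[1.0], [8.0]], "vars": [[4.0], [1.0]]}

mapped_high = map_feature([9.0], cellular, neutral)
mapped_low = map_feature([0.5], cellular, neutral)
```

As with the model-domain sketch for SMS, the mapping here is fixed by the parameters of the two channel models and is not tied to the score that the verification system ultimately computes.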