Speaker recognition technology identifies a speaker by means of signal processing and pattern recognition. It mainly comprises two procedures: speaker model training and speech evaluation.
At present, the main features adopted for speaker recognition are MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Predictive Cepstral Coefficients), and PLP (Perceptual Linear Prediction). The main recognition algorithms include VQ (Vector Quantization), GMM-UBM (Gaussian Mixture Model-Universal Background Model), and SVM (Support Vector Machine). GMM-UBM is the most commonly used recognition algorithm in the field of speaker recognition.
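For illustration, the GMM-UBM approach named above can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of any particular system: the UBM parameters are hand-set rather than trained with EM, the two-dimensional frames are synthetic stand-ins for MFCC vectors, only the component means are MAP-adapted, and the relevance factor of 16 and all function names are illustrative choices.

```python
import math
import random


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def component_log_probs(x, gmm):
    """log(w_k * N(x | mean_k, var_k)) for every mixture component k."""
    return [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])
    ]


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under the GMM."""
    total = sum(logsumexp(component_log_probs(x, gmm)) for x in frames)
    return total / len(frames)


def map_adapt_means(ubm, frames, relevance=16.0):
    """Mean-only MAP adaptation of the UBM towards one speaker's frames."""
    k = len(ubm["weights"])
    dim = len(ubm["means"][0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for x in frames:
        logp = component_log_probs(x, ubm)
        norm = logsumexp(logp)
        for j in range(k):
            post = math.exp(logp[j] - norm)  # responsibility of component j
            counts[j] += post
            for d in range(dim):
                sums[j][d] += post * x[d]
    new_means = []
    for j in range(k):
        alpha = counts[j] / (counts[j] + relevance)  # data-dependent weight
        prior = ubm["means"][j]
        data_mean = [s / counts[j] if counts[j] > 0 else mu
                     for s, mu in zip(sums[j], prior)]
        new_means.append([alpha * e + (1.0 - alpha) * mu
                          for e, mu in zip(data_mean, prior)])
    return {"weights": ubm["weights"], "means": new_means, "vars": ubm["vars"]}


def llr_score(frames, speaker_gmm, ubm):
    """Log-likelihood ratio: speaker model versus universal background model."""
    return avg_log_likelihood(frames, speaker_gmm) - avg_log_likelihood(frames, ubm)


def make_frames(rng, center, n, sd=0.5):
    """Synthetic frames (hypothetical stand-ins for MFCC vectors)."""
    return [[rng.gauss(c, sd) for c in center] for _ in range(n)]


# Hand-set toy UBM (assumption: a real UBM is trained on many speakers).
ubm = {
    "weights": [0.5, 0.5],
    "means": [[-1.0, 0.0], [1.0, 0.0]],
    "vars": [[1.0, 1.0], [1.0, 1.0]],
}

rng = random.Random(0)
train = make_frames(rng, [1.5, 1.0], 200)        # enrolment speech of the target
target_test = make_frames(rng, [1.5, 1.0], 100)  # same speaker, new frames
impostor_test = make_frames(rng, [-1.5, -1.0], 100)

# Procedure 1: speaker model training. Procedure 2: speech evaluation.
speaker = map_adapt_means(ubm, train)
target_score = llr_score(target_test, speaker, ubm)
impostor_score = llr_score(impostor_test, speaker, ubm)
print("target LLR:", target_score)
print("impostor LLR:", impostor_score)
```

In a verification setting, the score would then be compared against a decision threshold tuned on held-out data; the text does not specify such a threshold, so none is assumed here.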
In speaker recognition, the speaker's training speech is usually neutral speech, because in real applications a user under ordinary circumstances provides only neutrally spoken speech to train the user's model. Requiring every user to provide speech in every emotional state is neither easy nor convenient in practice, and it would also place a heavy load on the system's database.
During actual testing, however, a speaker may utter speech in different emotional states, such as elation, sadness, or anger, depending on the speaker's feelings at the time. Current speaker recognition algorithms cannot handle or adapt to the mismatch between training speech and test speech, so recognition performance deteriorates and the recognition success rate on emotional speech drops sharply.