Voiceprint recognition (VPR), also known as speaker recognition, is a biometric recognition technology. Speaker recognition falls into two categories: speaker identification and speaker verification. Speaker identification determines which one of several people produced a particular speech segment; it is a “one-of-many selection” problem. Speaker verification confirms whether a given speech segment was produced by a specified person; it is a “one-to-one discrimination” problem.
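The difference between the two tasks can be illustrated with a minimal sketch. It assumes that each speaker is already represented by a fixed-length voiceprint vector and that cosine similarity is the score; the vectors, the speaker names, the helper names `identify` and `verify`, and the threshold of 0.8 are all hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(test_vec, enrolled):
    """Identification (1:N): pick the enrolled speaker with the highest score."""
    return max(enrolled, key=lambda name: cosine_similarity(test_vec, enrolled[name]))

def verify(test_vec, claimed_model, threshold=0.8):
    """Verification (1:1): accept or reject a single claimed identity.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return cosine_similarity(test_vec, claimed_model) >= threshold

# Hypothetical enrolled voiceprint vectors (toy 3-dimensional examples).
enrolled = {
    "alice": [0.9, 0.1, 0.2],
    "bob": [0.1, 0.8, 0.3],
    "carol": [0.2, 0.2, 0.9],
}
test_vec = [0.85, 0.15, 0.25]

print(identify(test_vec, enrolled))         # best match among all enrolled speakers
print(verify(test_vec, enrolled["alice"]))  # does the speech match the claimed speaker?
print(verify(test_vec, enrolled["bob"]))
```

Identification compares the test speech against every enrolled model and returns the best match, whereas verification compares it against one claimed model and returns a yes/no decision.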
VPR may be text-dependent or text-independent. A text-dependent VPR system requires users to speak content prescribed by specific rules, so that each person's voiceprint model can be built accurately; the speaker must also speak the prescribed content at recognition time, which allows the system to achieve better recognition results. However, such a system depends on user cooperation: if a user's speech does not match the prescribed content, the system cannot recognize that user correctly. A text-independent VPR system places no restriction on the content spoken, which makes model building relatively difficult. However, it is more convenient for users and has a wider range of applications.
In conventional speaker recognition technology, mainstream recognition systems use spectrum-based features, such as Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction (PLP) coefficients, and Linear Prediction Cepstral Coefficients (LPCC). These features are all derived from the relatively intuitive spectrogram and are easily affected by various kinds of noise. In practical speaker recognition applications, however, the collected speech data is unlikely to be clean: the types of noise it contains are complex, and the signal-to-noise ratio is very poor. If conventional low-level spectrum-based features are used, a large amount of noise compensation must be applied at the feature level, the model level, and the score level once the features have been extracted. This increases computational complexity and latency, and still cannot completely eliminate the effect of noise.
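The sensitivity of spectral features to noise can be illustrated with a small numerical sketch. It is not an MFCC, PLP, or LPCC implementation: it assumes a single synthetic frame (a pure tone), additive Gaussian noise, and a plain log power spectrum as a stand-in for a spectrum-based feature. The frame length, the noise scales, and the helper names `log_power_spectrum`, `snr_db`, and `measure` are hypothetical.

```python
import cmath
import math
import random

N = 64  # frame length in samples (illustrative assumption)

def log_power_spectrum(frame, floor=1e-10):
    """Log power spectrum in dB via a naive DFT over bins 0..N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        # The floor avoids log(0) for bins that carry no energy.
        spectrum.append(10.0 * math.log10(max(abs(coeff) ** 2, floor)))
    return spectrum

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB from average sample power."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(v * v for v in noise) / len(noise)
    return 10.0 * math.log10(p_signal / p_noise)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
unit_noise = [rng.gauss(0.0, 1.0) for _ in range(N)]

def measure(noise_scale):
    """Return (SNR in dB, mean absolute log-spectral difference in dB)."""
    noise = [noise_scale * v for v in unit_noise]
    noisy = [s + v for s, v in zip(clean, noise)]
    clean_spec = log_power_spectrum(clean)
    noisy_spec = log_power_spectrum(noisy)
    distortion = sum(abs(a - b) for a, b in zip(clean_spec, noisy_spec)) / len(clean_spec)
    return snr_db(clean, noise), distortion

for scale in (0.01, 0.5):
    snr, distortion = measure(scale)
    print(f"noise scale {scale}: SNR = {snr:.1f} dB, spectral distortion = {distortion:.1f} dB")
```

In this toy setting, a lower signal-to-noise ratio moves the noisy log spectrum further from the clean one, which is the kind of mismatch that noise compensation at the feature, model, and score levels tries to reduce.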