Current state-of-the-art approaches to speaker recognition are based on a universal background model (UBM) estimated using either acoustic Gaussian mixture modeling (GMM) or a phonetically-aware deep neural network architecture. The most successful techniques adapt the UBM to every speech utterance using the total variability paradigm, which aims to extract a low-dimensional feature vector known as an “i-vector” that preserves the total information about the speaker and the channel. After applying a channel compensation technique, the resulting i-vector can be considered a voiceprint or voice signature of the speaker.
One drawback of such approaches is that, in programmatically determining or verifying the identity of a speaker by way of a speech signal, a speaker recognition system may encounter a variety of elements that can corrupt the signal. This channel variability poses a real problem to conventional speaker recognition systems. A telephone user's environment and equipment, for example, can vary from one call to the next. Moreover, telecommunications equipment relaying a call can vary even during the call.
In a conventional speaker recognition system, a speech signal is received and evaluated against a previously enrolled model. That model, however, typically is limited to a specific noise profile, including particular noise types such as babble, ambient or HVAC (heating, ventilation and air conditioning) noise and/or a low signal-to-noise ratio (SNR), each of which can degrade the quality of either the enrolled model or the prediction for the recognition sample. Speech babble, in particular, has been recognized in the industry as one of the most challenging types of noise interference because of its speaker-like and speech-like characteristics. Reverberation characteristics, including a long reverberation time (T60, the time for the sound level to decay by 60 dB) and a low direct-to-reverberation ratio (DRR), also adversely affect the quality of a speaker recognition system. Additionally, an acquisition device may introduce audio artifacts that are often ignored, even though speaker enrollment may use one acquisition device while testing uses a different one. Finally, the quality of the transcoding technique(s) and the bit rate are important factors that may reduce the effectiveness of a voice biometric system.
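As an illustration of the SNR quantity mentioned above, the following is a minimal sketch, not part of any system described here, of how SNR in decibels can be computed and how an interfering signal can be scaled to reach a chosen SNR. The function names (`power`, `snr_db`, `mix_at_snr`), the 200 Hz tone standing in for speech, the Gaussian noise standing in for an interferer, and the 5 dB target are all assumptions made for the example.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(speech, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    return 10.0 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise to the requested SNR and add it to the speech.

    Returns the noisy mixture and the scaled noise that was added.
    """
    # The gain g satisfies P_speech / (g^2 * P_noise) = 10^(target / 10).
    ratio = 10.0 ** (target_snr_db / 10.0)
    gain = math.sqrt(power(speech) / (power(noise) * ratio))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Hypothetical stand-ins: a 200 Hz tone for speech, Gaussian noise for the interferer.
rng = random.Random(0)
sample_rate = 8000
speech = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

noisy, scaled_noise = mix_at_snr(speech, noise, target_snr_db=5.0)
print(round(snr_db(speech, scaled_noise), 6))
```

The printed value should match the 5 dB target up to floating-point rounding. A lower target produces a mixture in which the interferer carries proportionally more energy, which is the low-SNR condition described above.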
Conventionally, channel compensation has been approached at different levels that follow spectral feature extraction, either by applying feature normalization or by incorporating the compensation into modeling or scoring tools such as Nuisance Attribute Projection (NAP) (see Solomonoff, et al., “Nuisance attribute projection”, Speech Communication, 2007) or Probabilistic Linear Discriminant Analysis (PLDA) (see Prince, et al., “Probabilistic Linear Discriminant Analysis for Inferences about Identity”, IEEE ICCV, 2007).
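The text does not name a particular feature normalization. One widely used instance is per-utterance cepstral mean and variance normalization (CMVN), sketched below purely as an assumed example. The sketch relies on the common approximation that a stationary linear channel appears as a constant additive offset on each cepstral coefficient, so subtracting the per-utterance mean removes it. The function name `cmvn`, the three-dimensional random "features", and the offset values are hypothetical.

```python
import math
import random


def cmvn(frames, eps=1e-12):
    """Normalize each feature dimension to zero mean and unit variance
    over the utterance (frames is a list of equal-length feature vectors)."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) + eps
        for d in range(dims)
    ]
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]


# Hypothetical data: 50 frames of 3 cepstral-like coefficients.
rng = random.Random(0)
clean = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]

# Assumed channel effect: a constant offset per coefficient.
channel_offset = [0.7, -1.2, 0.3]
shifted = [[c + o for c, o in zip(frame, channel_offset)] for frame in clean]

normalized_clean = cmvn(clean)
normalized_shifted = cmvn(shifted)

# Largest difference between the two normalized versions.
max_diff = max(
    abs(x - y)
    for fa, fb in zip(normalized_clean, normalized_shifted)
    for x, y in zip(fa, fb)
)
print(max_diff)
```

Under these assumptions the two normalized feature sets differ only at the level of floating-point rounding. This kind of normalization addresses only a constant offset; it does not undo additive noise, reverberation, or time-varying channels.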
A few research attempts have looked at extracting channel-robust low-level features for the task of speaker recognition. (See, e.g., Richardson et al. “Channel compensation for speaker recognition using MAP adapted PLDA and denoising DNNs”, Proc. Speaker Lang. Recognit. Workshop, 2016; and Richardson, et al. “Speaker Recognition Using Real vs Synthetic Parallel Data for DNN Channel Compensation”, INTERSPEECH, 2016.) These attempts employ a denoising deep neural network (DNN) system that takes corrupted Mel-frequency cepstral coefficients (MFCCs) as input and provides a cleaner version of these MFCCs as output. However, they do not explore the full potential of the denoising DNN, because they do not apply it directly to the audio signal. A significant portion of the relevant speaker-specific information is already lost once MFCCs have been extracted from the corrupted signal, and it is difficult for the DNN to fully recover this information.
Other conventional methods explore using phonetically-aware features that were originally trained for automatic speech recognition (ASR) tasks to discriminate between different senones. (See Zhang et al. “Extracting Deep Neural Network Bottleneck Features using Low-rank Matrix Factorization”, IEEE ICASSP, 2014.) Combining those features with MFCCs may increase performance. However, these features are computationally expensive to produce: they depend on a heavy DNN-based ASR system trained with thousands of senones on the output layer. Additionally, this ASR system requires a significant amount of manually transcribed audio data for DNN training and time alignment. Moreover, the resulting speaker recognition system will work only on the language on which the ASR system was trained, and thus cannot generalize well to other languages.