Automatic speaker recognition is the problem of having a machine recognize the identity of a human speaker from a speech signal of some specific duration. The problem falls into two broad categories, text-dependent and text-independent, depending on whether the input speech signal is constrained to be from a specific text (e.g., a password text) or is unconstrained, i.e., the speech signal can be of any text (content) in any language. Under a different definition, speaker recognition comprises two problems: speaker identification and speaker verification. Speaker identification is a multi-class problem of determining which of the many speakers enrolled in a system produced the input speech signal. Speaker verification is essentially a binary classification problem of determining whether or not the input signal was spoken by a claimed identity, yielding a Yes/No decision, i.e., deciding whether the input signal is from the claimed identity (termed the target speaker) or from a person other than the claimed identity (termed an impostor).
Traditionally, speaker recognition systems have been built using short-time acoustic feature vectors, typically MFCCs (mel-frequency cepstral coefficients), treated in a bag-of-vectors framework, together with GMM (Gaussian mixture model) based speaker modelling and GMM-UBM (GMM-Universal Background Model) based background speaker modelling, as specified in J. P. Campbell. Speaker recognition: A tutorial. Proc. IEEE, vol. 85, no. 9, pp. 1437-1462, September 1997, and J. H. L. Hansen and T. Hasan. Speaker recognition by machines and humans: A tutorial review. IEEE Signal Process. Mag., vol. 32, no. 6, pp. 74-99, November 2015, which are incorporated by reference herein. In this approach, the background model (UBM) is trained on a large set of speakers and adapted to a specific speaker (whether for speaker identification or speaker verification) via MAP (maximum a posteriori) adaptation techniques to yield the speaker-specific GMM. Speaker identification of a test signal is a multi-class classification problem of deciding which of the N speaker models yields the highest likelihood for a test utterance (a collection of feature vectors). In speaker verification, a likelihood ratio test decides whether the input test vectors are more likely to have come from the claimant (target) speaker model or from the background model.
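The two decision rules described above can be illustrated with a minimal sketch, assuming diagonal-covariance GMMs whose parameters are already available. The function names, the toy two-component models, and the zero threshold are hypothetical choices made for illustration only; MFCC extraction and MAP adaptation are not shown.

```python
import numpy as np

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames (T, D) under a diagonal GMM."""
    diff = frames[:, None, :] - means[None, :, :]                    # (T, K, D)
    log_norm = -0.5 * np.log(2.0 * np.pi * variances).sum(axis=1)    # (K,)
    log_comp = log_norm - 0.5 * (diff ** 2 / variances).sum(axis=2)  # (T, K)
    weighted = log_comp + np.log(weights)
    # Numerically stable log-sum-exp over the K mixture components.
    peak = weighted.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.exp(weighted - peak).sum(axis=1))
    return float(frame_ll.mean())

def identify(frames, speaker_models):
    """Speaker identification: index of the model with the highest likelihood."""
    scores = [gmm_avg_loglik(frames, *model) for model in speaker_models]
    return int(np.argmax(scores))

def verify(frames, target_model, ubm, threshold=0.0):
    """Speaker verification: log-likelihood ratio of target model vs. UBM."""
    llr = gmm_avg_loglik(frames, *target_model) - gmm_avg_loglik(frames, *ubm)
    return llr > threshold, llr

def toy_model(center, spread, dim=4):
    """Hypothetical two-component model used only to exercise the functions."""
    weights = np.array([0.5, 0.5])
    means = np.stack([np.full(dim, center - 0.5), np.full(dim, center + 0.5)])
    variances = np.full((2, dim), spread)
    return weights, means, variances

rng = np.random.default_rng(0)
speakers = [toy_model(-3.0, 1.0), toy_model(3.0, 1.0)]
ubm = toy_model(0.0, 4.0)  # broad background model
test_frames = rng.normal(loc=3.0, scale=1.0, size=(200, 4))

best_speaker = identify(test_frames, speakers)
accepted, llr = verify(test_frames, speakers[1], ubm)
```

In practice the threshold would be tuned on held-out data to trade off false acceptances against false rejections; the value used here is arbitrary.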
Alternatively, as a progression from the above framework, the i-vector/PLDA (probabilistic linear discriminant analysis) approach evolved, which yielded state-of-the-art performance for a long period of time and is now being progressively replaced by end-to-end approaches. In the i-vector/PLDA approach, a supervector is formed by stacking the mean vectors of the speaker-adapted GMM, and this supervector is projected onto a total variability space to extract a low-dimensional vector called the i-vector (identity vector) of that speaker, as shown in Najim Dehak, Patrick J Kenny, Reda Dehak, Pierre Dumouchel and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788-798, 2011; and Simon J D Prince and James H Elder. Probabilistic linear discriminant analysis for inferences about identity. Proc. of International Conference on Computer Vision, 2007, which are incorporated by reference herein. Once such an i-vector is extracted as a ‘representation’ of the target speaker, further discriminative modelling, such as PLDA, is applied to handle channel or session variability. In the testing or verification stage, the decision as to whether the input test utterance is from the claimant speaker is made using the PLDA score or, alternatively, by computing the distance between the i-vectors extracted during the enrollment stage and the verification stage.
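The distance-based alternative mentioned above can be sketched as follows. The text does not name a particular distance, so cosine similarity is assumed here as one common choice; the function names and the threshold value are hypothetical, and the vectors stand in for i-vectors that would come from a trained extractor.

```python
import numpy as np

def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between an enrollment vector and a test vector."""
    a = np.asarray(enroll_vec, dtype=float)
    b = np.asarray(test_vec, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def accept(enroll_vec, test_vec, threshold=0.5):
    """Accept the identity claim when the similarity reaches the threshold."""
    return cosine_score(enroll_vec, test_vec) >= threshold

# Toy vectors only: same direction (accepted) vs. orthogonal (rejected).
same_direction = accept([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = accept([1.0, 0.0], [0.0, 1.0])
```

Because cosine similarity ignores vector length, scaling either vector leaves the score unchanged, which is one reason it is often preferred over Euclidean distance for this kind of comparison.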
In summary, conventional techniques are based on:
(a) Short-term feature representation of the input speech signal, based on a spectral feature vector, namely the MFCCs (mel-frequency cepstral coefficients). These are called hand-crafted features, as they are specified by signal processing techniques that make use of prior knowledge of speech production/perception mechanisms.
(b) Speaker-GMM/background GMM-UBM modeling, followed by likelihood-based multi-class classification (for speaker identification) and hypothesis testing in the form of a likelihood-ratio test (for speaker verification).
(c) An i-vector/PLDA approach, as a further evolution of the above GMM-UBM approach, that is able to handle session variability and channel variability using results from joint factor analysis.
There are several known prior art efforts towards designing deep-learning and end-to-end architectures for speaker recognition, specifically using the CNN (convolutional neural network) framework as a front-end representation learning mechanism, described as follows:
(1) Text-dependent speaker-verification results have been reported using deep neural networks (DNNs) and recurrent neural networks (RNNs) for speaker-discriminative or phonetic-discriminative network training. Here, intermediate frame-level features are extracted to formulate utterance-level speaker representations: d-vectors, as disclosed in G. Heigold, I. Moreno, S. Bengio, and N. Shazeer. End-to-end text dependent speaker verification. Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2016, pp. 5115-5119; and E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez. Deep neural networks for small footprint text-dependent speaker verification. Proc. IEEE Int. Conf. Acoust., Speech Signal Processing, 2014, pp. 4052-4056, which are incorporated by reference herein; and bottleneck activations or phonetic alignments, as disclosed in F. Richardson, D. Reynolds, and N. Dehak. Deep neural network approaches to speaker and language recognition. IEEE Signal Process. Lett., vol. 22, no. 10, pp. 1671-1675, October 2015; and Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren. A novel scheme for speaker recognition using a phonetically-aware deep neural network. Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 1695-1699, incorporated by reference herein.
(2) DNNs (deep neural networks), RNNs (recurrent neural networks) and CNNs (convolutional neural networks) with an end-to-end loss have been proposed to discriminate between same-speaker and different-speaker pairs for global keyword (e.g., ‘OK Google’ and ‘Hey Cortana’) speaker verification tasks, and have been shown to achieve better performance than conventional techniques such as GMM-UBM or i-vector/PLDA, as disclosed in G. Heigold, I. Moreno, S. Bengio, and N. Shazeer. End-to-end text dependent speaker verification. Proc. IEEE Int. Conf. Acoust. Speech, Signal Process., 2016, pp. 5115-5119; and S. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong. End-to-end attention based text-dependent speaker verification. Proc. IEEE Workshop Spoken Lang. Technol., 2016, pp. 171-178, incorporated by reference herein.
(3) Deep learning frameworks with end-to-end loss functions for training speaker-discriminative embeddings include the work of Snyder et al. and Garcia-Romero et al., as disclosed in D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel and S. Khudanpur. Deep neural network based speaker embedding for end-to-end speaker verification. Proc. IEEE SLT, 2016; and D. Garcia-Romero, D. Snyder, G. Sell, D. Povey and A. McCree. Speaker diarization using deep neural network embeddings. Proc. IEEE Int. Conf. Acoust., Speech Signal Processing, 2017, pp. 4930-4934, incorporated by reference herein. These works show that deep neural networks with an end-to-end similarity metric, or DNN-based speaker embeddings, can outperform i-vector baselines.
(4) Several works, termed end-to-end realizations, actually start from a ‘spectrogram’ representation of the speech signal (derived via STFT (short-time Fourier transform) or mel filter bank (MFB) frame-level spectral representations), or from a sequence of short-term feature vectors stacked in time, followed by aggregation strategies to obtain a score over the entire duration of the input utterance, and perform deep CNN representation learning, as disclosed in Mitchell McLaren, Yun Lei, Nicolas Scheffer, Luciana Ferrer, Application of convolutional neural networks to speaker recognition in noisy conditions, Interspeech 2014, pp. 686-690, Singapore, 2014; Shi-Xiong Zhang, Zhuo Chen, Yong Zhao, Jinyu Li and Yifan Gong, End-to-end attention based text-dependent speaker verification, Proc. IEEE SLT 2016, pp. 171-178, 2016; Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan and Zhenyao Zhu, Deep Speaker: an End-to-End Neural Speaker Embedding System, arXiv:1705.02304v1 [cs.CL] 5 May 2017; Chunlei Zhang, Kazuhito Koishida, End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances, Interspeech 17, pp. 1487-1491, Stockholm, Sweden, 2017; Joon Son Chung, Arsha Nagrani, Andrew Zisserman, VoxCeleb2: Deep Speaker Recognition, arXiv:1806.05622v2 [cs.SD] 27 Jun. 2018; Hossein Salehghaffari, Speaker Verification using Convolutional Neural Networks, arXiv:1803.05427v2 [eess.AS] 10 Aug. 2018; Mahdi Hajibabaei, Dengxin Dai, Unified hypersphere embedding for speaker recognition, arXiv:1807.08312v1 [eess.AS] 22 Jul. 2018; Migshwan Wang et al., Speaker recognition using convolutional neural network with minimal training data for smart home solutions, pp. 139-145, 2018; Amirsina Torfi, Jeremy Dawson, Nasser M. Nasrabadi, Text-independent speaker verification using 3D convolutional neural networks, arXiv:1705.09422v7 [cs.CV] 6 Jun. 2018; Chunlei Zhang, Kazuhito Koishida and John H. L. Hansen, Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings, IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 26, no. 9, pp. 1633-1644, September 2018; Gautam Bhattacharya, Jahangir Alam, Patrick Kenny, Deep Speaker Recognition: Modular or Monolithic?, Interspeech 19, pp. 1143-1147, Graz, Austria, 2019; Yiheng Jiang, Yan Song, Ian McLoughlin, Zhifu Gao, Lirong Dai, An Effective Deep Embedding Learning Architecture for Speaker Verification, Interspeech 19, pp. 4040-4044, Graz, Austria, 2019; and Sarthak Yadav, Atul Rai, Frequency and temporal convolutional attention for text-independent speaker recognition, arXiv:1910.07364v2 [cs.SD] 19 Oct. 2019; all of which are incorporated by reference herein.
(5) Among the previous references, one work that distinguishes itself by being ‘end-to-end’ from the ‘raw speech waveform’ is that of Hannah Muckenhirn, Mathew Magimai-Doss, Sebastien Marcel. Towards directly modeling raw speech signal for speaker verification using CNNs. Proc. ICASSP 18, pp. 4884-4888, 2018, incorporated herein by reference, which proposes a CNN architecture for learning representations that are then further processed for SID (speaker identification) or SV (speaker verification). Although it works in a truly end-to-end manner from the raw speech waveform, this CNN architecture is a conventional CNN: it employs two layers of convolutions, one ‘followed’ by the other in ‘tandem’ (i.e., as a cascade), each with a different kernel size dictated more by the dimensions of the data involved in each layer. CNN architectures have been used in the prior art for a wide variety of problems, including speech processing tasks other than speaker recognition, as follows:
(i) Starting from the early introduction of the convolutional neural network (CNN) by LeCun, disclosed in Y. LeCun et al. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, vol. 1, pp. 541-551, 1989, for successful recognition of handwritten digit images, CNNs have come to be a well-established framework for end-to-end approaches (i.e., from raw input), combining a powerful representational learning mechanism in the lower convolution layers, as disclosed in Y. Bengio, A. Courville, P. Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, issue 8, pp. 1798-1828, August 2013, incorporated herein by reference, with discriminative fully-connected higher layers for multi-class classification tasks. Such tasks include classification from raw images, as disclosed in Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60 (6): 84-90, June 2017, incorporated herein by reference; from audio/speech spectrographic images, as disclosed in Ossama Abdel-Hamid, Abdel-rahman Mohamed, Hui Jiang, Li Deng, Gerald Penn and Dong Yu, Convolutional Neural Networks for Speech Recognition, IEEE/ACM Trans. on Audio, Speech and Language Processing, vol. 22, no. 10, pp. 1533-1545, October 2014; and Shawn Hershey et al. CNN architectures for large-scale audio classification. Proc. ICASSP '17, New Orleans, 2017, incorporated herein by reference; from the speech waveform, as disclosed in D. Palaz, R. Collobert, R. Magimai-Doss, Analysis of CNN based speech recognition system using raw speech as input. Proc. Interspeech '15, Dresden, 2015; and Tara N. Sainath, Ron J. Weiss, Andrew W. Senior, Kevin W. Wilson and Oriol Vinyals. Learning the speech front-end with raw waveform CLDNNs. Proc. Interspeech 15, Dresden, 2015, incorporated herein by reference; from the audio waveform, as disclosed in Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, Samarjit Das. Very deep convolutional neural networks for raw waveforms. Proc. ICASSP 17, New Orleans, LA, 2017; and Tokozume, Y., Harada, T. Learning environmental sounds with end-to-end convolutional neural network. Proc. ICASSP '17, New Orleans, LA, 2017, incorporated herein by reference; and from the music waveform, as disclosed in Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim and Juhan Nam. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. Proc. 14th Sound and Music Computing Conference, pp. 220-226, Espoo, Finland, 2016; and Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim and Juhan Nam. SampleCNN: End-to-End Deep Convolutional Neural Networks Using Very Small Filters for Music Classification. Appl. Sci., 8, 150, 2018, incorporated herein by reference.
(ii) A work that comes close to handling multi-scale properties, disclosed in A. Schindler, T. Lidy, A. Rauber. Multi-temporal resolution convolutional neural networks for acoustic scene classification. Detection and Classification of Acoustic Scenes and Events 2017, November 2017, Munich, Germany, incorporated herein by reference, considers a ‘parallel CNN architecture’ with two branches having two different two-dimensional kernels, each designed to capture temporal and frequency relationships in an image-like 80×80 input: a log-amplitude transformed mel-spectrogram with a spectral resolution of 80 mel bands and a temporal resolution of 80 STFT (short-time Fourier transform) frames. However, it does not directly address the issue of time-frequency trade-offs from the raw one-dimensional waveform. Likewise, the recent work of Shawn Hershey et al. CNN architectures for large-scale audio classification. Proc. ICASSP '17, New Orleans, 2017, incorporated by reference herein, applies a class of conventional CNN architectures to the audio-scene classification task in a comparative study, with the input being log-mel spectrogram patches of 96×64 bins rather than raw waveforms.
Prior Art that Provides CNNs with Variable Kernel-Sizes.
The closest treatments in the literature to the notion of using variable kernel sizes are the following:
(a) In the image-CNN literature: The now well-known Inception network (or GoogLeNet), as disclosed in Christian Szegedy et al. Going Deeper with Convolutions. Proc. CVPR 2014, incorporated by reference herein, uses multiple image kernels of sizes 1×1, 3×3 and 5×5 in the initial CNN layers. However, the motivation for providing these variable-sized kernels was very different from the fundamental time-frequency trade-off (spatial intensity variation vs. spatial frequency in the case of images).
(b) For automatic speech recognition (ASR): The work of Zhu et al. (2016) as disclosed in Zhenyao Zhu, Jesse H. Engel and Awni Hannun. Learning Multiscale Features Directly From Waveforms. arXiv:1603.09509v2 [cs.CL] 5 Apr. 2016, incorporated by reference herein, applies multiple kernel sizes in a multi-scale CNN architecture with the objective of performing a multi time-frequency resolution analysis for automatic speech recognition (ASR).
(c) For audio scene classification (ASC): More recent work, disclosed in Boqing Zhu, Changjian Wang, Feng Liu, Jin Lei, Zengquan Lu, Yuxing Peng. Learning Environmental Sounds with Multi-scale Convolutional Neural Network. Proc. IJCNN 2018 (also arXiv:1803.10219v1, March 2018); and Boqing Zhu, Kele Xu, Dezhi Wang, Lilun Zhang, Bo Li, Yuxing Peng, Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features. arXiv:1805.09752v2, June 2018, incorporated by reference herein, addresses this issue and proposes a multi-temporal architecture for ASC, taking into account the need for a variable time-frequency representational analysis of a one-dimensional signal, such as an audio signal, for the ASC task.
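The time-frequency trade-off that motivates variable kernel sizes can be illustrated with a minimal sketch. It does not reproduce any of the cited architectures: the kernels below are hand-built Hann-windowed cosines standing in for learned kernels, and the 16 kHz sampling rate, the kernel lengths of 33 and 401 samples, and the tone frequencies are all assumptions chosen for illustration. Two parallel branches convolve the same raw waveform with a short and a long kernel tuned to the same frequency; the long kernel rejects a nearby frequency far better (finer frequency resolution), while the short kernel spans far less time (finer time resolution).

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sampling rate in Hz

def tuned_kernel(length, center_hz, sample_rate=SAMPLE_RATE):
    """Hann-windowed cosine: a hand-built stand-in for a learned kernel."""
    n = np.arange(length) - (length - 1) / 2.0
    return np.hanning(length) * np.cos(2.0 * np.pi * center_hz * n / sample_rate)

def multiscale_conv1d(waveform, kernels):
    """Apply kernels of different lengths in parallel branches.

    mode="same" keeps every branch output as long as the input, so the
    branches can be stacked into one (num_branches, num_samples) array.
    """
    return np.stack([np.convolve(waveform, k, mode="same") for k in kernels])

def rms(x):
    return float(np.sqrt(np.mean(x ** 2)))

t = np.arange(SAMPLE_RATE) / SAMPLE_RATE          # one second of samples
tone_on = np.sin(2.0 * np.pi * 1000.0 * t)        # at the kernels' centre frequency
tone_off = np.sin(2.0 * np.pi * 1200.0 * t)       # 200 Hz away from it

kernels = [tuned_kernel(33, 1000.0), tuned_kernel(401, 1000.0)]
out_on = multiscale_conv1d(tone_on, kernels)
out_off = multiscale_conv1d(tone_off, kernels)

# Leakage: response to the off-centre tone relative to the on-centre tone.
leak_short = rms(out_off[0]) / rms(out_on[0])
leak_long = rms(out_off[1]) / rms(out_on[1])
```

Here `leak_long` comes out much smaller than `leak_short`, which is the frequency-selectivity side of the trade-off; a multi-scale architecture keeps both branches so that later layers can draw on either resolution.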
CNNs with Variable Kernel ‘Band-Widths’
In another variant of CNNs, which allows variable kernel ‘bandwidths’ (in the frequency domain, where the kernel is interpreted as a bandpass filter), Ravanelli and Bengio (2018, 2019), in M. Ravanelli and Y. Bengio, “Speaker Recognition from raw waveform with SincNet”, Proc. of SLT, 2018 and M. Ravanelli and Y. Bengio, “Interpretable Convolutional Filters with SincNet”, 32nd Conference on Neural Information Processing Systems (NIPS 2018) IRASL workshop, Montréal, Canada, propose ‘SincNet’, which constrains the kernel to be a sinc function parameterized by the lower and upper cut-off frequencies of the corresponding ‘rectangular’ bandpass filter. SincNet thus proposes to use up to 80 filters in a convolutional layer, each of fixed length in the time domain but capable of learning a variable-bandwidth bandpass filter by treating the lower and upper cut-off frequencies of the sinc function as the parameters to be learnt.
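The idea of a fixed-length kernel whose bandwidth is set by two cut-off frequencies can be sketched as follows. This is an illustration of the standard band-pass construction as the difference of two low-pass sinc filters, not the SincNet implementation itself: the cut-offs are fixed numbers here rather than learned parameters, and the function name, the 251-sample length, the 16 kHz sampling rate, and the Hamming window are assumptions.

```python
import numpy as np

def sinc_bandpass(f_low_hz, f_high_hz, length=251, sample_rate=16000):
    """Fixed-length band-pass kernel defined only by its two cut-off frequencies."""
    n = np.arange(length) - (length - 1) / 2.0   # sample index centred on zero
    f1 = f_low_hz / sample_rate                  # cut-offs in cycles per sample
    f2 = f_high_hz / sample_rate
    # Difference of two ideal low-pass responses gives an ideal band-pass.
    # np.sinc(x) is the normalized sinc, sin(pi x) / (pi x).
    ideal = 2.0 * f2 * np.sinc(2.0 * f2 * n) - 2.0 * f1 * np.sinc(2.0 * f1 * n)
    # Window the truncated response to reduce ripple from the abrupt cut.
    return ideal * np.hamming(length)

# Two filters with different bandwidths but the same number of taps.
narrow = sinc_bandpass(900.0, 1100.0)
wide = sinc_bandpass(300.0, 3000.0)
```

Both kernels have the same length in the time domain, yet their pass-bands differ by more than a factor of ten; in a learned setting only the two cut-off values per filter would be updated during training.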
Various prior art in this domain can be outlined as follows:
(a) Prior art of type one: Related to conventional approaches to speaker recognition, namely, i) GMM-UBM and ii) i-vector/PLDA approaches.
(b) Prior art of type two: Related specifically to the use of CNN architectures (some of them presumably termed end-to-end) for ‘speaker recognition’ problems, as a recent progression from the above set of more conventional techniques.
(c) Prior art of type three: Related to the use of CNN architectures in various tasks.
(d) Prior art of type four: Related to specific CNN architectures that employ variable-sized kernels as we do here.
(e) Prior art of type five: Related to specific CNN architectures that employ variable-bandwidth (but fixed-sized) kernels.