Speaker verification (SV) is the process of verifying, based on a speaker's known utterances, whether an utterance belongs to the speaker. The general procedure of speaker verification consists of three phases: development, enrollment, and evaluation. In development, a background model is created to capture speaker-related information. In enrollment, the speaker models are created using the background model. Finally, in evaluation, the query utterances are verified by comparing them against the speaker models created in the enrollment phase.
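The enrollment and evaluation phases can be illustrated with a minimal sketch. It rests on several assumptions that are not stated above: the background model is taken to map each utterance to a fixed-length vector (the toy lists below stand in for its output), a speaker model is taken to be the average of the enrollment vectors, and the comparison is taken to be cosine similarity against a threshold. The function names `enroll` and `verify` and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def enroll(embeddings):
    """Enrollment (assumed): average a speaker's vectors into one speaker model."""
    count = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


def verify(speaker_model, query_embedding, threshold=0.7):
    """Evaluation (assumed): accept the query if its score reaches the threshold."""
    score = cosine_similarity(speaker_model, query_embedding)
    return score >= threshold, score


# Toy vectors standing in for the output of a background model (hypothetical values).
speaker_model = enroll([[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]])

accepted, score = verify(speaker_model, [0.95, 0.05, 0.05])   # similar direction
rejected, low_score = verify(speaker_model, [0.0, 1.0, 0.0])  # different direction
print(accepted, rejected)
```

In practice the threshold would be tuned on held-out data; the value here only makes the toy example separate the two queries.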
Speaker verification can be categorized as text-dependent or text-independent. In text-independent SV, no restriction is imposed on the utterances. In text-dependent SV, all speakers repeat the same phrase. Text-independent SV is more challenging than text-dependent SV because the system must clearly distinguish between the speaker-specific and non-speaker-specific characteristics of the uttered phrases. However, text-independent SV is easier to use in the real world, and thus it is the preferred approach, especially if its performance (e.g., accuracy) can match that of text-dependent SV.
Direct modeling of raw waveforms using deep neural networks (DNNs) is now prominent in the literature for a number of tasks due to advances in deep learning. Traditionally, spectrogram-based features with hand-tuned parameters were used for machine learning from audio. DNNs that take raw waveforms directly as input have a number of advantages over conventional acoustic feature-based DNNs. For example, minimizing pre-processing removes the need to explore various hyper-parameters such as the type of acoustic feature to use, window size, shift length, and feature dimension.
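To make the hyper-parameters named above concrete, the following sketch shows how window size and shift length alone determine the shape of the input in a conventional framing step. It is an illustration under assumptions: the 16 kHz sample rate, the 25 ms and 10 ms values, and the helper name `frame_signal` are hypothetical and are not taken from the text above.

```python
def frame_signal(samples, window_size, shift_length):
    """Split a waveform into overlapping frames of window_size samples."""
    frames = []
    start = 0
    # Keep only full frames; a trailing partial frame is dropped.
    while start + window_size <= len(samples):
        frames.append(samples[start:start + window_size])
        start += shift_length
    return frames


sample_rate = 16000                # assumed sample rate in Hz
waveform = [0.0] * sample_rate     # one second of silence as a placeholder

# One candidate setting: 25 ms window, 10 ms shift.
frames_a = frame_signal(waveform, window_size=400, shift_length=160)

# Another candidate setting: 20 ms window, 10 ms shift.
frames_b = frame_signal(waveform, window_size=320, shift_length=160)

print(len(frames_a), len(frames_a[0]))
print(len(frames_b), len(frames_b[0]))
```

Each setting yields a different number and length of frames for the same audio, and each would normally be evaluated separately. A DNN that consumes the raw waveform avoids this search.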
Despite the myriad of DNN architectures proposed for speaker verification, there is still a need for new DNN architectures that work with raw audio waveforms. The present invention fulfills this need.