Traditionally, a universal background model (UBM) is used to analyze acoustic signals for speaker recognition. The UBM outputs numerical acoustic indices that do not correspond to the phonetic or lexical content of the input speech signal. Speech content and the distortions it produces in the acoustic signal have been largely ignored in prior work on text-independent speaker verification.
A deep neural network (DNN) is a feed-forward neural network that is both much larger (e.g., a few thousand nodes per hidden layer) and much deeper (e.g., 5-7 hidden layers) than traditional neural networks. The application of DNNs in other fields can be straightforward if each output node of the DNN represents one of the classes of interest. Efforts are being made to adapt DNNs for speech recognition. However, applying DNNs directly to speaker recognition is much more challenging due to the limited amount of speaker-specific training data and the uncertainty of the speaker's identity.
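To make the description above concrete, the following is a minimal sketch of a feed-forward network in which each output node stands for one class of interest. It is illustrative only and is not taken from the source: the ReLU activation, the softmax output, the 40-dimensional input frames, the 10 classes, and the names `init_dnn` and `forward` are all assumptions. Only the layer width (a few thousand nodes) and depth (5 to 7 hidden layers) follow the text.

```python
import numpy as np


def init_dnn(input_dim, hidden_dim, num_hidden_layers, num_classes, seed=0):
    """Create randomly initialised (weight, bias) pairs for each layer.

    Hypothetical helper: the initialisation scale (0.01) is an assumption.
    """
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [hidden_dim] * num_hidden_layers + [num_classes]
    params = []
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        w = 0.01 * rng.standard_normal((d_in, d_out), dtype=np.float32)
        b = np.zeros(d_out, dtype=np.float32)
        params.append((w, b))
    return params


def forward(params, x):
    """Run a batch of input frames through the network.

    Returns one row per frame; each column is the posterior probability
    of one class, so each output node represents one class of interest.
    """
    h = x
    # Hidden layers: affine transform followed by ReLU (assumed activation).
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    # Output layer: affine transform followed by a numerically stable softmax.
    w, b = params[-1]
    logits = h @ w + b
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


# Sizes chosen to match the text: 5 hidden layers of about two thousand nodes.
# The input dimension (40) and class count (10) are placeholders.
params = init_dnn(input_dim=40, hidden_dim=2048, num_hidden_layers=5, num_classes=10)
frames = np.random.default_rng(1).normal(size=(3, 40))
posteriors = forward(params, frames)
predicted = posteriors.argmax(axis=-1)  # index of the most probable class per frame
```

In this arrangement, classification reduces to reading off the output node with the highest posterior. That is straightforward when the classes are fixed in advance and well represented in the training data, which is the condition the paragraph above identifies as hard to meet for speaker recognition.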