Many types of computer-implemented recognition systems exist, each configured to perform some form of classification with respect to input data set forth by a user. For example, computer-implemented speech recognition systems are configured to receive spoken utterances of a user and recognize words in the spoken utterances. In another example, handwriting recognition systems have been developed to receive a handwriting sample and identify, for instance, an author of the handwriting sample, individual letters in the handwriting sample, words in the handwriting sample, etc. In yet another example, computer-implemented recognition systems have been developed to perform facial recognition, fingerprint recognition, and the like.
Speech recognition in particular has been the subject of a significant amount of research and commercial development. For example, automatic speech recognition (ASR) systems have been incorporated into mobile telephones, desktop computers, automobiles, gaming consoles, customer service centers, etc., in order to recognize commands or questions and provide an appropriate response to them. For instance, in a mobile telephone equipped with an ASR system, a user can utter a name of a contact retained in a contacts list on the mobile telephone, and the mobile telephone can initiate a call to the contact.
Even after decades of research, however, the performance of ASR in real-world usage scenarios remains far from satisfactory. Conventionally, hidden Markov models (HMMs) have been the dominant technique for large vocabulary continuous speech recognition (LVCSR). In conventional HMMs used for ASR, observation probabilities for output states are modeled using Gaussian mixture models (GMMs). These GMM-HMM systems are typically trained to maximize the likelihood of generating observed features in training data. Recently, various discriminative strategies and large margin techniques have been explored. The potential of such techniques, however, is restricted by limitations of the GMM emission distribution model.
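By way of illustration only, the role of a GMM as the emission distribution of an HMM state can be sketched as follows. The sketch assumes diagonal covariances, and the function names, mixture weights, means, variances, and feature frame are hypothetical values chosen for the example; none of them is taken from any particular ASR system.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | state) = log sum_k w_k N(x; mean_k, var_k), via log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # subtract the largest term for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component mixture for one HMM state over 2-D features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [2.0, 2.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frame = [0.0, 0.0]
state_log_likelihood = gmm_log_likelihood(frame, weights, means, variances)
```

Maximum likelihood training of a GMM-HMM system adjusts the weights, means, and variances so that quantities of this kind, accumulated over the training data, are as large as possible.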
More recent research in ASR has explored layered architectures to perform speech recognition, motivated partly by the desire to capitalize on some analogous properties in the human speech generation and perception systems. In these studies, learning of model parameters (weights and weight biases corresponding to synapses in such layered architectures) has been one of the most prominent and difficult problems. In parallel with the development in ASR research, recent progress in learning methods from neural network research has ignited interest in exploration of deep neural networks (DNNs). A DNN is a densely connected directed belief network with many hidden layers. In general, a DNN can be considered a highly complex, nonlinear feature extractor with a plurality of layers of hidden units and at least one layer of visible units, where each layer of hidden units is trained to represent features that capture higher-order correlations in original input data.
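By way of illustration only, the layered structure described above can be sketched as a deterministic forward pass through a stack of fully connected layers. The sigmoid nonlinearity, the layer sizes, the random initialization, and the function names are all assumptions made for the example; the sketch shows how each layer of hidden units is computed from the layer beneath it and does not address how the parameters are learned.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights (n_out rows of n_in entries) and zero biases."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Propagate the visible units x upward through each hidden layer in turn."""
    activations = [x]
    for weights, biases in layers:
        prev = activations[-1]
        activations.append([
            sigmoid(sum(w * p for w, p in zip(row, prev)) + b)
            for row, b in zip(weights, biases)
        ])
    return activations

rng = random.Random(0)
sizes = [4, 8, 8, 3]  # hypothetical: 4 visible units, then three hidden layers
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
activations = forward([0.5, -1.0, 0.25, 2.0], layers)
```

Each entry of `activations` after the first is a nonlinear function of the entry before it, which is the sense in which higher layers can represent higher-order correlations in the original input data.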
Conventionally, ASR systems that utilize DNNs are trained to be speaker/channel independent. In other words, parameters (e.g., weights and weight biases) of the DNN are not learned with respect to a particular speaker and/or channel. This is for at least two reasons. First, it is often difficult to obtain a sufficient amount of training data to robustly learn the parameters for a speaker and/or channel, as most users do not desire to spend a significant amount of time providing labeled utterances to train an ASR system. Second, DNNs typically have many more parameters due to wider and deeper hidden layers, and also have a much larger output layer that is designed to model senones directly. This makes adapting a DNN utilized in connection with speech recognition a relatively difficult task.
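By way of illustration only, the scale of the adaptation problem can be seen by counting the weights and weight biases of a fully connected DNN. The layer sizes below (429 inputs, five hidden layers of 2048 units, and 6000 senone outputs) are hypothetical values chosen for the example and are not taken from any particular system.

```python
def count_parameters(layer_sizes):
    """Weights plus weight biases for a fully connected stack of layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Hypothetical sizes: input features, five hidden layers, senone output layer.
sizes = [429] + [2048] * 5 + [6000]
total = count_parameters(sizes)                  # all layers
output_layer = count_parameters([2048, 6000])    # final layer only
```

Under these assumed sizes the network has roughly 30 million parameters, of which about 12.3 million belong to the output layer alone, whereas a typical user supplies only a small number of labeled utterances from which such parameters could be adapted.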