Speech recognition has been the subject of a significant amount of research and commercial development. For example, speech recognition systems have been incorporated into mobile telephones, desktop computers, automobiles, and the like in order to provide a particular response to speech input provided by a user. For instance, in a mobile telephone equipped with speech recognition technology, a user can speak a name of a contact listed in the mobile telephone and the mobile telephone can initiate a call to the contact. Furthermore, many companies are currently using speech recognition technology to aid customers in connection with identifying employees of a company, identifying problems with a product or service, etc.
Research in automatic speech recognition (ASR) has explored layered architectures to perform speech recognition, motivated partly by the desire to capitalize on some analogous properties in the human speech generation and perception systems. In these studies, learning of model parameters has been one of the most prominent and difficult problems. In parallel with the development in ASR research, recent progresses made in learning methods from neural network research has ignited interest in exploration of deep-structured models. One particular advance is the development of effective learning techniques for deep belief networks (DBNs), which are densely connected, directed belief networks with many hidden layers. In general, DBNs can be considered as a highly complex nonlinear feature extractor with a plurality of layers of hidden units and at least one layer of visible units, where each layer of hidden units learns to represent features that capture higher order correlations in original input data.
While DBNs have been shown to be powerful in connection with performing recognition/classification tasks, training DBNs has proven to be somewhat difficult. In particular, conventional techniques for training DBNs involve the utilization of a stochastic gradient descent learning algorithm. While this learning algorithm has been shown to be powerful in connection with fine-tuning weights assigned to a DBN, such learning algorithm is extremely difficult to parallelize across machines, causing learning to be somewhat tedious.