Many different types of computer-implemented recognition systems exist, wherein such recognition systems are configured to perform some form of classification with respect to input data. For example, computer-implemented speech recognition systems are configured to receive spoken utterances of a user and recognize words in the spoken utterances. In another example, handwriting recognition systems have been developed to receive a handwriting sample and identify, for instance, an author of the handwriting sample, individual letters in the handwriting sample, words in the handwriting sample, etc. In still yet another example, computer-implemented recognition systems have been developed to perform facial recognition, fingerprint recognition, and the like.
In many of these recognition systems, a respective recognition system is trained to recognize a relatively small number of potential labels. For instance, many speech recognition systems are trained to recognize words in a relatively small vocabulary (e.g., on the order of hundreds of words). Many conventional speech recognition systems that are configured to recognize words in a large vocabulary (on the order of thousands or tens of thousands of words) consume a significant amount of computer-readable memory, and additionally tend to require personalized training to accurately recognize words in spoken utterances. For many applications, however, it is impractical to obtain requisite training data to cause recognitions output by the recognition system to be sufficiently accurate. For instance, in a call-center scenario where words in a relatively large vocabulary are desirably recognized, it is impractical to train a speech recognition model for each individual customer, and is further impractical to ask each customer to devote several minutes to help train the speech recognition system.
Recently, deep neural networks have been studied as a potential technology that can robustly perform speech recognition without requiring a large amount of individualized training data. A deep neural network (DNN) is a deterministic model that includes an observed data layer and a plurality of hidden layers, stacked one on top of another. Each hidden layer includes a plurality of hidden units (neurons), which are coupled to other hidden units or an output label by way of weighted synapses. The output of a DNN is a probability distribution over potential labels corresponding to the input data. While DNNs have shown great promise in performing recognition tasks, improved robustness with respect to distortion, speaker identity, accent, and the like, is desired.