A speech recognition system analyzes a user's speech to determine what the user said. Most speech recognition systems are frame-based. In a frame-based system, input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the input speech.
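As an illustration of frame-based processing, the following is a minimal sketch of how a sampled signal might be cut into short overlapping windows and reduced to one feature vector per window. The 25 ms window, 10 ms step, 16 kHz sample rate, and the two toy features (log energy and zero-crossing rate) are assumptions chosen for brevity. Practical systems typically use richer features such as mel-frequency cepstral coefficients. The function names are hypothetical.

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Split samples into overlapping frames.

    The defaults assume 16 kHz audio: 400 samples = 25 ms, 160 samples = 10 ms.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def frame_features(frame):
    """Return a toy 2-dimensional feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (len(frame) - 1)
    return [log_energy, zcr]

# Synthetic input: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
signal = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(signal)
features = [frame_features(f) for f in frames]  # one feature vector per frame
```

Each entry of `features` plays the role of one speech feature frame: a small vector summarizing the signal over a short time window.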
The speech recognition system compares the input speech frames against statistical models to find the models that best match the speech feature characteristics, and then determines the representative text or semantic meaning associated with those models. Modern statistical models are state sequence models, such as Hidden Markov Models (HMMs), that model speech sounds (usually phonemes) using mixtures of Gaussian distributions.
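The matching step can be sketched as scoring a single feature frame against Gaussian mixture models and picking the best-scoring one. This is a simplified illustration only: the phoneme labels, mixture weights, means, and variances below are made-up values, diagonal covariances are assumed, and a full HMM-based recognizer would also combine these per-frame scores with state transition probabilities over the whole frame sequence (for example with Viterbi decoding).

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of frame x under a Gaussian mixture.

    Uses the log-sum-exp trick so that very small densities do not underflow.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component mixtures for two phoneme models (illustrative numbers).
# Each entry is (weights, component means, component variances).
models = {
    "aa": ([0.6, 0.4], [[1.0, 0.0], [1.5, 0.5]], [[0.5, 0.5], [0.3, 0.3]]),
    "s": ([0.5, 0.5], [[-2.0, 3.0], [-1.5, 2.5]], [[0.4, 0.4], [0.6, 0.6]]),
}

frame = [1.2, 0.1]  # one 2-dimensional speech feature frame
scores = {name: gmm_log_likelihood(frame, *params) for name, params in models.items()}
best = max(scores, key=scores.get)  # model that best matches this frame
```

Here the frame lies close to the component means of the hypothetical "aa" model, so that model receives the higher log-likelihood.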
Many speech recognition systems use discriminative training techniques, which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. One example of such a discriminative training technique is the Deep Neural Network (DNN).
A DNN is a feed-forward artificial neural network that has more than one layer of hidden units between its inputs and its outputs. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters. This makes them capable of modeling very complex and highly non-linear relationships between inputs and outputs, which is important for high-quality acoustic modeling.
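The structure described above can be sketched as a small feed-forward pass that maps one feature frame to a probability distribution over output classes. This is a toy sketch under stated assumptions: the layer sizes, the random untrained weights, the ReLU hidden activation, and the softmax output are all illustrative choices not specified in the text, and the helper names are hypothetical.

```python
import math
import random

def dense(inputs, weights, biases):
    """Affine layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

def relu(values):
    """Non-linear hidden activation (an assumed choice; sigmoid is also common)."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Convert output scores to probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(rng, n_in, n_out):
    """Random (untrained) weights and zero biases for one layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(frame, layers):
    """Pass a feature frame through the hidden layers, then the output layer."""
    activation = frame
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))

rng = random.Random(0)
sizes = [4, 8, 8, 3]  # 4 inputs, two hidden layers of 8 units, 3 output classes
layers = [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]

posteriors = forward([0.2, -1.0, 0.5, 0.3], layers)  # one probability per class
```

Even this toy network has 4·8 + 8·8 + 8·3 = 120 weights plus 19 biases. Adding hidden layers or widening them grows the parameter count quickly, which is the source of the flexibility noted above.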