Speech recognition involves translating spoken words into text. Advances over the last decade have led to its use in a growing number of applications and services. For example, voice-enabled systems use speech recognition to allow a user to search the Internet or control functions of an automobile using a voice command.
One approach for performing computerized speech recognition involves using an acoustic model to model the relationship between an acoustic feature vector and a phoneme, a basic unit of a language's phonology. Acoustic feature vectors are numerical representations of acoustic speech that are determined by sampling and filtering speech data. For each feature vector, an acoustic model may output a statistical representation (e.g., a joint probability distribution) indicating a likelihood that the feature vector corresponds to a particular phoneme. From the likelihoods output by the acoustic model, a decoder may then be used to classify observed sequences of phonemes as one or more words.
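The pipeline described above (feature vectors, then per-phoneme likelihoods, then a decoder) can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in chosen for illustration: the two-dimensional feature vectors, the single-Gaussian phoneme models in `PHONEME_MODELS`, the `LEXICON` table, and the greedy frame-by-frame decoder. It assumes a uniform prior over phonemes, and a practical decoder would normally search over HMM state sequences and apply a language model rather than pick the best phoneme per frame.

```python
import math

# Hypothetical acoustic model: one diagonal Gaussian per phoneme over
# 2-dimensional feature vectors (real systems use many more dimensions).
PHONEME_MODELS = {
    "k": {"mean": [1.0, 0.0], "var": [0.5, 0.5]},
    "ae": {"mean": [0.0, 2.0], "var": [0.5, 0.5]},
    "t": {"mean": [-1.0, -1.0], "var": [0.5, 0.5]},
}

# Hypothetical pronunciation lexicon mapping phoneme sequences to words.
LEXICON = {("k", "ae", "t"): "cat", ("ae", "t"): "at"}


def log_gaussian(x, mean, var):
    """Log density of x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def phoneme_posteriors(x):
    """Normalized per-phoneme scores for one feature vector (uniform prior assumed)."""
    logs = {p: log_gaussian(x, m["mean"], m["var"]) for p, m in PHONEME_MODELS.items()}
    peak = max(logs.values())  # subtract the peak for numerical stability
    exps = {p: math.exp(value - peak) for p, value in logs.items()}
    total = sum(exps.values())
    return {p: e / total for p, e in exps.items()}


def decode(frames):
    """Greedy decoder: best phoneme per frame, merge repeats, look up the word."""
    best = []
    for x in frames:
        post = phoneme_posteriors(x)
        best.append(max(post, key=post.get))
    collapsed = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]
    return LEXICON.get(tuple(collapsed))  # None if the sequence is not in the lexicon


frames = [[0.9, 0.1], [1.1, -0.2], [0.1, 1.9], [-0.2, 2.1], [-1.0, -0.9]]
word = decode(frames)
```

Merging repeated frame labels is a simplification: it cannot represent a word in which the same phoneme occurs twice in a row, which is one reason sequence models such as HMMs are used in practice.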
Multiple types of acoustic models for speech recognition exist. In many examples, Hidden Markov Models (HMMs) are used to model the sequential structure of speech signals. For instance, a Gaussian mixture model (GMM) may be associated with each HMM state and used to determine how well that state fits a feature vector or sequence of feature vectors. An alternative way to evaluate the fit is to use a feedforward neural network that receives a sequence of feature vectors as input and derives posterior probabilities over HMM states as output.
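The two ways of scoring HMM states mentioned above can be sketched side by side. This is an illustration under stated assumptions, not a description of any particular system: the state names, mixture parameters, and network weights are hypothetical, the GMMs use diagonal covariances, and the network has a single ReLU hidden layer. The network input `x` stands for a window of feature vectors concatenated into one flat vector.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def gmm_log_likelihood(x, components):
    """log p(x | state) for a diagonal-covariance GMM.

    components is a list of (weight, mean, var) tuples whose weights sum to 1.
    """
    terms = []
    for weight, mean, var in components:
        log_density = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mean, var)
        )
        terms.append(math.log(weight) + log_density)
    return log_sum_exp(terms)


def feedforward_posteriors(x, w_hidden, b_hidden, w_out, b_out):
    """One ReLU hidden layer followed by a softmax over HMM states."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    norm = log_sum_exp(logits)
    return [math.exp(logit - norm) for logit in logits]


# Hypothetical two-state example: each state owns its own mixture.
STATE_GMMS = {
    "s0": [(0.6, [0.0, 0.0], [1.0, 1.0]), (0.4, [1.0, 1.0], [1.0, 1.0])],
    "s1": [(1.0, [3.0, 3.0], [1.0, 1.0])],
}
x = [0.2, 0.1]
gmm_scores = {s: gmm_log_likelihood(x, comps) for s, comps in STATE_GMMS.items()}
best_state = max(gmm_scores, key=gmm_scores.get)

# Hypothetical identity-like weights, chosen only so the result is easy to check.
posteriors = feedforward_posteriors(
    [2.0, 0.0],
    w_hidden=[[1.0, 0.0], [0.0, 1.0]], b_hidden=[0.0, 0.0],
    w_out=[[1.0, 0.0], [0.0, 1.0]], b_out=[0.0, 0.0],
)
```

The two functions return different quantities. The GMM gives a likelihood of the feature vector given a state, while the network gives a posterior over states given the features. Hybrid systems commonly divide the posteriors by the state priors before HMM decoding, a step not shown here.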
Traditionally, languages and dialects within languages are considered independently. For each language (and each dialect), a separate acoustic model is trained from scratch. The language-specific acoustic model is then used to recognize speech spoken in that particular language.