Automatic speech recognition (ASR) is the process of recognizing spoken language and translating it into the corresponding text. Typically, a weighted finite state transducer (WFST) based model is used to translate the input audio into an output text. A finite state transducer is a finite state machine (FSM) that accepts an input label on a transition/arc and can also emit an output label on the same arc. A weighted finite state transducer further adds a weighted score to each transition/arc. Such a WFST is able to translate various input labels into output labels according to the rules set within its states and transitions. Thus, such a WFST can take as input the acoustic units of speech (such as phonemes or HMM states) and output the text of that speech.
The WFST model for ASR is typically split into various components that model the intermediate steps needed to translate the spoken audio. An acoustic model is usually used to translate the spoken audio into phones, a lexical model is used to convert phones into words, and a language model is used to combine these words into sentences. A phone is a small acoustic unit of spoken language that corresponds to one of the standard sounds of a language (e.g., the word “want” may be represented by the phones “w ao n t”). As the system only has the audio observations of the spoken audio, it cannot know with absolute certainty the phones that correspond to these audio features. Instead, the system attempts to find the most likely text given the audio observations of the speech. Such a model, which attempts to predict the most likely states that generated the observations, is typically based on Hidden Markov Models (HMMs).
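The lexical step above can be illustrated with a toy sketch. This is a hypothetical one-entry lexicon, not an actual lexical model; it only shows the mapping direction (phones to words) using the “want” example:

```python
# Toy sketch of the lexical step (hypothetical one-entry lexicon):
# the acoustic model proposes a phone sequence, and the lexicon
# maps that sequence to a word.
lexicon = {("w", "ao", "n", "t"): "want"}

phones = ["w", "ao", "n", "t"]   # e.g., phones proposed by the acoustic model
print(lexicon[tuple(phones)])    # want
```

A real lexical model is itself a WFST over many pronunciations, and the decoder must handle uncertainty in the phone sequence rather than a single exact match.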
An HMM is an FSM with state transition probabilities and emission (or observation) probabilities. A state transition probability represents the probability of transitioning from one state to another. An emission probability for an observation is the probability that a state will “emit,” or generate, that observation. These probability values may be discovered for a particular system by a training process that uses training data. This training data includes observations along with the known states that generated them. After training, a decoding process, using a set of new observations, may traverse the HMM to discover the most likely set of states that generated those observations. For example, after an HMM modeling the mapping from acoustic features to the phones of a language has been trained, a decoding process may be run on a new set of audio (i.e., spoken words/sounds) to discover the most likely states and transitions that generated these observations. These states and transitions are associated with various phones in the language. If using a WFST, the input labels of this HMM WFST would be the acoustic features (the observations), the output labels would be the phones, and the weights of each transition would be the state transition probabilities.
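The two kinds of probabilities can be sketched as simple tables. All numbers, state names, and observation labels here are hypothetical; a trained model would learn these values from data:

```python
# Minimal sketch of an HMM's parameters (hypothetical values).
# Transition probabilities: trans[s1][s2] = P(transition from s1 to s2)
trans = {
    "w":  {"w": 0.6, "ao": 0.4},
    "ao": {"w": 0.3, "ao": 0.7},
}
# Emission probabilities: emit[s][o] = P(observation o | state s)
emit = {
    "w":  {"obs_a": 0.8, "obs_b": 0.2},
    "ao": {"obs_a": 0.1, "obs_b": 0.9},
}

# Probability that state "w" transitions to "ao" and "ao" emits "obs_b":
p = trans["w"]["ao"] * emit["ao"]["obs_b"]
print(round(p, 2))  # 0.36
```

Decoding chains such products along every step of a candidate path, which is what the Viterbi algorithm described below optimizes.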
Although there may not be a large number of phones in a particular language (e.g., English is typically modeled with 40 phones), human speech patterns can vary greatly based on context, intonation, prosody, etc. Thus, to model speech more accurately, the model typically accounts for these factors. For example, HMMs in ASR typically model a phone in the context of its surrounding phones (e.g., a tri-phone model), as the pronunciation of a phone may differ based on its context. Furthermore, each phone is typically modeled with a beginning, middle, and end state. Thus, the number of states needed to model a complete set of phones in a language can easily reach into the thousands.
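A back-of-the-envelope count shows why the state space grows so quickly. This sketch uses the 40-phone figure from above and counts every possible tri-phone context; it is an upper bound, as real systems cluster (tie) many of these states together:

```python
# Hypothetical upper bound on HMM states for a tri-phone model.
n_phones = 40                  # typical English phone inventory
n_triphones = n_phones ** 3    # every (left, center, right) context
states_per_phone = 3           # beginning, middle, and end states

print(n_triphones * states_per_phone)  # 192000
```

Even after state tying reduces this count dramatically, the resulting model still easily contains thousands of distinct states.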
Furthermore, the WFST modeling audio features to phones is also combined with the WFST models that determine the most likely words given a set of potential phones (the lexical model) and the most likely sentences given a set of potential words (the language model). These WFSTs are also trained using various data sets (e.g., a pronunciation dictionary, a language corpus).
The composition of these component WFSTs creates a very large WFST that takes acoustic features as input and produces text as output. Such a WFST may easily include hundreds of thousands of states, and finding the most likely path within it in order to discover the most likely sequence of words for an audio sequence can easily become very computationally expensive.
Many methods are used to optimize this expensive process of finding the most likely path in the WFST given the observations. For example, instead of a brute-force search, a method known as Viterbi decoding, using the Viterbi algorithm, is used to determine the most likely path in the WFST. The Viterbi algorithm traverses the WFST, and for each observation (i.e., each iteration), it calculates the probability of each transition/arc originating from an old state to a new state given the observation (i.e., P(old state)·P(transition from old to new state)·P(observation|new state), where P(old state) is the accumulated probability score of the path taken, as calculated in the previous iteration; some systems may use the log of these values, which simplifies the multiplicative operations to additions). As used here, old and new states refer to the states the algorithm is considering during one iteration of the algorithm, not distinct states in the FSM (i.e., an old and a new state could be the same state). After the algorithm computes the scores for each transition from all old states, it selects, for each new state, the path from any old state that has the best score. These steps are repeated until all the observations are processed. At this point, the path with the best final accumulated score is the most likely path given the observations, and the output labels along this path are the most likely set of output labels given the observations made.
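The iteration just described can be sketched directly. This is a minimal log-domain Viterbi over a hypothetical two-state toy model (the state names, probabilities, and observation labels are invented for illustration), not a production decoder:

```python
import math

def viterbi(states, start_p, trans_p, emit_p, observations):
    # Initialize: P(state) * P(first observation | state), in log domain,
    # so the multiplicative score becomes a sum.
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
              for s in states}
    paths = {s: [s] for s in states}          # best path ending in each state
    for obs in observations[1:]:              # one iteration per observation
        new_scores, new_paths = {}, {}
        for new in states:
            # Score every arc from an old state into this new state and
            # keep only the best-scoring incoming path.
            best_old = max(states, key=lambda old: scores[old]
                           + math.log(trans_p[old][new]))
            new_scores[new] = (scores[best_old]
                               + math.log(trans_p[best_old][new])
                               + math.log(emit_p[new][obs]))
            new_paths[new] = paths[best_old] + [new]
        scores, paths = new_scores, new_paths
    best_final = max(states, key=lambda s: scores[s])
    return paths[best_final]

# Hypothetical two-state toy model ("w" and "ao" phone states):
states = ["w", "ao"]
start_p = {"w": 0.6, "ao": 0.4}
trans_p = {"w": {"w": 0.6, "ao": 0.4}, "ao": {"w": 0.3, "ao": 0.7}}
emit_p = {"w":  {"obs_a": 0.8, "obs_b": 0.2},
          "ao": {"obs_a": 0.1, "obs_b": 0.9}}

path = viterbi(states, start_p, trans_p, emit_p, ["obs_a", "obs_b"])
print(path)  # ['w', 'ao']
```

Note that each new state keeps only one incoming path per iteration, which is what makes Viterbi cheaper than enumerating every path through the machine.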
However, although plain Viterbi decoding is an improvement over brute-force search, given the very large WFST needed to model a language, Viterbi decoding is still computationally expensive, especially on low-power devices.
Thus, further optimizations can be made. One such optimization is beam pruning. Beam pruning shrinks the search space by discarding those paths in the current iteration whose scores are worse than the best score by more than a threshold value (the beam width). Using beam pruning, the decoding process does not need to traverse paths whose current scores make them unlikely to eventually yield the best-scoring path. However, in order to find the best score, the beam pruning method must first score all the arcs in the current iteration before it can discard those paths that fall outside the threshold. This requires that the “candidate” paths and their scores be stored before potentially being discarded.
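The pruning step for a single iteration can be sketched as follows. The path names, scores, and beam width are hypothetical; the point is that every candidate must be scored and stored before the best score, and thus the pruning threshold, is known:

```python
def beam_prune(candidate_scores, beam_width):
    """Sketch of beam pruning over one iteration's candidates.

    candidate_scores: dict mapping a candidate path to its accumulated
    log score; beam_width: how far below the best score a path may fall
    before it is discarded.
    """
    # All candidates must already be scored before the best is known.
    best = max(candidate_scores.values())
    return {path: s for path, s in candidate_scores.items()
            if s >= best - beam_width}

# Hypothetical accumulated log scores for three candidate paths:
candidates = {"path_a": -5.0, "path_b": -7.5, "path_c": -20.0}
survivors = beam_prune(candidates, beam_width=10.0)
print(sorted(survivors))  # ['path_a', 'path_b']  (path_c falls outside the beam)
```

Only the survivors are extended in the next iteration, which is where the savings come from.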
Another optimization is to compose, or combine, the different parts of the entire WFST model dynamically during the decoding process. When combined statically, states in a “left” WFST are joined to states in a “right” WFST using transitions or arcs according to FSM composition rules. The “left” WFST outputs symbols that correspond to the input labels of the “right” WFST. Thus, the composition creates a unified WFST that maps the input labels of the “left” WFST to the output labels of the “right” WFST. However, static composition has the potential of creating a composed WFST that is very large. Instead, dynamic composition joins the “right” WFST to the “left” WFST only as the “left” WFST is decoded. This may save a significant amount of space.
However, this creates additional computational complexity as the searching of arcs in the “right” WFST must be done on the fly. To improve the speed of this operation, this search is facilitated with the use of an arc cache, which caches mappings of states and input labels to arcs. For example, the input label at a state ID of 2 may be “3”. Upon a hit in the arc cache for this state and input label, the system may determine that the corresponding arc ID is 4 (i.e., from state 2, arc 4 accepts the input label “3”). Thus, instead of searching through the “right” WFST repeatedly, the system may simply consult the arc cache.
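The cache lookup can be sketched with a plain dictionary. The state ID 2, input label “3”, and arc ID 4 come from the example above; the search function and counter are hypothetical stand-ins for a real on-the-fly WFST search:

```python
# Sketch of an arc cache for dynamic composition:
# maps (state ID, input label) -> arc ID.
arc_cache = {}

def find_arc(state_id, input_label, wfst_search):
    """Return the arc accepting input_label at state_id, caching the result."""
    key = (state_id, input_label)
    if key not in arc_cache:                  # cache miss: search the WFST
        arc_cache[key] = wfst_search(state_id, input_label)
    return arc_cache[key]                     # cache hit: no search needed

# Hypothetical stand-in for the expensive "right" WFST search:
# from state 2, arc 4 accepts the input label "3".
search_count = 0
def slow_search(state_id, input_label):
    global search_count
    search_count += 1
    return {(2, "3"): 4}.get((state_id, input_label))

print(find_arc(2, "3", slow_search))  # 4  (cache miss: one search performed)
print(find_arc(2, "3", slow_search))  # 4  (cache hit: served from the cache)
print(search_count)                   # 1
```

Repeated lookups for the same state and input label thus cost a dictionary access instead of a search through the “right” WFST.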
However, WFSTs may include “epsilon” arcs, typically to model backoff, a method of estimating a score for inputs that are not modeled or available in the WFST. These arcs do not accept an input label (they accept an “epsilon”) and must be followed immediately. Due to these epsilon arcs, the cache may not return a result for a given state and input label, even though the input label may be reachable after following the epsilon arc.
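One way to handle this, sketched below with hypothetical state IDs and labels, is to fall back through the epsilon arc whenever no arc at the current state accepts the input label, and retry the lookup at the epsilon arc's destination:

```python
# Sketch of epsilon (backoff) arc handling with hypothetical states/labels.
EPSILON = None  # an epsilon arc accepts no input label

# arcs[state][label] -> destination state. An EPSILON entry is the state's
# backoff arc, which must be followed immediately when no arc matches.
arcs = {
    0: {"a": 1, EPSILON: 2},   # state 0 backs off to state 2
    2: {"b": 3},               # "b" is only reachable via the epsilon arc
}

def next_state(state, label):
    while state is not None:
        out = arcs.get(state, {})
        if label in out:               # an arc here accepts the label
            return out[label]
        state = out.get(EPSILON)       # otherwise follow the backoff arc
    return None                        # label not reachable at all

print(next_state(0, "b"))  # 3: found only after following the epsilon arc
```

A plain (state, label) cache lookup at state 0 would miss for “b”, even though “b” is accepted after the backoff; the decoder therefore needs epsilon-aware handling on a miss.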