Finite-state transducers (FSTs) have been used in automatic speech recognition (ASR) as a tool for mapping representations of input speech to recognized forms, such as text. An FST can be constructed to “transduce” input of any form to output of any other form; thus, FSTs can be used to represent any of various components of an ASR system, such as an acoustic model that maps input acoustic features of speech to phonemic representations, a phonetic model that maps a phonemic representation of input speech to words, a linguistic model that maps words to recognizable word sequences (e.g., sentences), etc.
In general, an FST can be represented as a directed graph structure in which states are connected by arcs, each arc having a specified direction of transition (e.g., from a source state to a target state) and one or more labels. The labels may include an input and an output, and in some cases a weight, although some arcs may lack an input, an output, and/or a weight. In traversing an FST as a component of speech recognition, the ASR system transitions from state to state within the FST by following arcs selected by the received speech input. At a particular state in the FST, the next input (e.g., acoustic frame, phone, or word) in the received sequence determines which arc to follow out of the current state, and that arc's target (destination) state becomes the next state to process in the FST. A particular path of arcs and states from an initial state to a final state in the FST corresponds to a particular input sequence (given by the input labels of the arcs along the path), and the output labels of the arcs in that path determine the output sequence (e.g., the recognition result) provided by the FST for that particular input.
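As an illustration only, the graph structure and traversal described above might be sketched as follows. The names `Arc`, `Fst`, `add_arc`, `transduce`, and `EPS`, the toy phone symbols, and the weights are hypothetical and chosen for this example; the sketch assumes at most one arc per input symbol leaves each state and that weights are combined by addition, neither of which is required of FSTs in general.

```python
from dataclasses import dataclass, field

EPS = ""  # hypothetical marker for an arc that emits no output label


@dataclass
class Arc:
    ilabel: str    # input label consumed when the arc is followed
    olabel: str    # output label emitted (EPS if the arc has no output)
    weight: float  # cost of following the arc (0.0 if unweighted)
    target: int    # destination state of the arc


@dataclass
class Fst:
    start: int
    finals: set
    arcs: dict = field(default_factory=dict)  # source state -> list of Arc

    def add_arc(self, source, ilabel, olabel, target, weight=0.0):
        self.arcs.setdefault(source, []).append(Arc(ilabel, olabel, weight, target))

    def transduce(self, inputs):
        """Follow one arc per input symbol; return (outputs, weight) or None."""
        state, outputs, total = self.start, [], 0.0
        for symbol in inputs:
            # The next input symbol selects the arc leaving the current state.
            candidates = self.arcs.get(state, [])
            match = next((a for a in candidates if a.ilabel == symbol), None)
            if match is None:
                return None  # no arc out of this state accepts the symbol
            if match.olabel != EPS:
                outputs.append(match.olabel)
            total += match.weight
            state = match.target  # the arc's target is the next state to process
        # The path is accepted only if it ends in a final state.
        return (outputs, total) if state in self.finals else None


# Toy phone-to-word transducer: "k ae t" -> "cat", "k ae b" -> "cab".
lexicon = Fst(start=0, finals={3})
lexicon.add_arc(0, "k", EPS, 1)
lexicon.add_arc(1, "ae", EPS, 2)
lexicon.add_arc(2, "t", "cat", 3, weight=0.5)
lexicon.add_arc(2, "b", "cab", 3, weight=1.5)

result = lexicon.transduce(["k", "ae", "t"])
```

In this sketch the first two arcs carry an input label but no output, and the word label is emitted only on the final arc of each pronunciation, which mirrors the statement that some arcs may lack an output.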
Further background on FSTs and their use in ASR can be found in Mohri et al., “Speech Recognition with Weighted Finite-State Transducers,” in Rabiner and Juang (eds.), Handbook on Speech Processing and Speech Communication, Part E: Speech Recognition, Springer-Verlag, 2008. This publication is incorporated herein by reference in its entirety.
In some ASR systems, a number of FSTs may be compiled together into a single graph-based data structure for use during recognition. For example, FSTs representing different recognizable word sequences may be compiled into a single FST representing the grammar or language model of the ASR system. As another example, FSTs representing phonetic models of word pronunciations may be compiled with one or more FSTs representing linguistic models of recognizable word sequences, to create an integrated FST that transduces from a phone-level input sequence to a recognized word sequence (e.g., sentence). In yet another example, acoustic models, such as hidden Markov models (HMMs), may be compiled with phonetic models, or with both phonetic and linguistic models, to create an integrated graph structure that transduces from an input sequence of acoustic frames to recognized words, or to a recognized word sequence (e.g., one or more sentences).
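One common way of combining two such models into an integrated FST is transducer composition, in which the output labels of a first FST are matched against the input labels of a second. The following is a minimal sketch under simplifying assumptions, not a description of any particular ASR system: both FSTs are epsilon-free (every arc has an input and an output), weights are added, each FST is a hypothetical `(start, finals, arcs)` tuple, and whole pronunciations are treated as single input symbols. Practical toolkits additionally handle arcs without labels, which this sketch omits.

```python
from collections import deque


def compose(a, b):
    """Compose epsilon-free FSTs a and b, each a (start, finals, arcs) tuple.

    arcs maps a source state to a list of (ilabel, olabel, weight, target).
    """
    a_start, a_finals, a_arcs = a
    b_start, b_finals, b_arcs = b
    start = (a_start, b_start)
    arcs, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        state = queue.popleft()
        qa, qb = state
        # A paired state is final only if both component states are final.
        if qa in a_finals and qb in b_finals:
            finals.add(state)
        for ia, oa, wa, ta in a_arcs.get(qa, []):
            for ib, ob, wb, tb in b_arcs.get(qb, []):
                if oa != ib:
                    continue  # output of a must match input of b
                target = (ta, tb)
                arcs.setdefault(state, []).append((ia, ob, wa + wb, target))
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return start, finals, arcs


def transduce(fst, inputs):
    """Follow the first matching arc per input symbol; None if rejected."""
    start, finals, arcs = fst
    state, outputs, total = start, [], 0.0
    for symbol in inputs:
        match = next((arc for arc in arcs.get(state, []) if arc[0] == symbol), None)
        if match is None:
            return None
        _, olabel, weight, target = match
        outputs.append(olabel)
        total += weight
        state = target
    return (outputs, total) if state in finals else None


# Toy phonetic model: one state with a self-loop per pronunciation.
phonetic = (0, {0}, {0: [("dh_ax", "the", 0.0, 0),
                         ("k_ae_t", "cat", 0.0, 0),
                         ("s_ae_t", "sat", 0.0, 0)]})

# Toy linguistic model accepting only the word sequence "the cat sat".
grammar = (0, {3}, {0: [("the", "the", 0.5, 1)],
                    1: [("cat", "cat", 0.25, 2)],
                    2: [("sat", "sat", 0.25, 3)]})

integrated = compose(phonetic, grammar)
sentence = transduce(integrated, ["dh_ax", "k_ae_t", "s_ae_t"])
```

The integrated FST maps pronunciation-level input directly to a word sequence and rejects any input whose words do not form a sequence permitted by the linguistic model, which is the effect attributed above to compiling phonetic and linguistic models together.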