1. Field of Invention
This invention relates generally to automatic speech recognition. In particular, this invention relates to determinizing non-deterministic weighted finite-state automata used in automatic speech recognition. More specifically, this invention relates to generating approximate weighted finite-state automata from non-deterministic weighted finite-state automata.
2. Description of Related Art
One of the central problems in automatic speech recognition (ASR) is to reduce the sizes of the lattices that represent hypotheses of spoken utterances. As shown in FIG. 1, automatic speech recognition can be viewed as a processing pipeline or cascade. In each step of the processing cascade, one or two lattices are input and composed to produce an output lattice. In automatic speech recognition, the term "lattice" denotes a directed and labeled graph, which is possibly weighted. In each lattice, there is typically a designated start node "s" and a designated final node "t". Each possible pathway through the lattice from the start node s to the final node t induces a hypothesis based on the arc labels between each pair of nodes in the path. For example, in a word lattice, the arc labels are words and the various paths between the start node s and the final node t form sentences.
As shown in FIG. 1, an automatic speech recognition system 100 includes a signal processing system 110, an acoustic model lattice 120, a phonetic recognition subsystem 130, a lexicon lattice 140, a word recognition subsystem 150, a grammar lattice 160, and a task recognition subsystem 170. In operation, uttered speech is input via a microphone (not shown) which converts the sound waves of the uttered speech into an electronic speech signal. The electronic speech signal is input to the signal processing system on a speech signal input line 105. The signal processing subsystem 110 digitizes the electronic speech signal to generate a feature vector lattice 115. The feature vector lattice 115 is a lattice of acoustic feature vectors. The feature vector lattice 115 is input along with the acoustic model lattice 120 to the phonetic recognition subsystem 130. The acoustic model lattice 120 represents a set of acoustic models and is applied to transform the feature vector lattice 115 into a phone lattice. Each node of the phone lattice represents a spoken sound, such as, for example, the vowel /e/ in "bed".
The phone lattice 135 is input along with the lexicon lattice 140 into the word recognition subsystem 150. The lexicon lattice 140 describes different pronunciations of various words and transforms the phone lattice 135 into a word lattice 155. The word lattice 155 is then input, along with the grammar lattice 160, into the task recognition subsystem 170. The grammar lattice 160 represents task-specific information and is used to extract the most likely sequence of words from the word lattice 155. Thus, the task recognition subsystem 170 uses the grammar lattice 160 to extract the most likely sentence from the word lattice 155. The most likely sequence is output as the recognized text 175.
In particular, one conventional method of implementing automatic speech recognition forms each of the acoustic model lattice 120, the lexicon lattice 140 and the grammar lattice 160 as a finite-state transducer. Thus, each of the phonetic recognition subsystem 130, the word recognition subsystem 150, and the task recognition 170 performs a generalized composition operation between its input finite-state transducers. In addition, the signal processing subsystem 110 outputs the features vector lattice 115 as a finite-state transducer.
FIGS. 2A-2C show more generalized forms of the automatic speech recognition transduction cascade. In particular, in automatic speech recognition, the finite-state transducers are weighted finite-state automata, which represent and compute membership information on weighted languages such as, for example, sets of words or sentences. In particular, the size of a weighted finite-state automata is critical to automatic speech recognition. Conventionally, analogs to classical automata determinization and minimization are used to reduce weighted finite-state automata size. However, not all weighted finite-state automata can be determinized. Furthermore, for those that can be determinized, the equivalent deterministic weighted finite-state automaton can be exponentially larger. The nature of automatic speech recognition, however, appears to constrain the weighted finite-state automata so that determinization is not only possible, but often produces a deterministic weighted finite-state automaton that is smaller than the equivalent non-deterministic weighted finite-state automaton.
One method for determinizing and minimizing non-deterministic weighted finite-state automata is described in "Finite-State Transducers in Language and Speech Processing", M. Mohri, Computational Linguistics 23, 1997 (Mohri), herein incorporated by reference in its entirety. This method for determinizing and minimizing non-deterministic weighted finite-state automata generates significant average size reductions for the equivalent deterministic weighted finite-state automata. However, this method does not reduce all non-deterministic weighted finite-state automata, presenting a problem for real-time automatic speech recognition.