The present invention relates to speech recognition systems and methods using weighted-automata networks. More particularly, the present invention relates to such networks comprising transducers corresponding to speech recognition system elements, such as a pronunciation dictionary and acoustic, language and context-dependency models. Still more particularly, embodiments of the present invention relate to such networks in which context-dependency is reflected in efficient fully expanded models for large vocabularies.
Workers in the field of speech recognition have developed a number of useful systems, functionalities and techniques which, when used individually or in combination, increase the accuracy and speed of automatic machine recognition of input speech signals. For example, it has proven advantageous to identify and implement various models pertaining to speech and methods for recognizing speech. Thus, acoustic models, context-dependency models, language models and pronunciation dictionaries are often part of an overall speech recognition system.
One technique that has proven valuable in implementing speech recognition systems is the representation of the collection of models and elements in a speech recognition system as a network of finite-state transducers. See, for example, Berstel, J., Transductions and Context-Free Languages, Teubner Studienbücher, Stuttgart, Germany, 1979; Eilenberg, S., Automata, Languages, and Machines, Vol. A, Academic Press, San Diego, Calif., 1974; Kuich, W., and A. Salomaa, Semirings, Automata, Languages, No. 5 in EATCS Monographs on Theoretical Computer Science, Springer-Verlag, Berlin, Germany, 1986.
Such transducers are finite-state networks in which each arc is typically labeled with an input symbol, an output symbol and a weight (often a probability, or a negative log probability). Optionally, the input (output) symbol on an arc may be the null symbol, ε, indicating that the arc does not consume input (produce output). A path in a transducer pairs the concatenation of the input labels on its arcs with the concatenation of the corresponding output labels, assigning the pair the extend function of the arc weights (often the sum function).
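The path definition above can be illustrated with a minimal sketch. It assumes that weights are negative log probabilities, so that the extend function is addition, and that the null symbol is modeled as an empty string. The names Arc, EPS, path_mapping and example_path, as well as the phone and word labels, are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Tuple

EPS = ""  # assumption: the null symbol is modeled as the empty string


class Arc(NamedTuple):
    ilabel: str    # input symbol (EPS if the arc consumes no input)
    olabel: str    # output symbol (EPS if the arc produces no output)
    weight: float  # assumed to be a negative log probability


def path_mapping(path: List[Arc]) -> Tuple[Tuple[str, ...], Tuple[str, ...], float]:
    """Return the (input string, output string, weight) that a path defines."""
    # Null symbols contribute nothing to the concatenated label strings.
    inputs = tuple(a.ilabel for a in path if a.ilabel != EPS)
    outputs = tuple(a.olabel for a in path if a.olabel != EPS)
    # Extend function: here assumed to be the sum of the arc weights.
    weight = sum(a.weight for a in path)
    return inputs, outputs, weight


# A hypothetical dictionary-style path: four phones in, one word out.
example_path = [
    Arc("d", "data", 1.0),
    Arc("ey", EPS, 0.5),
    Arc("t", EPS, 0.25),
    Arc("ax", EPS, 0.25),
]

inputs, outputs, weight = path_mapping(example_path)
```

In this sketch the path pairs the input string d ey t ax with the single output symbol data, and the pair receives the sum of the four arc weights.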
The transducer representation of models provides a natural algorithm, composition, for combining multiple levels of modeling. The composition of two weighted transducers, S and T, is a transducer S ∘ T that assigns the weight w to the mapping from symbol sequence x to sequence z just in case there is some symbol sequence y such that S maps x to y with weight u, T maps y to z with weight v, and w=extend(u,v). The states of S ∘ T are pairs of a state of S and a state of T, and the arcs are built from pairs of arcs from S and T with paired origin and destination states such that the output of the S arc matches the input of the T arc (null-transition labels, ε, need to be handled specially). It is well known and readily demonstrated that the composition operator ∘ is associative, i.e., the grouping of successive ∘ operations does not affect the result.
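The state-pair construction described above can be sketched as follows. This is an illustrative simplification, not the method of any particular embodiment: it assumes that neither transducer contains null-symbol arcs (which, as noted, require special handling) and that the extend function is addition. The names Arc, Transducer, compose, S, T and ST are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, List, NamedTuple


class Arc(NamedTuple):
    ilabel: str
    olabel: str
    weight: float
    nextstate: Hashable


class Transducer(NamedTuple):
    start: Hashable
    finals: Dict[Hashable, float]     # final state -> final weight
    arcs: Dict[Hashable, List[Arc]]   # state -> outgoing arcs


def compose(s: Transducer, t: Transducer) -> Transducer:
    """Compose two null-free weighted transducers; extend is assumed to be +."""
    start = (s.start, t.start)
    arcs: Dict[Hashable, List[Arc]] = {}
    finals: Dict[Hashable, float] = {}
    queue = deque([start])
    visited = {start}
    while queue:
        state = queue.popleft()
        qs, qt = state  # a composed state is a pair of component states
        arcs[state] = []
        if qs in s.finals and qt in t.finals:
            finals[state] = s.finals[qs] + t.finals[qt]
        for a in s.arcs.get(qs, []):
            for b in t.arcs.get(qt, []):
                # Pair arcs whose middle labels agree: output of S = input of T.
                if a.olabel == b.ilabel:
                    dest = (a.nextstate, b.nextstate)
                    arcs[state].append(
                        Arc(a.ilabel, b.olabel, a.weight + b.weight, dest))
                    if dest not in visited:
                        visited.add(dest)
                        queue.append(dest)
    return Transducer(start, finals, arcs)


# S maps "a" to "b" with weight 1.0; T maps "b" to "c" with weight 2.0.
S = Transducer(0, {1: 0.0}, {0: [Arc("a", "b", 1.0, 1)]})
T = Transducer(0, {1: 0.0}, {0: [Arc("b", "c", 2.0, 1)]})
ST = compose(S, T)
```

Only state pairs reachable from the paired start state are built, so the result can be much smaller than the full product of the two state sets. In the example, the composed transducer maps "a" to "c" with weight 3.0, the extend of the two component weights.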
Using a transducer composition algorithm, weighted automata have proven useful in combining input acoustic observations with a pronunciation dictionary and acoustic, context-dependency and language models for speech recognition applications. See, for example, Mohri, M., F. Pereira, and M. Riley, "Weighted automata in text and speech processing," in ECAI-96 Workshop, Budapest, Hungary, 1996; and Pereira, F., and M. Riley, "Speech Recognition by Composition of Weighted Finite Automata," in Finite-State Language Processing, E. Roche and Y. Schabes, editors, pp. 431-453, MIT Press, Cambridge, Mass., 1997. These papers are hereby incorporated by reference and should be considered as set forth in their entirety herein.
Particular models that can be implemented in transducer form include the language model and dictionary. While it is generally not possible to combine these models into a single network when the models are dynamically changing, the possibility exists for fixed models that a combined network might be realized in advance of use. However, the practicality of such a combined network depends on the optimization methods used. If these methods are not properly chosen in constructing a combined model, the size of the combined network can increase beyond reasonable limits. In particular, fully expanded context-dependent networks including cross-word context-dependent models with a pronunciation dictionary and an n-gram language model have proven to be too large to be stored or used in an efficient speech recognizer for large vocabulary applications.
The limitations of the prior art have been overcome and a technical advance made in accordance with the present invention described in illustrative embodiments below.
In accordance with an aspect of the present invention, networks representing various phases of the automatic speech recognition process are combined and optimized by moving labels and weights in such manner as to permit the merging or collapsing of paths in such network while preserving the desired mappings between input strings and corresponding output strings and weights. Deterministic minimal transducers are realized using a novel method that provides a compact representation of the overall recognition network.
In one illustrative embodiment, transducers reflecting fully expanded context dependency and dictionary and language models are combined in such a practically realizable network. In particular, fully expanded context-dependent phone models for a large vocabulary application may be realized while modestly increasing the model size compared with the corresponding word-level n-gram language model. An improved model of these proportions can be directly processed by a well-known (e.g., Viterbi) decoder without dynamic expansion.
In realizing such a fully expanded model, context dependency constraints are represented by a transducer C, rather than being imbedded in the decoder. Such transducer representation permits variation of alternative context dependencies, and alternative combinations of context dependencies with other models. This flexibility offers increased choices in selection of optimization techniques for weighted automata while avoiding any need for changes to the decoder.
Structuring of illustrative fully expanded networks advantageously employs an efficient transducer composition organization and method. In addition, transducer determinization is advantageously employed in illustrative embodiments to reduce the number of alternative arcs that need be considered during decoding. Moreover, ε-removal for weighted automata is desirably effected in developing combined models. Using present inventive methods and systems, an optimized network is advantageously structured a priori, rather than on the fly or with special knowledge of a particular input string. Such organizational structure provides greater operational efficiencies for a wide range of input speech.
Embodiments of the present invention are used in illustrative problems involving the North American Business News (NAB) speech recognition task to demonstrate the practical realization of speech recognizers.