Speech recognition systems try to determine the semantic meaning of a speech input. Typically, probabilistic finite state speech models are used to determine a sequence of words that best corresponds to the speech input. One common example is an automotive application that allows control of applications within the passenger cabin such as navigational systems, audio systems, climate control, etc. The recognition task in such an application represents a combination of predetermined elements such as user commands and lists of streets in a city, along with dynamically determined elements such as lists of mp3 songs, names from a personal digitized address book (e.g., from a smartphone), and navigational points of interest. For such applications, it is desirable to have compact, memory-efficient search networks with fast, low-latency performance.
The recognition task includes multiple different knowledge levels, from the acoustic forms of fundamental speech sounds known as phonemes, to the phoneme sequences that form words in a recognition vocabulary, to the word sequences that form phrases in a recognition grammar. One powerful and convenient way to organize the recognition task is based on the use of finite state machines such as finite state acceptors (FSA) and finite state transducers (FST). For convenience, the following discussion uses the term FST, not in any particularly rigorous sense but in a generic, inclusive way that also applies in many respects to FSAs and other specific forms of finite state machines. FIG. 1 shows a specific example of one simple FST that parses a set of word level symbols: {AAA, BA, AAB}; see e.g., Jurafsky & Martin, Speech and Language Processing, Prentice Hall 2001, pp. 33-83, ISBN 0-13-095069-6; which is incorporated herein by reference.
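By way of illustration only, a finite state acceptor for the symbol set {AAA, BA, AAB} can be sketched as a small transition table. The state numbering, the names TRANSITIONS, FINALS, and accepts, and the particular topology are assumptions made for this sketch and are not necessarily those of FIG. 1.

```python
# Hypothetical transition table: (state, symbol) -> next state.
# State 0 is the start state and state 3 is the only final state.
TRANSITIONS = {
    (0, "A"): 1,
    (1, "A"): 2,
    (2, "A"): 3,  # completes "AAA"
    (2, "B"): 3,  # completes "AAB"
    (0, "B"): 4,
    (4, "A"): 3,  # completes "BA"
}
FINALS = {3}


def accepts(sequence: str) -> bool:
    """Return True if the acceptor consumes the whole sequence and ends in a final state."""
    state = 0
    for symbol in sequence:
        next_state = TRANSITIONS.get((state, symbol))
        if next_state is None:
            return False  # no arc for this symbol: the sequence is rejected
        state = next_state
    return state in FINALS


if __name__ == "__main__":
    for candidate in ["AAA", "BA", "AAB", "AB", "AA"]:
        print(candidate, accepts(candidate))
```

In this sketch the three sequences share a single final state, and "AAA" and "AAB" share their first two arcs, which is the kind of sharing that keeps a search network compact.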
Implementing speech processing applications on mobile platforms such as smartphones is especially challenging in view of the limited resources that are available. Some optimizations may be implemented to improve speed, accuracy, and/or use of limited memory. Among other things, FSTs allow for a single unified representation of acoustic information (HMM set), word pronunciations, language model, and recognition grammar. Due to this unified representation, computationally inexpensive operations can be performed during traversal of the search network while computationally expensive optimizations can be offloaded to a precompilation stage. The four most widely used optimizations for static compilation of FSTs are determinization, minimization, weight pushing, and label pushing; see e.g., Mohri, Minimization of Sequential Transducers, (5th Annual Symposium on Combinatorial Pattern Matching 1994); incorporated herein by reference.
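As a simplified illustration of one of these optimizations, determinization can be sketched for the unweighted, epsilon-free case using the classical subset construction. This is not the weighted transducer algorithm used in practical FST toolkits; the function name determinize and the example network below are assumptions made for this sketch.

```python
from collections import deque


def determinize(start, finals, arcs):
    """Subset construction for an unweighted, epsilon-free acceptor.

    arcs maps (state, symbol) -> set of next states (nondeterministic).
    Returns (det_arcs, det_finals), where det_arcs maps
    (state, symbol) -> exactly one next state and the start state is 0.
    """
    start_set = frozenset([start])
    ids = {start_set: 0}  # each subset of original states becomes one new state
    det_arcs, det_finals = {}, set()
    queue = deque([start_set])
    while queue:
        subset = queue.popleft()
        if subset & finals:
            det_finals.add(ids[subset])
        # Collect, per symbol, every state reachable from any member of the subset.
        by_symbol = {}
        for (state, symbol), targets in arcs.items():
            if state in subset:
                by_symbol.setdefault(symbol, set()).update(targets)
        for symbol, targets in by_symbol.items():
            target_set = frozenset(targets)
            if target_set not in ids:
                ids[target_set] = len(ids)
                queue.append(target_set)
            det_arcs[(ids[subset], symbol)] = ids[target_set]
    return det_arcs, det_finals


# Hypothetical input: one separate path per sequence (AAA, AAB, BA),
# so state 0 has two outgoing arcs labeled "A".
NFA_ARCS = {
    (0, "A"): {1, 4}, (1, "A"): {2}, (2, "A"): {3},
    (4, "A"): {5}, (5, "B"): {6},
    (0, "B"): {7}, (7, "A"): {8},
}
DET_ARCS, DET_FINALS = determinize(0, {3, 6, 8}, NFA_ARCS)
```

The construction visits every reachable state of the input network, which suggests why such optimizations are usually run once in a precompilation stage rather than during recognition.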
With the exception of determinization, existing FST optimizations are global, i.e., they need to access an FST as a whole, and therefore cannot be applied in speech applications that incrementally modify the search network. Some theoretical aspects of incrementally adding sentences to a word-level FST are described more fully in Crochemore and Giambruno, On-Line Construction Of A Small Automaton For A Finite Set Of Words, The Prague Stringology Conference 2009; incorporated herein by reference.
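The difficulty can be illustrated with a minimal sketch of incremental addition to a trie-shaped acceptor. This is not the construction of Crochemore and Giambruno; the class name IncrementalAcceptor and its methods are assumptions made for this sketch. Each addition touches only the states along one path, so prefixes are shared without a global pass, but equivalent suffix states are never merged, so the network is in general not minimal.

```python
class IncrementalAcceptor:
    """Hypothetical trie-shaped acceptor that grows one sequence at a time."""

    def __init__(self):
        self.arcs = [{}]     # arcs[state][symbol] -> next state; state 0 is the start
        self.finals = set()

    def add_word(self, word):
        """Add one sequence and return the number of newly created states."""
        state, created = 0, 0
        for symbol in word:
            next_state = self.arcs[state].get(symbol)
            if next_state is None:
                # Only the unseen remainder of the sequence needs new states.
                next_state = len(self.arcs)
                self.arcs.append({})
                self.arcs[state][symbol] = next_state
                created += 1
            state = next_state
        self.finals.add(state)
        return created

    def accepts(self, word):
        state = 0
        for symbol in word:
            state = self.arcs[state].get(symbol)
            if state is None:
                return False
        return state in self.finals


net = IncrementalAcceptor()
for predetermined in ["AAA", "BA", "AAB"]:
    net.add_word(predetermined)

# A dynamically determined sequence added later reuses the existing "BA" prefix.
created = net.add_word("BAB")
```

In this sketch the later addition of "BAB" creates a single new state, but the final states reached by "AAA", "AAB", and "BAB" remain distinct even though they are equivalent; merging them would require the kind of whole-network minimization described above.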