This invention relates to speech recognition and more particularly to memory management in a speech recognition search network.
Speech recognition involves searching and comparing the input speech to speech models representing vocabulary to identify words and sentences. Continuous speech recognition is resource-intensive. Commercial dictation software requires more than 10 Mbytes of disk space to install and 32 Mbytes of RAM to run. Static memory is required to store the program (algorithm), grammar, dictionary, and acoustic models. These data do not change and can therefore be stored on disk or in ROM. Dynamic memory is required to run the search. The search involves parsing the input speech and building a dynamically changing search tree, so RAM is required for both read and write capabilities.
Most fast search algorithms involve multiple passes of search: simple models (e.g. monophones) are used to do a quick rough search and output a much smaller N-best sub-space; then detailed models (e.g. clustered triphones with mixtures) are used to search that sub-space and output the final results (see Fil Alleva et al., "An Improved Search Algorithm Using Incremental Knowledge for Continuous Speech Recognition," ICASSP 1993, Vol. 2, 307-310; Long Nguyen et al., "Search Algorithms for Software-Only Real-Time Recognition with Very Large Vocabulary," ICASSP; and Hy Murveit et al., "Progressive-Search Algorithms for Large Vocabulary Speech Recognition," ICASSP). The first pass, which uses monophones to reduce the search space, introduces error, so the reduced search space has to be large enough to contain the best path. This process requires a lot of experimentation and fine-tuning.
The search process involves expanding a search tree according to the grammar and lexical constraints. The size of the search tree and the storage requirements grow exponentially with the size of the vocabulary. Viterbi beam search is used to prune away improbable branches of the tree; however, the tree is still very large for large vocabulary tasks.
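The pruning step of a beam search can be illustrated with a minimal sketch. The function name `beam_prune`, the use of log scores, the state labels, and the beam width are assumptions chosen for illustration only; they are not taken from any particular recognizer.

```python
def beam_prune(hypotheses, beam_width):
    """Keep only hypotheses whose log score is within beam_width of the best.

    hypotheses maps a (hypothetical) state label to its log score;
    higher scores are better.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    return {state: score for state, score in hypotheses.items()
            if score >= best - beam_width}


# Three active branches; the last one is far below the best score.
active = {"b-ey-k": -12.0, "b-ey-k-t": -13.5, "t-ey-k": -40.0}
survivors = beam_prune(active, beam_width=10.0)
```

Even after such pruning, every surviving branch still occupies storage, which is why the tree remains large for large vocabulary tasks.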
A multi-pass algorithm is often used to speed up the search. Simple models (e.g. monophones) are used to do a quick rough search and output a much smaller N-best sub-space. Because there are very few models, the search can be done much faster. However, the accuracy of these simple models is not good enough, so a sufficiently large N-best sub-space has to be preserved for the following stages of search with more detailed models.
Another approach is to use a lexical tree to maximize the sharing of evaluation. See Mosur Ravishankar, "Efficient Algorithms for Speech Recognition," Ph.D. thesis, CMU-CS-96-143, 1996. Also see Julian Odell, "The Use of Context in Large Vocabulary Speech Recognition," Ph.D. thesis, Queens' College, Cambridge University, 1995. For example, suppose both bake and baked are allowed in a certain grammar node; much of their evaluation can be shared because both words start with the phone sequence /b/ /ey/ /k/. If monophones are used in the first pass of search, then no matter how large the vocabulary is, there are only about 50 English phones the search can start with. This principle is called a lexical tree because the sharing of initial evaluation, followed by fanning out only where the phones differ, looks like a tree structure. The effect of a lexical tree can be achieved by removing the word level of the grammar and then canonicalizing (removing redundancy from) the phone network. For example:
% more simple.cfg
start(<S>).
<S> ---> bake | baked.
bake ---> b ey k.
baked ---> b ey k t.
% cfg_merge simple.cfg | rg_from_rgdag | rg_canonicalize
start(<S>).
<S> ---> b, Z_1.
Z_1 ---> ey, Z_2.
Z_2 ---> k, Z_3.
Z_3 ---> t, Z_4.
Z_3 ---> "".
Z_4 ---> "".
The original grammar has two levels: a sentence grammar in terms of words, and a pronunciation grammar (lexicon) in terms of phones. After the word level is removed and the resulting one-level phone network is canonicalized, identical initial phone sequences are automatically shared. The recognizer outputs a phone sequence as the recognition result, which can be parsed (text only) to recover the words. Text parsing takes virtually no time compared to speech recognition parsing.
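The sharing of initial phones and the text-only parse of the recognized phone sequence can be sketched with a simple prefix tree. This is an illustrative sketch only: it uses nested dictionaries rather than the grammar tools shown above, and the names `build_lexical_tree`, `count_arcs`, and `phones_to_word` are hypothetical.

```python
def build_lexical_tree(lexicon):
    """Build a phone-prefix tree as nested dicts; the key "" marks a word end."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            # Reuse the existing branch when the phone prefix is shared.
            node = node.setdefault(phone, {})
        node[""] = word
    return root


def count_arcs(node):
    """Count phone arcs in the tree (word-end markers are not arcs)."""
    return sum(1 + count_arcs(child)
               for phone, child in node.items() if phone != "")


def phones_to_word(tree, phones):
    """Text-only parse: follow the phone sequence and return the word, or None."""
    node = tree
    for phone in phones:
        node = node.get(phone)
        if node is None:
            return None
    return node.get("")


lexicon = {"bake": ["b", "ey", "k"], "baked": ["b", "ey", "k", "t"]}
tree = build_lexical_tree(lexicon)
shared_arcs = count_arcs(tree)                               # arcs after sharing
unshared_arcs = sum(len(p) for p in lexicon.values())        # arcs without sharing
word = phones_to_word(tree, ["b", "ey", "k", "t"])
```

For this two-word lexicon the tree needs four phone arcs instead of seven, mirroring the canonicalized network Z_1 through Z_4 above, and the lookup that recovers the word involves no acoustic evaluation at all.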
It is desirable to provide a method that speeds up the search and reduces the resulting search space without introducing error, and that can be used independently of multi-pass search or a lexical tree.
In accordance with one embodiment of the present invention, a method of memory management includes, while expanding a search tree, removing slots in the storage space that have bad scores and reusing that memory for later slots that have better scores and are more likely to match the input speech. The slots contain a last-time field with a first bit used for slot allocation and test and a second bit for backtrace update.
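The general idea of reusing the storage of poorly scoring slots can be sketched as a fixed-size pool. Everything in this sketch is an assumption made for illustration: the class name `SlotPool`, the eviction rule (replace the worst slot only if the newcomer scores better), and in particular the placement of the two flag bits in the high bits of the last-time word. It is not the claimed implementation, and the backtrace bit is only reserved in the layout, not exercised.

```python
IN_USE_BIT = 1 << 31      # assumed position of the allocation/test flag
BACKTRACE_BIT = 1 << 30   # assumed position of the backtrace flag (reserved only)
TIME_MASK = BACKTRACE_BIT - 1   # remaining low bits hold the frame index


class SlotPool:
    """Fixed-capacity storage for search hypotheses (illustrative sketch)."""

    def __init__(self, capacity):
        self.score = [float("-inf")] * capacity
        self.last_time = [0] * capacity   # frame index plus the two flag bits

    def allocate(self, score, frame):
        """Return a slot index for a new hypothesis, or None if it is rejected."""
        free = [i for i, t in enumerate(self.last_time)
                if not (t & IN_USE_BIT)]
        if free:
            slot = free[0]
        else:
            # Pool is full: consider reusing the slot with the worst score.
            slot = min(range(len(self.score)), key=self.score.__getitem__)
            if self.score[slot] >= score:
                return None   # newcomer is no better than anything stored
        self.score[slot] = score
        self.last_time[slot] = IN_USE_BIT | (frame & TIME_MASK)
        return slot

    def frame_of(self, slot):
        """Return the frame index stored in the slot, without the flag bits."""
        return self.last_time[slot] & TIME_MASK


pool = SlotPool(capacity=2)
a = pool.allocate(-5.0, frame=1)
b = pool.allocate(-9.0, frame=1)
c = pool.allocate(-7.0, frame=2)    # pool is full: reuses the slot holding -9.0
d = pool.allocate(-20.0, frame=2)   # worse than every stored score: rejected
```

Packing the flags into the same word as the time stamp keeps the per-slot overhead at a single field, which matters when the number of slots is the dominant memory cost.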