The invention relates to automatic speech recognition and more particularly to a method and apparatus for continuous speech recognition using a self-adjusting decoder network having multiple layers.
Two decoder techniques are commonly known in the speech recognition field. The first of these is the best-first stack decoder. This technique accommodates long-span language models, multi-word grammars and increased vocabulary size. However, the best-first stack decoder requires good search heuristics to estimate the least upper bound of the speech recognition path score so that search errors can be avoided and complexity reduced. The second technique is the breadth-first beam decoder. The breadth-first beam search does not require heuristics, and the search can be made frame synchronous with the incoming speech data frames. However, the breadth-first beam search decoder requires considerable circuit resources to support the large number of active nodes that are typically created and maintained during a beam search.
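By way of illustration only, a frame-synchronous breadth-first beam search of the kind described above might be sketched as follows. The function names, the dictionary-based state graph and the log-likelihood inputs are assumptions made for this sketch and are not taken from any particular prior-art decoder. The returned peak_active count indicates how the number of active nodes, and hence the resources required, grows with the beam width.

```python
NEG_INF = float("-inf")

def beam_prune(active, beam_width):
    """Keep only hypotheses whose log score is within beam_width of the best."""
    if not active:
        return {}
    best = max(active.values())
    return {state: s for state, s in active.items() if s >= best - beam_width}

def beam_search(frame_scores, transitions, start, beam_width):
    """Frame-synchronous beam search over a hypothetical state graph.

    frame_scores: one dict per frame, mapping state -> log likelihood.
    transitions:  dict mapping state -> list of successor states
                  (self-loops must be listed explicitly).
    Returns (best_state, best_score, peak_active).
    """
    active = {start: 0.0}
    peak_active = 1
    for scores in frame_scores:
        expanded = {}
        for state, score in active.items():
            for nxt in transitions.get(state, []):
                cand = score + scores.get(nxt, NEG_INF)
                # Keep the best-scoring path into each successor state.
                if cand > expanded.get(nxt, NEG_INF):
                    expanded[nxt] = cand
        active = beam_prune(expanded, beam_width)
        if not active:
            return None, NEG_INF, peak_active  # every path was ruled out
        peak_active = max(peak_active, len(active))
    best_state = max(active, key=active.get)
    return best_state, active[best_state], peak_active
```

A narrow beam keeps few nodes alive per frame but risks pruning the correct path; a wide beam avoids such search errors at the cost of maintaining more active nodes.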
Thus, there is a need in the art for a speech recognition decoder that combines the resource advantages of the best-first stack search decoder and the accuracy advantages of the breadth-first beam search decoder into a new search decoder.
Briefly stated, in accordance with one aspect of the invention, the aforementioned need is met by providing a continuous speech recognizer that is based on three dynamically expanded networks, each having a self-adjusting capability.
In accordance with one aspect of the invention, the aforementioned need is met by a system for recognizing speech which includes a converter for converting input speech into frames of speech data. The speech data is input to a dynamic programming network which receives the frames of speech data and builds nodes that represent likelihood scores of various pre-defined models corresponding to the speech data of the respective frame. An asynchronous phone expanding network operates in parallel with said dynamic programming network, and provides phone rules that control which nodes of said dynamic programming network can be connected by arcs to other nodes dependent upon said speech data. It should be noted that the word 'phone' in this application is taken directly from the Greek word 'phone', which means sound and/or speech. Additionally, an asynchronous word network operates in parallel with the phone network and the dynamic programming network to provide word rules that control which portions of the phone network correspond to recognizable words and which do not. The dynamic programming network, the phone network and the word network cooperate to process the speech data frames to recognize the input speech.
In accordance with another aspect of the invention, the aforementioned need is met by providing a speech recognition system including: a converter for converting input speech into frames of speech data; a dynamic programming process that establishes a plurality of nodes in response to the frames of speech data and arc paths connecting to others of the plurality of nodes, thereby forming a speech decoder network; a phone rule driven process that applies predetermined phone rules for the speech decoder network to establish a phone network and increase the accuracy of an output of the speech recognition system; and a word rule driven process that applies predetermined word rules for the speech decoder network and the phone network to further increase the accuracy of the output of the speech recognition system.
In accordance with another aspect of the invention, the aforementioned need is met by providing a decoder for continuous speech recognition using a processor and a memory having a plurality of memory locations. The decoder has a speech framer for regularly processing input speech into consecutive frames of acoustic data. Connected to the output of the speech framer are a word network process for storing and applying language rules, a phone network process for storing and applying phone rules, and a dynamic programming network process. The dynamic programming network process processes the acoustic data to build a network of nodes connected by arcs which provide possible decodings of the input speech. The dynamic programming network process also uses information from the word network process and the phone network process to direct the building of the nodes and the connection of each node to previous nodes by arcs.
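The cooperation of the three processes may be pictured with the following minimal sketch, which is an illustration under stated assumptions and not the decoder of the invention. The toy lexicon, the word rule table, the Node fields and the function names are all hypothetical, and the self-adjusting and asynchronous behavior of the networks is not modeled. Per frame, a dynamic programming step builds nodes; a phone rule decides whether a node may stay in its phone or advance to the next phone of its word; and a word rule decides which words may begin once a word's last phone has been reached.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

NEG_INF = float("-inf")

# Hypothetical toy lexicon: word -> phone sequence.
LEXICON: Dict[str, List[str]] = {"no": ["n", "ow"], "on": ["aa", "n"]}
# Hypothetical word rules: words allowed after a given word (None = start).
WORD_RULES: Dict[Optional[str], List[str]] = {
    None: ["no", "on"], "no": ["on"], "on": ["no"],
}

@dataclass
class Node:
    frame: int
    word: str
    pos: int                  # index of the current phone within the word
    score: float              # accumulated log likelihood
    prev: Optional["Node"]    # arc back to the predecessor node
    starts_word: bool         # True if this node enters a new word

def successors(word, pos):
    """Return (word, pos, starts_word) states reachable from (word, pos)."""
    out = [(word, pos, False)]                     # remain in the same phone
    if pos + 1 < len(LEXICON[word]):
        out.append((word, pos + 1, False))         # phone rule: next phone
    else:
        # Word rule: at the last phone, permitted next words may begin.
        out.extend((w, 0, True) for w in WORD_RULES.get(word, []))
    return out

def decode(frames):
    """frames: one dict per frame, mapping phone -> log likelihood.
    Returns (word_sequence, score) for the best complete path."""
    active = {}
    for w in WORD_RULES[None]:
        s = frames[0].get(LEXICON[w][0], NEG_INF)
        if s > NEG_INF:
            active[(w, 0)] = Node(0, w, 0, s, None, True)
    for t in range(1, len(frames)):
        new = {}
        for (w, p), node in active.items():
            for w2, p2, starts in successors(w, p):
                s = node.score + frames[t].get(LEXICON[w2][p2], NEG_INF)
                better = (w2, p2) not in new or s > new[(w2, p2)].score
                if s > NEG_INF and better:
                    new[(w2, p2)] = Node(t, w2, p2, s, node, starts)
        active = new
    # Only nodes sitting on the last phone of a word form complete decodings.
    finals = [n for (w, p), n in active.items() if p == len(LEXICON[w]) - 1]
    if not finals:
        return [], NEG_INF
    best = max(finals, key=lambda n: n.score)
    words, n = [], best
    while n is not None:       # follow arcs back to recover the word sequence
        if n.starts_word:
            words.append(n.word)
        n = n.prev
    return words[::-1], best.score
```

In this sketch the phone and word rules are consulted only when a node is expanded, so nodes and arcs that the rules do not permit are never built.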