The present invention relates to speech recognition systems and, more particularly, to a method and apparatus for recognizing speech using a general-purpose shared-memory multiprocessor machine.
Speech recognizers, also known as speech-to-text systems or automatic speech recognition (ASR) systems, identify words and produce a textual representation of a received speech signal. To accomplish this, typical speech recognizers break down human speech into several distinct layers. A phoneme, for example, is the smallest unit of speech that differentiates utterances in a given language or dialect. However, a single phoneme may be pronounced differently depending on how it is used in a word or depending on the speaker. A context-dependent unit is an acoustic realization of a phoneme as manifested in a particular context. These units combine to form words, which in turn combine to form sentences, thereby creating the basic structure of human speech. A language model maps these basic speech sounds into sentences.
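The layering described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a context-dependent unit is a triphone written as left-center+right, and that word boundaries are padded with a silence symbol. The function name, the `sil` boundary symbol, and the two-word lexicon are hypothetical and are not taken from the present description.

```python
def to_context_dependent_units(phonemes, boundary="sil"):
    """Expand a phoneme sequence into context-dependent units.

    Assumption: each unit is a triphone "left-center+right", where the
    neighbors of the first and last phoneme are a boundary symbol.
    """
    padded = [boundary] + list(phonemes) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# Hypothetical lexicon mapping words to phoneme sequences.
lexicon = {
    "cat": ["k", "ae", "t"],
    "at": ["ae", "t"],
}

# Word -> sequence of context-dependent units.
units = {word: to_context_dependent_units(p) for word, p in lexicon.items()}

for word, seq in units.items():
    print(word, seq)
```

In this sketch the phoneme "ae" yields the unit "k-ae+t" in one word and "sil-ae+t" in the other, which mirrors the point that one phoneme may be realized differently depending on its context.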
A typical speech recognizer includes computer hardware and software that identify a spoken speech signal and evaluate the signal with respect to a language model to obtain a textual representation of what the speaker said. One type of speech recognizer is an isolated word recognition system, which requires a speaker to pause after each spoken word so that the recognizer can identify each word in isolation. However, the required pauses reduce the rate at which speech can be input and processed, and using such a system is unnatural to the speaker. Another type of speech recognizer is a continuous speech recognition system, which allows a user to speak normally with no pauses between words. A continuous speech system allows a more natural speech flow, but because it is more difficult to distinguish where a particular word ends and where the next word begins, a continuous speech recognition system and the algorithm running on this type of system are complex.
A language model and a speech signal are input into a recognizer. A language model consists of, for example, one or more models of context-dependent units having probability distributions associated therewith, models that map context-dependent units to words, and models that map words to sentences. The speech signal is partitioned into a plurality of speech frames, each of which may contain a portion of a phone or a complete phone. Each frame is evaluated with respect to a subset of the context-dependent phone models. The results of this process are then used to progress through the higher levels of the language model. This process continues until the recognizer has processed all the speech frames in an utterance. Because of the number of calculations, the associated complex processing, and the need to run in a real-time environment, existing speech recognizers are limited to isolated word recognition or sacrifice accuracy to obtain real-time performance. In addition, current speech recognizers have models that are hard-coded into the system, making speech recognition possible for only limited vocabularies.
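The frame-by-frame evaluation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the recognizer of the present description: each unit model is reduced to a single one-dimensional Gaussian, the feature extracted from a frame is simply its mean amplitude, and the subset of active models is fixed by hand. All names, model parameters, and sample values are hypothetical.

```python
import math


def split_into_frames(samples, frame_len, hop):
    """Partition a sample sequence into fixed-length frames."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian, standing in for a unit's distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def score_frame(frame, models, active_units):
    """Score one frame against only the active subset of unit models."""
    feature = sum(frame) / len(frame)  # toy feature: mean amplitude
    return {u: log_gaussian(feature, *models[u]) for u in active_units}


# Hypothetical context-dependent unit models: name -> (mean, variance).
models = {"k-ae+t": (0.0, 1.0), "ae-t+sil": (2.0, 1.0), "sil": (5.0, 1.0)}

# Toy "speech signal" and its partition into two non-overlapping frames.
samples = [0.1, -0.1, 0.0, 0.2, 1.9, 2.1, 2.0, 1.8]
frames = split_into_frames(samples, frame_len=4, hop=4)

# Evaluate every frame against a subset of the models and keep the best unit.
active = ["k-ae+t", "ae-t+sil"]
best_per_frame = []
for frame in frames:
    scores = score_frame(frame, models, active)
    best_per_frame.append(max(scores, key=scores.get))

print(best_per_frame)
```

A full recognizer would carry these per-frame scores upward through the word and sentence levels of the language model rather than picking the best unit for each frame independently; the sketch stops at the lowest level to keep the per-frame cost visible.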
Special-purpose machines allow speech recognizers to achieve real-time or near real-time processing capability. Some of these machines have been designed specifically to exploit parallelism for speech recognition. An example is described in K. A. Wen and J. F. Wang, "Efficient computing methods for parallel processing: An implementation of the Viterbi algorithm," Computers Math. Applic., 17 (12) 1989, pages 1511-1521. However, these machines are not suitable for recognition of large-vocabulary continuous speech because they do not have the necessary generality to accommodate large vocabularies. One drawback of these special-purpose machines is that they are hard-coded with a particular language model and therefore can be used only for a particular recognition task. Another disadvantage is that they are designed only for isolated word recognition and are not suitable for continuous speech recognition. Moreover, none of these systems has the flexibility to receive, as an input, a language model composed of a number of layers that are combined on-the-fly or implicitly during recognition. Therefore, none of these special-purpose machines can be used for general-purpose recognition of large-vocabulary continuous speech. In addition, special-purpose machines are prohibitively expensive, and their development is usually limited to large corporations, making them virtually inaccessible to the general public.
With the advancements in commercially available multiprocessor systems, there is an opportunity to develop a continuous speech recognition system that uses a general-purpose shared-memory multiprocessor machine to perform continuous parallel speech recognition. There is also a need for a parallel speech recognizer that is capable of receiving a language model as an input, so that much larger vocabularies as well as complex speech patterns can use the same underlying programming algorithm used for standard speech recognition tasks without requiring hard coding of a particular model.