The invention relates to an automatic speech recognizer which uses multiple processing stages to determine the words contained in a spoken utterance.
Real-time speech recognition can be implemented on a variety of types of computers. An implementation of a speech recognizer, in general, uses a digital signal processor, a general purpose processor, or both. Typical digital signal processors (DSPs, such as the Texas Instruments TMS320C31) are suited for computationally intensive tasks, such as signal processing, and for low latency processing. However, the memory available to a DSP is generally limited, in part because of the cost of memory devices that allow the DSP to execute at full speed (i.e., without memory wait states). General purpose processors (such as the Intel Pentium) can, in general, support more memory, which is generally less costly than DSP memory, but these processors are not tailored to signal processing tasks.
A speech recognition algorithm implemented on a DSP based computer, in general, has a vocabulary size and linguistic complexity that is limited by memory resources associated with the DSP. More complex speech recognition algorithms, for example supporting larger vocabularies, have been implemented using computers based on general purpose processors, as have "N-best" algorithms that produce multiple alternative hypotheses, rather than a single best hypothesis, of what was said.
A speech recognition algorithm that is implemented using both a DSP and a general purpose processor often relies on the DSP to perform signal processing tasks, for example computing spectral features at regular time intervals. These spectral features, such as linear predictive coefficients, cepstra, or vector quantized features, are then passed from the DSP to the general purpose processor for further stages of speech recognition.
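By way of illustration only, the computation of spectral features at regular time intervals can be sketched as follows. The sketch is not the front end of any particular product: the 10 ms non-overlapping frame length, the naive DFT, and the function names are assumptions introduced for the example, and a magnitude spectrum stands in for the linear predictive, cepstral, or vector quantized features named above.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000  # 8 kHz telephone-band sampling rate
FRAME_LEN = 80         # assumed: 10 ms non-overlapping frames (not from the source)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 of one frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


def spectral_features(samples, frame_len=FRAME_LEN):
    """Return one magnitude spectrum per complete, non-overlapping frame."""
    return [magnitude_spectrum(samples[start:start + frame_len])
            for start in range(0, len(samples) - frame_len + 1, frame_len)]


# Usage: a 1 kHz test tone lasting 30 ms yields three frames.
tone = [math.cos(2 * math.pi * 1000 * i / SAMPLE_RATE_HZ) for i in range(240)]
features = spectral_features(tone)
# Index of the strongest bin in each frame (bin spacing is 100 Hz here).
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in features]
```

In such an arrangement, each per-frame feature vector would be what is passed from the DSP to the general purpose processor.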
Speech recognition has been applied to telephone based input. PureSpeech Inc. has previously released a software product, Recite 1.2, that recognizes utterances spoken by telephone callers. A computer architecture on which this product can be executed is shown in FIG. 1. Computer 100 is used to interact by voice with callers over multiple telephone lines 110. Computer 100 automatically recognizes what the callers say, and can play prompts to interact with the callers. Computer 100 includes one or more telephone interfaces 130 coupled to a general purpose computer 120, such as a single-board computer, over a data bus 125. General purpose computer 120 includes a general purpose processor 122, working memory 124, such as dynamic RAM, and non-volatile program memory 126, such as a magnetic disk. Alternatively, program memory can reside on another computer and be accessed over a data network. Telephone interfaces 130 provide an interface to telephone lines 110 over which callers interact with the computer. Also coupled to general purpose computer 120 over data bus 125 are one or more DSP platforms 140. DSP platforms 140 are coupled to telephone interfaces 130 over a second bus, time division multiplexed (TDM) bus 150. TDM bus 150 can carry digitized speech between DSP platforms 140 and telephone interfaces 130. Each DSP platform 140 includes multiple DSP processors 142, working memory 144, a data bus interface 146 to data bus 125, and a speech interface 148 to TDM bus 150. In one version of the Recite 1.2 product, general purpose processor 122 is an Intel Pentium, data bus 125 is an ISA bus, DSP platform 140 is an Antares DSP platform (model 2000/30, 2000/50, or 6000) manufactured by Dialogic Corporation, and TDM bus 150 is a SCSA bus which carries telephone signals encoded as 8-bit speech samples sampled at an 8 kHz sampling rate. Each Antares DSP platform includes four DSP processors 142, TMS320C31 processors manufactured by Texas Instruments.
Working memory 144 includes 512 KB of static RAM per DSP and 4 MB of dynamic RAM shared by the four DSP processors 142. Telephone interfaces 130 are any of several interfaces also manufactured by Dialogic Corporation, including models D41ESC, D160SC, and D112SC. For instance, each D112SC interface supports twelve analog telephone lines 110.
PureSpeech Inc.'s Recite 1.2 product incorporates a speech recognition approach related to that described in U.S. Pat. No. 5,638,487, "AUTOMATIC SPEECH RECOGNITION", (the '487 patent) which is incorporated herein by reference. In that implementation, each DSP processor on the DSP platforms is associated with exactly one telephone channel. A DSP associated with a particular telephone channel hosts initial stages of the recognition approach that are shown in FIG. 3 of the '487 patent. In addition, an echo canceler stage is included on the DSP prior to the spectral analyzer in order to reduce the effect of an outbound prompt on an inbound utterance. The DSP is essentially dedicated to the single task (process) of accepting input received from the TDM bus, processing it, and passing it to the general purpose computer. The output of the phonetic classifier is sent to the general purpose computer where a sentence level matcher is implemented. The sentence level matcher can provide multiple sentence hypotheses corresponding to likely utterances spoken by a talker.
In many speech based telephone applications, a caller is talking for a relatively small fraction of the time of a telephone call. The remainder of the time is consumed by playing prompts or other information to the caller, or by quiet intervals, for example while information is being retrieved for the caller. In the Recite 1.2 software product, one DSP is allocated for each telephone interaction, regardless of whether a caller is talking, or a prompt or information is being played. This is necessary, for example, as a caller may begin speaking before a prompt has completed. Therefore, in order to support 12 concurrent telephone conversations, three Antares DSP platforms with four DSPs each are needed to host the initial stages of the recognition approach.
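The platform count stated above follows from simple arithmetic, sketched here for illustration. The function name and the calls_per_dsp parameter are hypothetical; the sharing ratio in the second call is an assumed figure chosen only to show how the count would change if a DSP were not dedicated to a single call.

```python
import math


def platforms_needed(concurrent_calls, dsps_per_platform=4, calls_per_dsp=1):
    """Number of DSP platforms required to host the given number of calls."""
    dsps = math.ceil(concurrent_calls / calls_per_dsp)
    return math.ceil(dsps / dsps_per_platform)


# One DSP dedicated to each call, four DSPs per platform.
dedicated = platforms_needed(12)

# Hypothetical: three calls sharing each DSP (an assumed ratio, not from the source).
shared = platforms_needed(12, calls_per_dsp=3)
```

With one DSP per call, twelve calls require three four-DSP platforms; under the assumed three-to-one sharing ratio, a single platform would suffice.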
Speech recognition approaches have been adapted to large vocabularies, such as lists of names in the range of 1000 to 10000 names. One aspect of recognition approaches used to achieve adequate accuracy on such large vocabularies is that a large number of subword model parameters, or a large number of subword models themselves, is typically used. A phonetic classifier is hosted on the DSP in the Recite 1.2 software. As the static RAM is used for storage related to the subword models, and the amount of static RAM available to each DSP is limited, the number of subword models and their parameters is limited. This memory limitation can impact accuracy on some large vocabulary tasks.
In one aspect, in general, the invention is software stored on a computer readable medium for causing a multiprocessor computer to perform the function of recognizing an utterance spoken by a speaker. The software includes software for causing a first processor, such as a DSP processor, to perform the function of computing a series of segments associated with the utterance, each segment having a time interval within the utterance, and scores characterizing the degree of match of the utterance in that time interval with a first set of subword units, and sending the series of segments to a second processor. The software also includes software for causing the second processor, such as a general purpose processor, to perform the functions of receiving the series of segments, determining multiple word sequence hypotheses associated with the utterance, and computing scores for the word sequence hypotheses, using a second set of subword units to represent words in the word sequence hypotheses. The first set of subword units can be a set of phonemes, and the second set of subword units can be a set of context dependent phonemes.
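For illustration only, the series of segments and the two sets of subword units might be represented as in the following sketch. The Segment field names, the example phonemes and scores, the "left-center+right" naming of context dependent phonemes, and the "sil" boundary symbol are all assumptions made for the example. Picking the top-scoring phoneme per segment is a simplification used only to obtain a phoneme string; it is not presented as the described method for determining word sequence hypotheses.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One segment sent from the first processor to the second (hypothetical layout)."""
    start_ms: int
    end_ms: int
    scores: Dict[str, float]  # phoneme -> match score; higher is better (assumed)


def context_dependent_units(phonemes: List[str], boundary: str = "sil") -> List[str]:
    """Map a phoneme string to context dependent phonemes written as left-center+right."""
    padded = [boundary] + phonemes + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


# Made-up segments scored against a first set of units (phonemes).
segments = [
    Segment(0, 80, {"k": -1.2, "g": -2.5}),
    Segment(80, 210, {"ae": -0.7, "eh": -1.9}),
    Segment(210, 300, {"t": -0.9, "d": -1.4}),
]

# Simplification: take the best-scoring phoneme in each segment.
best_phonemes = [max(s.scores, key=s.scores.get) for s in segments]

# The same string expressed in a second set of units (context dependent phonemes).
units = context_dependent_units(best_phonemes)
```

The point of the sketch is that the first set of units is small and fixed, whereas the second set grows with the number of phonetic contexts, which is consistent with hosting the second stage where more memory is available.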
In another aspect, in general, the invention is a method for recognizing the words in a spoken utterance. The method includes accepting data for the spoken utterance and forming a series of segments associated with the utterance. Each segment has a time interval within the utterance, and scores characterizing the degree of match of the utterance in that time interval with a set of subword units. Based on the series of segments, the method includes determining a set of word sequence hypotheses associated with the utterance and computing scores for the set of word sequence hypotheses using a second set of subword units to represent words in the word sequence hypotheses.
The invention can include one or more of the following features.
Computing scores for the multiple word sequence hypotheses can include forming a graph representation from the word sequence hypotheses, wherein the graph representation includes representations of words using the second set of subword units, and then computing scores for paths through this graph representation.
Determining the multiple word sequence hypotheses can include determining a word graph representation wherein each of the word sequence hypotheses is associated with a path through the graph representation.
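The two graph-based features above can be illustrated with a minimal sketch in which each path through a word graph is one word sequence hypothesis, and each path is scored by summing scores of the subword units that represent its words. The graph, the words, the unit labels, and the log scores are all made-up placeholders; in an actual recognizer the unit scores would be computed from the utterance rather than looked up in a fixed table.

```python
# Hypothetical word graph: node -> list of (next_node, word).
WORD_GRAPH = {
    0: [(1, "call")],
    1: [(2, "john"), (2, "joan")],
    2: [],
}
FINAL_NODE = 2

# Hypothetical representation of each word in a second set of subword units,
# paired with placeholder log scores (not derived from any real audio).
UNIT_SCORES = {
    "call": [("sil-k+ao", -0.5), ("k-ao+l", -0.25), ("ao-l+jh", -0.25)],
    "john": [("l-jh+aa", -0.5), ("jh-aa+n", -0.25), ("aa-n+sil", -0.25)],
    "joan": [("l-jh+ow", -0.5), ("jh-ow+n", -1.0), ("ow-n+sil", -0.25)],
}


def word_sequences(graph, node=0, final=FINAL_NODE):
    """Enumerate the word sequence along every path from node to final."""
    if node == final:
        return [[]]
    return [[word] + tail
            for nxt, word in graph[node]
            for tail in word_sequences(graph, nxt, final)]


def path_score(words):
    """Sum the unit log scores along the path that spells out the given words."""
    return sum(score for w in words for _, score in UNIT_SCORES[w])


hypotheses = word_sequences(WORD_GRAPH)
# Highest (least negative) log score first.
scored = sorted(((path_score(h), h) for h in hypotheses), reverse=True)
best_score, best_words = scored[0]
```

Exhaustive enumeration is used here for clarity; for larger graphs, a dynamic programming search over the graph would avoid listing every path explicitly.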
An advantage of the invention is that a multistage speech recognizer can be implemented in part on a DSP processor and in part on a general purpose processor. Multiple channels can be processed by one DSP by taking advantage of the fact that a caller is speaking for only a fraction of the time of a call. By sharing a preliminary recognition stage for all the channels serviced by one DSP, memory requirements for that DSP are reduced compared to having a separate preliminary recognizer for each channel. Furthermore, by sharing the preliminary recognizer on an utterance-by-utterance basis, inefficiencies introduced by context switching can be reduced.
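Sharing a preliminary recognizer on an utterance-by-utterance basis can be pictured with the following sketch, in which utterances from several channels are queued and each is processed to completion before the next is started. The class and method names are hypothetical, the frames are placeholder strings, and real-time concerns such as buffering audio that arrives while the recognizer is busy are deliberately omitted.

```python
from collections import deque


class SharedPreliminaryRecognizer:
    """One recognizer instance serving several channels (hypothetical sketch)."""

    def __init__(self):
        self.pending = deque()  # queued (channel, frames) utterances
        self.log = []           # order in which frames were processed

    def submit(self, channel, frames):
        """Queue a complete utterance from one channel."""
        self.pending.append((channel, frames))

    def run(self):
        """Process queued utterances one at a time, never switching mid-utterance."""
        while self.pending:
            channel, frames = self.pending.popleft()
            for frame in frames:
                # Placeholder for the preliminary recognition work on one frame.
                self.log.append((channel, frame))


recognizer = SharedPreliminaryRecognizer()
recognizer.submit(1, ["f0", "f1"])
recognizer.submit(2, ["f0"])
recognizer.submit(1, ["f2"])
recognizer.run()
```

Because the recognizer state is only handed over at utterance boundaries, a single set of working storage serves every channel, which is the memory saving the paragraph above describes.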
Another advantage of the invention is that computation on the DSP can use a limited amount of memory that does not depend on the size of a vocabulary being recognized. Communication of segmental information between the DSP processor and the general purpose processor allows a set of word sequence hypotheses to be computed efficiently on the general purpose computer. By using a set of phonetically-based rules in determining the possible pronunciations of allowable word sequences, the correct word sequence is included with a high probability in the set of word sequence hypotheses that is computed. High accuracy for the top choice of word sequences is then obtained by rescoring these word sequence hypotheses on the general purpose processor, for example using a hidden Markov model (HMM) based recognition approach. This multistage recognizer allows a large number of recognition channels to be processed concurrently using one or more DSP processors attached to the general purpose processor, while achieving high recognition accuracy.
Other features and advantages of the invention will be apparent from the following description, and from the claims.