Current speech recognition technology involves the recognition of patterns in acoustic data and the correlation of those patterns with a predetermined set of dictionary entries that are recognized by the system. The speech recognition problem is an extremely challenging one because there are so many different variants. In general, the speech recognizer applies the incoming acoustic data in digital form to a mathematical recognition process that converts the digital data into parameters based on a predefined model.
Conventionally, the model is one that has been previously trained using a sufficiently large training set so that individual speaker variances are largely abated. The model-based recognition process segments the incoming data into elemental components, such as phonemes, that are then labeled by comparison with the trained model. In one form of recognizer, once individual phonemes have been labeled, the phoneme data are compared with prestored words in the system dictionary. This comparison is performed through an alignment process that will accommodate imprecise matches due to incorrect phoneme recognition as well as insertion and deletion of phonemes within a given sequence. The system works on the basis of probabilities. Conventionally, the speech recognizer will select the most probable word candidate that results from the previously described segmentation, labeling and alignment process.
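The alignment step described above can be illustrated with a minimal sketch. It assumes a plain edit-distance alignment with unit costs standing in for the probabilities a real recognizer would use; the function names, the phoneme symbols and the three-word dictionary are hypothetical and serve only as an example.

```python
def edit_distance(observed, reference):
    """Minimum number of substitutions, insertions and deletions
    needed to turn `observed` into `reference` (unit costs, an assumption)."""
    prev = list(range(len(reference) + 1))
    for i, obs in enumerate(observed, start=1):
        curr = [i]
        for j, ref in enumerate(reference, start=1):
            cost = 0 if obs == ref else 1
            curr.append(min(prev[j] + 1,          # drop a spurious observed phoneme
                            curr[j - 1] + 1,      # account for a missed phoneme
                            prev[j - 1] + cost))  # match or mislabeled phoneme
        prev = curr
    return prev[-1]


def best_word(observed, dictionary):
    """Pick the dictionary word whose phoneme sequence aligns most cheaply."""
    return min(dictionary, key=lambda w: edit_distance(observed, dictionary[w]))


# Hypothetical dictionary: word -> phoneme sequence.
dictionary = {
    "cat": ["k", "ae", "t"],
    "cab": ["k", "ae", "b"],
    "scat": ["s", "k", "ae", "t"],
}

# One phoneme was mislabeled ("eh" instead of "ae"); alignment still finds "cat".
word = best_word(["k", "eh", "t"], dictionary)
```

Because the selection always returns some dictionary entry, an utterance that is not in the dictionary is still mapped to the closest entry.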
By their very nature, current speech recognizers select word candidates from the predefined dictionary and hence they will only recognize a predefined set of words. This raises a problem, particularly in systems where further decisions are made based on the results of speech recognition. Extraneous noise or speech utterances of words that are not found in the dictionary are often incorrectly interpreted as words that are found in the dictionary. Subsequent decisions based on such incorrect recognition can lead to faulty system performance.
To illustrate the problem, consider a spelled-name telephone call routing application. The user is instructed by a synthesized voice prompt to spell the name of the person to whom the call should be routed. If the user follows these instructions, the speech recognizer identifies each of the letters uttered and then is able to look up the spelled name by alignment of the sequence of letters with the dictionary. The system then routes the call to the appropriate extension using routing information found in the dictionary. However, if the user utters extraneous information first, such as by pronouncing the person's name prior to spelling, the recognition process is highly likely to fail. This is because the recognition system expects to receive only a sequence of uttered letters and will attempt to "recognize" the spoken name as one or more letters. The conventional system simply is not equipped to properly segment the incoming acoustic data, because the underlying model on which the system is built assumes a priori that the data are all equivalent units (spoken letters) that are useful or meaningful to the system.
The present invention solves the aforementioned problem through a speech recognition system that employs and integrates multiple grammar networks to generate multiple sets of recognition candidates, some based on a model that assumes the presence of extraneous speech, and some based on a model that assumes the absence of extraneous speech. The results of both models are used to make the ultimate recognition decision, relying on the respective match probability scores to select the most likely candidate.
According to one aspect of the invention, the acoustic speech data are separately processed using different first and second grammar networks that result in different segmentation of the acoustic speech data. In this way, the system will separate speech that has utility from speech that does not. For each grammar network, a plurality of recognition candidates are generated. The preferred embodiment generates the N-best candidates using a first grammar network and the M-best candidates using a second grammar network, where N and M are integers greater than one and may be equal. The first and second plurality of recognition candidates (N-best, M-best) are transformed based on at least one set of a priori constraints about the speech that has utility. The transformation may comprise, for example, matching the candidates to a dictionary of spelled names that are recognized by the system. The recognition decision is then based on the transformed recognition candidates.
As will be more fully explained below, the invention splits the acoustic speech data into two or more paths that are each handled differently. One path is processed using a first grammar network based on the assumption that only useful utterances (e.g., letters) are supplied. Another path is processed using a different grammar network that assumes extraneous, nonuseful speech precedes the useful speech. The different grammar networks thus result in different segmentation of the data.
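As a rough illustration of how two grammar networks can segment the same input differently, the sketch below treats the utterance as a list of pre-cut segments, each carrying a best-guess letter, a letter-model score and a filler-model score. This representation, the log-like scores and both function names are assumptions made for the example; an actual recognizer would decode over acoustic frames and keep several candidates per path rather than one.

```python
def decode_letters_only(segments):
    """First grammar: every segment is assumed to be a spoken letter."""
    letters = [s["letter"] for s in segments]
    score = sum(s["letter_score"] for s in segments)
    return letters, score


def decode_filler_then_letters(segments):
    """Second grammar: one or more leading segments are extraneous speech
    (absorbed by a filler model), followed by spoken letters.
    Returns None when there are fewer than two segments."""
    best = None
    for k in range(1, len(segments)):  # k = number of leading filler segments
        score = (sum(s["filler_score"] for s in segments[:k])
                 + sum(s["letter_score"] for s in segments[k:]))
        letters = [s["letter"] for s in segments[k:]]
        if best is None or score > best[1]:
            best = (letters, score)
    return best


# Hypothetical input: the caller says the name first, then spells J-O-N-E-S.
# Scores are made-up values (higher is better).
segments = [{"letter": "G", "letter_score": -9.0, "filler_score": -2.0}]
segments += [{"letter": c, "letter_score": -1.0, "filler_score": -6.0}
             for c in "JONES"]

path_a = decode_letters_only(segments)         # forced to read the name as a letter
path_b = decode_filler_then_letters(segments)  # absorbs the name as filler
```

In this toy input the first path yields the letter sequence G-J-O-N-E-S, while the second path discards the leading segment and yields J-O-N-E-S.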
The recognition candidates generated by each path may each be scored according to how well each candidate matched the respective models. Rather than requiring the two paths to compete at this stage in order to select the single candidate with the highest score, the two sets of recognition candidates are kept separate. At this stage, the recognition candidates represent the N-best and M-best letter sequence hypotheses. To select which hypothesis is the best candidate, both sets are separately matched to the dictionary of all names that are recognized by the system.
The dictionary is, in effect, a collection of a priori constraints about the speech that will have utility to the system. Thus certain letter sequence hypotheses may be scored as less probable because those letter sequences do not match well the letter sequences stored in the dictionary. The presently preferred embodiment uses the N-best and M-best letter sequences to select the N-best and M-best names from the dictionary. Thus contributions from both paths are included in the decision making process. Finally, the N-best and M-best sets of names may be combined to form a reduced set of dictionary candidates to which the input utterance is applied.
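The dictionary matching and merging described above might look like the following sketch. It assumes Python's standard `difflib.SequenceMatcher` ratio as a stand-in for the probabilistic match score; the helper names, the name list and the letter hypotheses are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(hypothesis, name):
    """Stand-in match score in [0, 1]; a real system would use probabilities."""
    return SequenceMatcher(None, hypothesis, name).ratio()


def best_names(hypotheses, names, k):
    """Return the k dictionary names that best match any letter hypothesis."""
    scored = {n: max(similarity(h, n) for h in hypotheses) for n in names}
    return sorted(names, key=lambda n: scored[n], reverse=True)[:k]


def merge_candidates(n_best_names, m_best_names):
    """Combine both name lists into one reduced candidate set, no duplicates."""
    merged = list(n_best_names)
    merged += [n for n in m_best_names if n not in merged]
    return merged


# Hypothetical dictionary of names and letter hypotheses from the two paths.
names = ["JONES", "JOHNSON", "JAMES", "GOMEZ", "STONE"]
n_best_letters = ["GJONES", "GJOMES"]  # path that assumed letters only
m_best_letters = ["JONES", "JOMES"]    # path that assumed leading extraneous speech

reduced = merge_candidates(best_names(n_best_letters, names, 2),
                           best_names(m_best_letters, names, 2))
```

Keeping the two name lists separate until the merge means a strong hypothesis from one path cannot crowd out the other path's candidates before the dictionary constraints have been applied.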
This reduced-size dictionary can be used to build a dynamic grammar from the N-best and M-best name candidates. This dynamic grammar will tend to favor one set of candidates or the other, depending on whether the input utterance contains extraneous speech or not. If extraneous speech is present, the grammar network designed to identify and reject the extraneous speech will tend to produce better recognition results, and those results will be reflected as the better candidates in the dynamic grammar that is constructed from the N-best and M-best name candidates. On the other hand, if no extraneous speech is present, the other grammar network will produce better recognition results, and those results will be reflected as the better candidates in the dynamic grammar.
Once the dynamic grammar is constructed, the input acoustic speech data may be processed using a recognizer based on the dynamic grammar to extract the single most probable name candidate as the recognized name. The recognized name is then used to access a suitable database for routing the telephone call appropriately.
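The final pass and the routing lookup can be sketched as follows, under the assumption that the dynamic grammar is simply the reduced list of name candidates and that some scoring function rates the input utterance against each one. The function names, the scores and the extension numbers are invented for illustration.

```python
def recognize_with_dynamic_grammar(candidates, utterance_score):
    """Re-score the utterance against the reduced candidate set only and
    return the single most probable name."""
    return max(candidates, key=utterance_score)


def route_call(name, extensions):
    """Look up the routing information stored for the recognized name."""
    return extensions[name]


# Hypothetical reduced candidate set, scores (higher is better) and directory.
candidates = ["JONES", "JAMES", "GOMEZ"]
made_up_scores = {"JONES": -7.0, "JAMES": -11.5, "GOMEZ": -13.0}
extensions = {"JONES": "4021", "JAMES": "4077", "GOMEZ": "4130"}

recognized = recognize_with_dynamic_grammar(candidates, made_up_scores.get)
extension = route_call(recognized, extensions)
```

Restricting the last pass to a handful of names keeps the search small compared with decoding against the full dictionary.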