This invention relates to a system and method for optimization of searching for continuous speech recognition.
Speech recognition for applications such as automated directory enquiry assistance and control of operation based on speech input requires a real time response. Spoken input must be recognized within about half a second of the end of the spoken input to simulate the response of a human operator and avoid a perception of unnatural delay.
Processing of speech input falls into five main steps: audio channel adaptation, feature extraction, word end point detection, speech recognition, and accept/reject decision logic. Pattern recognition generally, and more particularly recognition of patterns in continuous signals such as speech signals, requires complex calculations and is dependent on providing sufficient processing power to meet the computational load. Thus the speech recognition step is the most computationally intensive step of the process.
The computational load is dependent on the number of words or other elements of speech, which are modeled and held in a dictionary, for comparison to the spoken input (i.e. the size of vocabulary of the system); the complexity of the models in the dictionary; how the speech input is processed into a representation ready for comparison to the models; and the algorithm used for carrying out the comparison process. Numerous attempts have been made to improve the trade off between computational load, accuracy of recognition and speed of recognition.
Examples are described, e.g., in U.S. Pat. No. 5,390,278 to Gupta et al., and U.S. Pat. No. 5,515,475 to Gupta et al. Many other background references are included in the above referenced copending applications.
In order to provide speech recognition which works efficiently in real time, two approaches are generally considered. The first is to make use of specialized hardware or parallel processing architectures. The second is to develop optimized search methods based on search algorithms that yield reasonable accuracy, but at a fraction of the cost of more optimal architectures. The latter approach is favored by many researchers, since it tackles the problem at the source; see, for example, Schwartz, R., Nguyen, L., Makhoul, J., “Multiple-pass search strategies”, in Automatic Speech and Speaker Recognition, Lee, C. H., Soong, F. K., Paliwal, K. K. (eds.), Kluwer Academic Publishers (1996), pp 429-456. This approach is appealing since the hardware and algorithmic optimizations are often orthogonal, so the latter can always be built on top of the former.
The basic components of a spoken language processing (SLP) system include a continuous speech recognizer (CSR) for receiving spoken input from the user and a Natural Language Understanding component (NLU), represented schematically in FIG. 1. A conventional system operates as follows. Speech input is received by the CSR, and a search is performed by the CSR using acoustic models that model speech sounds, and a language model or ‘grammar’ that describes how words may be connected together. The acoustic model is typically in the form of Hidden Markov Models (HMM) describing the acoustic space. The language knowledge is usually used for both the CSR component and the NLU component, as shown in FIG. 1, with information on grammar and/or statistical models being used by the CSR, and semantic information being used by the NLU. The structure of the language is often used to constrain the search space of the recognizer. If the goal is to recognize unconstrained speech, the language knowledge usually takes the form of a statistical language model (bigram or trigram). If the goal is to recognize a specific constrained vocabulary, then the language knowledge takes the form of a regular grammar.
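By way of illustration only, a statistical language model of the bigram kind mentioned above can be sketched in Python as a table of relative frequencies. The function name `train_bigram`, the sentence-start token `<s>`, and the three training utterances are assumptions made for this sketch; practical models also apply smoothing, which is omitted here.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(cur | prev) by relative frequency (no smoothing)."""
    context_counts = Counter()  # how often each word appears as a predecessor
    pair_counts = Counter()     # how often each (prev, cur) pair appears
    for words in sentences:
        tokens = ["<s>"] + list(words)  # "<s>" marks the sentence start
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1

    def prob(prev, cur):
        # Unseen predecessors get probability 0.0 in this unsmoothed sketch.
        if context_counts[prev] == 0:
            return 0.0
        return pair_counts[(prev, cur)] / context_counts[prev]

    return prob


# Hypothetical training utterances, for illustration only.
prob = train_bigram([["call", "home"], ["call", "office"], ["dial", "home"]])
```

Here `prob("call", "home")` is the fraction of occurrences of "call" that are followed by "home". A recognizer would use such values to weight transitions between words during the search.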
The search passes the recognized word strings representing several likely choices, in the form of a graph, to the natural language understanding component for extracting meaning from the recognized word strings. The language model provides knowledge to the NLU relating to understanding of the recognized word strings. More particularly the semantic information from the language knowledge is fed exclusively to the NLU component with information on how to construct a meaning representation of the CSR's output. This involves, among other things, identifying which words are important to the meaning and which are not. The latter are referred to as non-keywords or semantically-null words. Thus semantically-meaningful words and semantically-null words are identified to provide understanding of the input, and in the process, the word strings are converted to a standard logical form. The logical form is passed to a discourse manager DM, which is the interface between the user and the application. The DM gathers the necessary information from the user to request the applications to perform the user's goal by prompting the user for input.
While the terms ‘grammar’ and ‘language model’ are often used interchangeably, in this application, a language model is defined as the graph that is used by the CSR search algorithm to perform recognition. A grammar is a set of rules, which may also be represented as a graph, used by the NLU component to extract meaning from the recognized speech. There may be a one to one mapping between the language model and the grammar in the case where the language model is a constrained model. Connected Word Recognition (CWR) is an example of the latter. Nevertheless, known spoken language systems described above separate language knowledge into grammar and semantic information, and feed the former to the CSR and feed the latter to the NLU.
Most search optimization techniques involve reducing computation by making use of local scores during the decoding of a speech utterance. Copending U.S. application Ser. No. 09/118,621 entitled “Block algorithm for pattern recognition”, referenced above, describes in detail an example of a search algorithm and scoring method.
For example, the Viterbi beam search, without a doubt the most widely used optimization, prunes the paths whose scores (likelihoods) are outside a beam determined by the best local score. Some neural-network based approaches threshold the posterior probabilities of each state to determine if it should remain active (Bourlard, H., Morgan, N., “Connectionist Speech Recognition: A Hybrid Approach”, Kluwer Academic Press, 1994).
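As a rough illustration of the beam pruning described above, the following Python sketch keeps only those paths whose log-likelihood lies within a fixed beam of the best local score. The function name `prune_beam`, the dictionary layout, and the beam width are assumptions made for this sketch and are not taken from the cited references.

```python
def prune_beam(active_paths, beam_width):
    """Keep paths whose log score is within beam_width of the best local score.

    active_paths maps a state identifier to its log-likelihood
    (a hypothetical layout chosen for this sketch).
    """
    if not active_paths:
        return {}
    best = max(active_paths.values())
    threshold = best - beam_width
    return {
        state: score
        for state, score in active_paths.items()
        if score >= threshold
    }


# Hypothetical scores for three active states at one time frame.
paths = {"s1": -10.0, "s2": -12.5, "s3": -30.0}
survivors = prune_beam(paths, beam_width=5.0)
```

With these values, state `s3` falls outside the beam and is dropped, so no further computation is spent extending it in later frames.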
Another important technique that helped reduce the computation burden was the use of lexical trees instead of dedicated acoustic networks as described by Ney, H., Aubert, X., “Dynamic Programming Search Strategies: From Digit Strings to Large Vocabulary Word Graphs”, in Automatic Speech and Speaker Recognition, Lee, C. H., Soong, F. K., Paliwal, K. K. (eds.), Kluwer Academic Publishers (1996), pp 385-411. Along with that idea came language model look-ahead techniques to enhance the pruning described by Murveit, H., Monaco, P., Digalakis, V., Butzberger, J., “Techniques to Achieve an Accurate Real-Time Large-Vocabulary Speech Recognition System”, in ARPA Workshop on Human Language Technology, pp 368-373.
While these techniques are undisputedly effective at solving these specific problems, in all cases, the sole sources of “language knowledge” used to reduce the search space are the language model and the grammar layout; semantic information is not used by the CSR.
Word spotting techniques are an attempt to indirectly use semantic information by focusing the recognizer on the list of keywords (or key phrases) that are semantically meaningful. Some word spotting techniques use background models of speech in an attempt to capture every word that is not in the word spotter's dictionary, including semantically null words (non-keywords) (Rohlicek, J. R., Russel, W., Roukos, S., Gish, H., “Word Spotting”, ICASSP 1989, pp 627-630).
While word spotting is generic, it is very costly and provides poor accuracy, especially when there is prior knowledge of which non-keywords are likely to be used. Because the background models are so broad, they do not always efficiently model non-keywords which are likely to occur in an utterance (for example, hesitations and polite formulations).
To overcome the low accuracy problems encountered in word spotting, Large Vocabulary Continuous Speech Recognizers (LVCSR) are used in the hope that any semantically null word will exist in the recognizer's vocabulary (Weintraub, M., “LVCSR Log-Likelihood Ratio Scoring For Keyword Spotting”, ICASSP 1995, Vol 1, pp 297-300). The output of the recognizer in this case is a string of keywords and non-keywords that is later processed by an NLU module to extract meaning. Language knowledge is separated into grammar and statistical information, which are used by the CSR, and semantic information, which is used by the NLU.
In all these approaches, the CSR recognizer simply outputs a string of keywords and non-keywords for further processing using semantic information: it does not make use of semantic information during the search. Consequently there is a need for further optimization of continuous speech recognizers.
Thus, the present invention seeks to provide a system and method for optimization of searching for continuous speech recognizers which overcomes or avoids the above mentioned problems.
Therefore, according to a first aspect of the present invention there is provided a method for continuous speech recognition comprising: incorporating semantic information during searching by a continuous speech recognizer.
Beneficially, incorporating semantic information during searching comprises searching using semantic information to identify semantically null words and thereby generate an N-best list of salient words, instead of an N-best list of both salient and semantically null words.
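The distinction between the two kinds of N-best list can be sketched in Python as follows. This sketch merely post-filters a conventional list of scored word strings to show what an N-best list of salient words contains; the method described above instead obtains such a list during the search. The function name `salient_n_best`, the example utterances, and the scores are hypothetical.

```python
def salient_n_best(hypotheses, null_words, n):
    """Reduce scored word strings to the N best distinct salient sequences.

    hypotheses: list of (word_list, log_score) pairs, higher is better.
    null_words: set of semantically null words, assumed known in advance.
    """
    best = {}
    for words, score in hypotheses:
        # Drop semantically null words; only salient words form the key.
        key = tuple(w for w in words if w not in null_words)
        if key not in best or score > best[key]:
            best[key] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Hypothetical recognizer output: four word strings with log scores.
hyps = [
    (["please", "call", "home"], -5.0),
    (["uh", "call", "home"], -6.0),
    (["please", "call", "rome"], -7.5),
    (["call", "home", "thanks"], -8.0),
]
result = salient_n_best(hyps, {"please", "uh", "thanks"}, n=2)
```

Three of the four word strings differ only in their semantically null words, so they collapse to the single salient sequence ("call", "home"). A list of salient words therefore contains fewer distinct entries to score than a list of both salient and semantically null words.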
The savings, which reduce processing time both during the forward and the backward passes of the search, as well as during re-scoring, are achieved by performing only the minimal amount of computation required to produce an exact N-best list of semantically meaningful words (N-best list of salient words). This departs from the standard Spoken Language System modeling in which any notion of meaning is handled by the Natural Language Understanding (NLU) component. By expanding the task of the recognizer component from a simple acoustic match to allow semantic information to be fed to the recognizer, significant processing time savings are achieved. Thus, for example, it is possible to run an increased number of speech recognition channels in parallel for improved performance, which may enhance users' perception of value and quality of service.
According to another aspect of the present invention, there is provided a method for continuous speech recognition comprising: providing speech input to a continuous speech recognizer; providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words; and performing word matching to output from the speech recognizer the N-best salient word sequences.
Advantageously, the step of performing recognition comprises: detecting connected word grammars bounded by semantically null words; collapsing each list of semantically null words into a unique single-input single-output acoustic network; and identifying stop nodes in the acoustic network.
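A simplified picture of the collapsing step is sketched below in Python. The grammar is represented, purely as an assumption of this sketch, as a sequence of slots, each holding alternative words; any slot consisting only of semantically null words is replaced by a single named node standing in for a single-input single-output network. The names `collapse_null_slots` and `null_net_k` are hypothetical, and the sketch operates on words rather than on acoustic models.

```python
def collapse_null_slots(slots, null_words):
    """Replace each all-null slot by one node with one entry and one exit.

    slots: list of lists of alternative words (hypothetical grammar layout).
    Returns a list of (kind, name, words) tuples in grammar order.
    """
    collapsed = []
    null_count = 0
    for slot in slots:
        if all(word in null_words for word in slot):
            # The whole list of null words becomes one uniquely named network.
            null_count += 1
            name = f"null_net_{null_count}"
            collapsed.append(("NULL", name, tuple(sorted(slot))))
        else:
            collapsed.append(("SALIENT", None, tuple(sorted(slot))))
    return collapsed


# Hypothetical connected word grammar bounded by semantically null words.
grammar = [
    ["please", "uh"],
    ["call", "dial"],
    ["home", "office"],
    ["thanks", "please"],
]
network = collapse_null_slots(grammar, {"please", "uh", "thanks"})
kinds = [kind for kind, _, _ in network]
```

In this example the leading and trailing slots each become one null network, while the two salient slots between them are left unchanged.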
Thus, during a forward pass of a search, forward stop nodes are detected, signaling the search to stop forward scoring along a path currently being followed, and during a backward pass of the search backward stop nodes are detected, signaling the search to stop backward scoring along a path currently being followed. Then, for example, right-most semantically null networks are not computed, and some semantically salient words are not backward-scored. Thus an N-best list of only salient words is re-scored instead of a true N-best list.
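The effect of a forward stop node can be illustrated with the toy Python sketch below, which propagates best scores through a small word-level graph and does not expand any path beyond a stop node. This is a sketch under simplifying assumptions: it ignores acoustic frames and HMM states, the graph is acyclic, and the names `forward_score`, `arcs`, and `arc_scores` and all score values are hypothetical.

```python
def forward_score(arcs, arc_scores, start, stop_nodes):
    """Propagate best log scores from start, halting at forward stop nodes.

    arcs: dict mapping a node to its list of successor nodes (acyclic).
    arc_scores: dict mapping (node, successor) to a log score.
    """
    best = {start: 0.0}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        if node in stop_nodes:
            continue  # stop forward scoring along this path
        for nxt in arcs.get(node, ()):
            score = best[node] + arc_scores[(node, nxt)]
            if nxt not in best or score > best[nxt]:
                best[nxt] = score
                frontier.append(nxt)
    return best


# Hypothetical graph: two salient words followed by a trailing null network.
arcs = {
    "start": ["call"],
    "call": ["home"],
    "home": ["null_tail"],
    "null_tail": ["end"],
}
arc_scores = {
    ("start", "call"): -1.0,
    ("call", "home"): -2.0,
    ("home", "null_tail"): -4.0,
    ("null_tail", "end"): -0.5,
}
best = forward_score(arcs, arc_scores, "start", stop_nodes={"home"})
```

Because "home" is marked as a stop node, the trailing null network is never scored, which corresponds to the right-most semantically null networks not being computed.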
Advantageously, scoring comprises Viterbi scoring or other known methods. The method above may be combined with other techniques to save processing time. For example, searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method above.
According to another aspect of the invention there is provided software on a machine readable medium for performing a method of continuous speech recognition comprising: incorporating semantic information during searching by a continuous speech recognizer.
Preferably, the method comprises searching using semantic information to identify semantically null words and thereby generate a list of N-best salient words.
Yet another aspect of the invention provides software on a machine readable medium for performing a method for continuous speech recognition comprising: providing speech input to a continuous speech recognizer; providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; performing recognition of speech input using semantic information to eliminate semantically null words from the N-best list of words and restrict searching to an N-best list of salient words.
Another aspect of the invention provides a system for continuous speech recognition comprising:
means for incorporating semantic information during searching by a continuous speech recognizer; input means for providing speech input to the continuous speech recognizer; means for providing to the continuous speech recognizer an acoustic model comprising a set of Hidden Markov Models, and a language model comprising both grammar and semantic information; the continuous speech recognizer comprises means for performing recognition of speech input using the semantic information for eliminating semantically null words from the N-best list of words and thereby restricting searching to an N-best list of salient words, and performing word matching to output the N-best salient word sequences.
According to a further aspect of the present invention there is provided a spoken language processing system for speech recognition comprising: a continuous speech recognition component (CSR); a natural language understanding component (NLU); means for providing speech input to the CSR; means for providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models; means for providing language knowledge comprising grammar and statistical models to the CSR, and means for providing semantic knowledge to the NLU, and means for providing semantic knowledge to the CSR; the CSR being operable for searching using the semantic knowledge to constrain the search to an N-best list of salient words, and perform word matching to output the N-best list of salient words to the NLU for interpretation of meaning.
Another aspect of the present invention provides a method for continuous speech recognition using a spoken language system comprising a continuous speech recognition component (CSR) linked to a natural language understanding component (NLU), the method comprising: providing speech input to the CSR; providing acoustic-phonetic knowledge to the CSR comprising a set of Hidden Markov Models; providing language knowledge comprising grammar and statistical models to the CSR; providing semantic knowledge to the CSR; performing searching with the CSR using the semantic knowledge to constrain the search to an N-best list of salient words comprising semantically meaningful words of the N-best list of words; and performing word matching to output the N-best salient word sequences to the NLU.
The method and system described above may be combined with other techniques to save processing time. For example, searching may alternatively be based on beam searches and lexical trees to provide benefits of those methods in addition to benefits of the method described above.
Thus systems and methods are provided which allow considerable savings in computation time, so that more complex speech applications may be implemented on smaller and older platforms. Thus existing products with older processors may advantageously be upgraded to provide extended services. In newer products and processors, the number of simultaneous channels that can be supported is higher, reducing the cost of deploying services. Improved performance may enhance users' perception of value and quality of service.