1. Technical Field
The present invention relates to the field of speech processing and speech recognition in general. In particular, the invention relates to systems and methods for generating an output by means of a speech input.
2. Description of the Related Art
Due to recent advances in computer technology, as well as recent advances in the development of algorithms for speech recognition and processing, speech recognition systems have become increasingly powerful while becoming less expensive. Certain speech recognition systems can match the words to be recognized with words of a vocabulary. The words in the vocabulary usually are represented by word models, which can be referred to as word baseforms. For example, a word can be represented by a sequence of Markov models. The word models can be used in connection with the speech input in order to match the input to the words in the vocabulary.
Most of today's speech recognition systems are continuously being improved by providing larger vocabularies or by increasing the recognition rate by employing improved algorithms. Such systems typically can include 100,000 words. Other products, for example the ViaVoice family of software available from International Business Machines Corporation, can include approximately 240,000 word entries. Many commercially available speech recognition systems operate by comparing a spoken utterance against each word in the system's vocabulary. Since each such comparison can require thousands of computer instructions, the amount of computation required to recognize an utterance grows dramatically with increasing vocabulary size. This increase in computation has been a major problem in the development of large vocabulary systems.
Some speech recognition systems can be trained by the user uttering a training text of known words. Through this training process, the speech recognition system can be tailored to a particular user. Such training can lead to an improved recognition rate. Additionally, there are bi-gram and tri-gram based recognition systems that can distinguish between like-sounding words such as ‘to’, ‘two’, and ‘too’, by analyzing such words in a context of two consecutive words (bi-gram technology, also called di-gram technology) or three consecutive words (tri-gram technology). The bi-gram technology and the tri-gram technology also can lead to an improved recognition rate.
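The two-word context idea described above can be illustrated with a minimal Python sketch. The training text, the function names, and the choice of using only the preceding word as context are assumptions made for illustration and are not taken from any particular recognition system.

```python
from collections import Counter

# Toy training text (hypothetical; a real system would use a large corpus).
TRAINING_TEXT = (
    "i want to go home . i have two dogs . it is too late . "
    "i want two apples . he went to school"
)

# Like-sounding candidates that acoustic matching alone cannot separate.
HOMOPHONES = ["to", "two", "too"]


def bigram_counts(text):
    """Count how often each pair of consecutive words occurs in the text."""
    tokens = text.split()
    return Counter(zip(tokens, tokens[1:]))


def pick_homophone(previous_word, candidates, counts):
    """Return the candidate that most often followed previous_word in training.

    On a tie, the candidate listed first is returned.
    """
    return max(candidates, key=lambda word: counts[(previous_word, word)])


COUNTS = bigram_counts(TRAINING_TEXT)
```

For example, under these assumptions pick_homophone("have", HOMOPHONES, COUNTS) selects "two", because "have two" is the only one of the three pairs that occurs in the toy training text. A tri-gram variant would count triples of consecutive words in the same way.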
One problem with conventional speech recognition systems is that as the system vocabulary grows, the number of words that are similar in sound also tends to grow. As a result, there is an increased likelihood that an utterance corresponding to a given word from the vocabulary will be misrecognized as corresponding to another similar sounding word from the vocabulary.
Different approaches are known in the art for reducing the likelihood of word confusion. One such method is called “pruning”. Pruning is a common computer technique used to reduce computation. Generally speaking, pruning reduces the number of cases that are considered by eliminating some cases from further consideration. Scores (representing the likelihood of occurrence in an input) can be assigned to the words in a vocabulary. The scores can be used to eliminate words from consideration during the recognition task. The scores can be updated during the recognition task, and words that are deemed irrelevant to the recognition are not considered any further.
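Score-based pruning of this kind can be sketched as follows. This is a minimal illustration, assuming log-likelihood scores that are updated once per input frame and a fixed pruning margin (here called beam); the function names, the default penalty, and the example values are hypothetical.

```python
def prune(scores, beam):
    """Keep only the words whose score is within `beam` of the best score."""
    best = max(scores.values())
    return {word: s for word, s in scores.items() if s >= best - beam}


def recognize(frame_scores, vocabulary, beam, missing_penalty=-10.0):
    """Update word scores frame by frame, pruning unlikely words each time.

    frame_scores: list of dicts mapping word -> log-likelihood increment
                  for one frame of input (assumed to be given).
    Returns the best surviving word and the dict of surviving scores.
    """
    active = {word: 0.0 for word in vocabulary}
    for frame in frame_scores:
        # Only words that survived earlier pruning are scored again.
        active = {
            word: s + frame.get(word, missing_penalty)
            for word, s in active.items()
        }
        active = prune(active, beam)
    return max(active, key=active.get), active


VOCABULARY = ["to", "two", "ten", "cat"]
FRAMES = [
    {"to": -1.0, "two": -1.5, "ten": -2.0, "cat": -8.0},
    {"to": -1.0, "two": -0.25, "ten": -6.0, "cat": -1.0},
]
BEST_WORD, SURVIVORS = recognize(FRAMES, VOCABULARY, beam=3.0)
```

With these example values, "cat" is eliminated after the first frame and is never scored again, even though it would have scored well on the second frame. This shows both the computational saving and the risk inherent in pruning.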
Another technique used to cope with large vocabulary systems is that of hypothesis and test, which is, in effect, also a type of pruning. When features are observed in a speech input, the features are used to form a hypothesis that the word actually spoken corresponds to a subset of words from the original vocabulary. The speech input can be processed further by performing a more lengthy match of each word in this sub-vocabulary against the received acoustic signal. This sub-vocabulary is directly derived from the speech input.
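The hypothesis and test approach can be sketched in Python as a two-stage match. In this illustration the acoustic signal is replaced by a list of phoneme symbols, the hypothesis is formed from a single cheap feature (the first phoneme), and the lengthy match is an edit distance over the full sequence. The lexicon, the feature choice, and all names are assumptions made for illustration only.

```python
# Hypothetical lexicon mapping each word to a phoneme sequence.
LEXICON = {
    "two": ["t", "uw"],
    "ten": ["t", "eh", "n"],
    "tent": ["t", "eh", "n", "t"],
    "men": ["m", "eh", "n"],
    "cat": ["k", "ae", "t"],
}


def hypothesize(observed, lexicon):
    """Cheap stage: keep the words that share the first observed phoneme."""
    return [word for word, pron in lexicon.items() if pron[0] == observed[0]]


def edit_distance(a, b):
    """Expensive stage: Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def recognize(observed, lexicon):
    """Run the detailed match only on the hypothesized sub-vocabulary."""
    sub_vocabulary = hypothesize(observed, lexicon)
    if not sub_vocabulary:
        return None  # no word survived the hypothesis stage
    return min(sub_vocabulary, key=lambda w: edit_distance(observed, lexicon[w]))


OBSERVED = ["t", "eh", "m"]  # a noisy rendering of "ten" (assumed input)
```

Here the detailed match runs on three of the five lexicon entries rather than all five. In a large vocabulary the same idea would restrict the costly comparison to a small fraction of the words.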
Yet another approach for dealing with the large computational demands of speech recognition in large vocabulary systems is the development of special purpose hardware to increase significantly the speed of such processing. There are, for example, special purpose processors that perform probabilistic frame matching at high speed.
There are a host of other problems which have been encountered in known speech recognition systems. These problems can include, but are not limited to, background noise, speaker-dependent utterance of words, and insufficient processing speed. All of these disadvantages and problems have so far prevented widespread use of speech recognition in many market domains. Accordingly, despite the recent advances in speech recognition technology, there is a great need to improve further the performance of speech recognition systems before such systems find larger distribution in the market.