In automatic speech recognition, a user produces sound that is detected by an audio circuit. The audio circuit provides audio information to a speech decoder. The speech decoder produces outputs indicative of words that are resolved from the inputs from the audio circuit. Many forms of known models for phones and words are used to produce a best correlation between a spoken word and a word in the vocabulary of the speech decoder. A phoneme is a basic unit of sound in a language. Most languages can be resolved into approximately 45 phonemes. A phone is a unit based on a phoneme that is resolved by the decoder.
After the decoder resolves phone and word level information, further operations are utilized to generate further information, particularly information vectors, on which a confidence measure for the resolved word will be based. Different known schemes utilize language models, parsers or other measures that take decoder outputs as input information. A confidence measure is produced by a classifier. The output of the classifier is generally compared to a threshold level. If the confidence level is at least equal to the threshold level, then an accept signal is produced. If the confidence level is lower than the threshold, a reject signal is provided. Misread words or out-of-vocabulary words (OOVs) are thereby not reported by the speech recognizer.
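The accept/reject decision described above can be sketched in a few lines. This is a minimal illustration only: the names `Hypothesis`, `decide` and `report`, the confidence range of 0 to 1, and the default threshold of 0.5 are assumptions made for the example and are not taken from any particular recognizer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A word resolved by the decoder, with its classifier confidence."""
    word: str
    confidence: float  # assumed to lie in [0, 1] for this sketch


def decide(hyp: Hypothesis, threshold: float) -> str:
    """Accept when the confidence is at least equal to the threshold."""
    return "accept" if hyp.confidence >= threshold else "reject"


def report(hyps: List[Hypothesis], threshold: float = 0.5) -> List[str]:
    """Return only accepted words; rejected words are not reported."""
    return [h.word for h in hyps if decide(h, threshold) == "accept"]


# Hypothetical decoder output: the low-confidence entry stands in for
# a misread or out-of-vocabulary word.
hypotheses = [
    Hypothesis("call", 0.91),
    Hypothesis("zorp", 0.12),
    Hypothesis("home", 0.50),
]
reported = report(hypotheses, threshold=0.5)
```

Note that a confidence exactly equal to the threshold is accepted, matching the "at least equal to" condition in the text.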
The classifiers may use different methods of analysis to produce an output. The most commonly utilized classifiers are Linear Discriminant Analysis (LDA) classifiers and Artificial Neural Network (ANN) classifiers. It is desirable to provide improved automatic speech recognition systems in which an improved neural network classifier interacts within the system. Neural networks require the use of training algorithms so that they can determine when a speech result appears to be reliable. Particularly for larger neural networks, fast convergence is required. In other words, within a limited number of layers, the information vectors generated based on outputs of the decoder must be made to converge to a single dimension. Additionally, generalization, the ability to perform when operating on new test data, must also be provided.
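The reduction of an information vector to a single dimension can be illustrated with a small feed-forward pass. The sketch below assumes sigmoid nodes, one hidden layer, a four-element information vector and untrained random weights; all of these, along with the function names, are illustrative choices rather than details of any described system.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def layer(x, weights, biases):
    """Apply one fully connected sigmoid layer to the vector x."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def confidence(x, params):
    """Pass x through each layer; the last layer has a single node."""
    for weights, biases in params:
        x = layer(x, weights, biases)
    return x[0]


rng = random.Random(0)
# Hypothetical information vector derived from decoder outputs.
info_vector = [0.8, 0.1, 0.6, 0.3]
# 4 inputs -> 3 hidden nodes -> 1 output node.
params = [init_layer(4, 3, rng), init_layer(3, 1, rng)]
score = confidence(info_vector, params)
```

The resulting `score` is a single value between 0 and 1 that could then be compared to a threshold. Training the weights, which is where convergence speed and generalization matter, is outside the scope of this sketch.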
Various prior art training algorithms have been used for generation of confidence measurements. The most well-known is the Error Back-forward Propagation (EBP) algorithm. This algorithm is inefficient in that it has slow convergence and long training time compared to other competing techniques. One drawback is the false saturation of output nodes. The algorithm is also known as the Mean Square Error (MSE) algorithm. As an alternative, the Cross Entropy (CE) error function has been used to resolve the false saturation, leading to fast convergence. However, this algorithm also leads to over-specialized learning. Therefore, the trained network does not provide for generalization.
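The difference between the two error functions can be seen by comparing their gradients at a saturated output node. The sketch below assumes a single sigmoid output node and reads "false saturation" as an output stuck near the wrong extreme (output close to 0 while the target is 1). The function names are illustrative. For output y = sigmoid(z) and target t, the MSE error 0.5 * (y - t)^2 has gradient (y - t) * y * (1 - y) with respect to z, while the CE error -(t * log(y) + (1 - t) * log(1 - y)) has gradient y - t.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def mse_grad(z: float, t: float) -> float:
    """Gradient of 0.5 * (y - t)**2 with respect to z, where y = sigmoid(z)."""
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)


def ce_grad(z: float, t: float) -> float:
    """Gradient of the cross-entropy error with respect to z, where y = sigmoid(z)."""
    y = sigmoid(z)
    return y - t


# Falsely saturated node: the pre-activation is strongly negative,
# so the output is near 0, but the target is 1.
z, target = -10.0, 1.0
g_mse = mse_grad(z, target)  # magnitude below 1e-4: almost no learning signal
g_ce = ce_grad(z, target)    # close to -1: a strong corrective signal
```

Under these assumptions the MSE gradient nearly vanishes because of the y * (1 - y) factor, which is consistent with the slow convergence described above, whereas the CE gradient remains proportional to the error. The strong CE gradient also persists on hard training examples, which is one plausible reading of why it tends toward over-specialized learning.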