A phoneme is the minimal unit of speech sound in a language that can serve to distinguish meaning. Phoneme recognition can be applied to improve automatic speech recognition, and it also has applications in speaker recognition, language identification, and keyword spotting. Thus, phoneme recognition has received much attention in the field of automatic speech recognition.
One common and successful approach for phoneme recognition uses a hierarchical neural network structure based on a hybrid hidden Markov model (HMM)-multilayer perceptron (MLP) arrangement. The MLP outputs are used as HMM state emission probabilities in a Viterbi decoder. This approach has the considerable advantage that the MLP can be trained to discriminatively classify phonemes. The MLP can also easily incorporate a long temporal context without making explicit assumptions. This property is particularly important for phoneme recognition because phoneme characteristics can be spread over a large temporal context.
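By way of illustration only, the decoding step described above can be sketched as follows. The sketch assumes one HMM state per phoneme and uses the per-frame MLP outputs directly as emission scores, as stated above. The function name `viterbi`, the transition values, and the toy posteriors are hypothetical and are not taken from any cited reference.

```python
import math

def viterbi(posteriors, log_trans, log_init, floor=1e-12):
    """Return the most likely state sequence for a list of frames.

    posteriors[t][s] : MLP output for state s at frame t (used as emission score)
    log_trans[i][j]  : log probability of moving from state i to state j
    log_init[s]      : log probability of starting in state s
    floor            : lower bound that avoids log(0) (an assumption of this sketch)
    """
    n_states = len(log_init)
    # Best log score of any path ending in each state at the first frame.
    score = [log_init[s] + math.log(max(posteriors[0][s], floor))
             for s in range(n_states)]
    backpointers = []
    for frame in posteriors[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            # Best predecessor state for arriving in state j.
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j]
                             + math.log(max(frame[j], floor)))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Toy example with two states and four frames (all values hypothetical).
toy_posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.1, 0.9]]
toy_log_trans = [[math.log(0.8), math.log(0.2)],
                 [math.log(0.2), math.log(0.8)]]
toy_log_init = [math.log(0.5), math.log(0.5)]
path = viterbi(toy_posteriors, toy_log_trans, toy_log_init)
print(path)
```

Working in the log domain avoids numerical underflow on long utterances. A practical hybrid decoder would typically add further elements, such as multiple states per phoneme, which this sketch omits.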
Many different approaches have been proposed to further exploit the contextual information of a phoneme. One approach is based on a combination of different specialized classifiers that provides considerable improvements over simple generic classifiers. For instance, in the approach known as TRAPS, long temporal information is divided into frequency bands, and then several classifiers are independently trained using specific frequency information over a long temporal range. See H. Hermansky and S. Sharma, Temporal Patterns (TRAPS) in ASR of Noisy Speech, in Proc. ICASSP, 1999, vol. 1, pp. 289-292, incorporated herein by reference. Another technique splits a long temporal context in time. See D. Vasquez et al., On Expanding Context By Temporal Decomposition For Improving Phoneme Recognition, in SPECOM, 2009, incorporated herein by reference. A combination of these two approaches, which splits the context in both time and frequency, is evaluated in P. Schwarz et al., Hierarchical Structures Of Neural Networks For Phoneme Recognition, in Proc. ICASSP, 2006, pp. 325-328, incorporated herein by reference.
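The two ways of dividing a long context window can be illustrated with a minimal sketch. It assumes the window is stored as a list of frames, each holding one energy value per frequency band, and that the split in time uses contiguous slices. The helper names `split_by_band` and `split_by_time` are hypothetical and do not reproduce the exact procedures of the cited references.

```python
def split_by_band(window):
    """Split in frequency: return one temporal trajectory per band.

    window[t][b] is the energy of band b at frame t. Each returned list
    would feed a separate band-specific classifier.
    """
    return [list(trajectory) for trajectory in zip(*window)]

def split_by_time(window, n_parts):
    """Split in time: return n_parts contiguous slices of the window.

    Each slice would feed a separate classifier specialized on that
    part of the temporal context.
    """
    n_frames = len(window)
    bounds = [i * n_frames // n_parts for i in range(n_parts + 1)]
    return [window[bounds[i]:bounds[i + 1]] for i in range(n_parts)]

# Toy window: 4 frames x 3 bands, value = 10 * frame index + band index.
window = [[10 * t + b for b in range(3)] for t in range(4)]
band_inputs = split_by_band(window)
time_inputs = split_by_time(window, 2)
print(band_inputs)
print(time_inputs)
```

In both cases the outputs of the specialized classifiers would then be combined by a further stage, which is not shown here.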
Another phoneme recognition structure was proposed in J. Pinto et al., Exploiting Contextual Information For Improved Phoneme Recognition, in Proc. ICASSP, 2008, pp. 4449-4452, (hereinafter “Pinto”, incorporated herein by reference). Pinto suggested estimating phoneme posteriors using a two-layer hierarchical structure. A first MLP estimates intermediate phoneme posteriors based on a temporal window of cepstral features, and then a second MLP estimates final phoneme posteriors based on a temporal window of intermediate posterior features. The final phoneme posteriors are then input to a phonetic decoder for obtaining a final recognized phoneme sequence.
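The data flow of such a two-layer hierarchy can be sketched as follows. This is an illustrative sketch only: `stack_context`, `hierarchical_posteriors`, and `toy_classifier` are hypothetical names, the toy classifier merely stands in for a trained MLP, and padding the utterance edges by repeating the first and last frames is an assumption of the sketch.

```python
import math

def stack_context(frames, radius):
    """Concatenate each frame with `radius` neighbours on each side.

    Edges are padded by repeating the first or last frame (an assumption).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        context = []
        for k in range(t - radius, t + radius + 1):
            context.extend(frames[min(max(k, 0), n - 1)])
        stacked.append(context)
    return stacked

def hierarchical_posteriors(features, mlp1, mlp2, radius1, radius2):
    """Run the two-layer hierarchy and return both sets of posteriors."""
    # First level: window of cepstral features -> intermediate posteriors.
    intermediate = [mlp1(x) for x in stack_context(features, radius1)]
    # Second level: window of intermediate posteriors -> final posteriors.
    final = [mlp2(x) for x in stack_context(intermediate, radius2)]
    return intermediate, final

def toy_classifier(n_classes):
    """Hypothetical stand-in for a trained MLP with a softmax output."""
    def classify(x):
        scores = [sum(x[c::n_classes]) for c in range(n_classes)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]
    return classify

# Toy input: 6 frames of 2 features each, 3 phoneme classes.
features = [[0.1 * t, 1.0 - 0.1 * t] for t in range(6)]
intermediate, final = hierarchical_posteriors(
    features, toy_classifier(3), toy_classifier(3), radius1=2, radius2=1)
print(len(final), len(final[0]))
```

Note that the second level is evaluated once for every frame of the utterance, just like the first level. The final posteriors would then be passed to a phonetic decoder.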
The hierarchical approach described by Pinto significantly increases system accuracy compared to a non-hierarchical (single-layer) scheme. However, computational time is greatly increased because the second MLP has to process the same number of speech frames as were processed by the first MLP. In addition, the second MLP has an input window with a large number of consecutive frames, so there is a large number of parameters that must be processed. These factors make it less practical to implement such a hierarchical approach in a real-time application or in an embedded system.
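A rough calculation illustrates why the second MLP dominates the cost. All sizes below are hypothetical values chosen only for illustration. They are not figures reported by Pinto, and the sketch assumes each MLP has a single hidden layer.

```python
def mlp_parameters(n_inputs, n_hidden, n_outputs):
    """Weights plus biases of an MLP with one hidden layer."""
    return (n_inputs * n_hidden + n_hidden) + (n_hidden * n_outputs + n_outputs)

# Hypothetical sizes, for illustration only.
N_CEPSTRAL = 39          # cepstral features per frame
N_PHONEMES = 40          # posterior values per frame
N_HIDDEN = 1000          # hidden units in each MLP
WINDOW_1 = 9             # frames in the first MLP's input window
WINDOW_2 = 23            # frames in the second MLP's input window
FRAMES_PER_SECOND = 100  # frame rate of the feature extraction

first = mlp_parameters(WINDOW_1 * N_CEPSTRAL, N_HIDDEN, N_PHONEMES)
second = mlp_parameters(WINDOW_2 * N_PHONEMES, N_HIDDEN, N_PHONEMES)

# Both MLPs run on every frame, so the cost per second of speech scales
# with the total parameter count (roughly one multiply-add per parameter).
ops_per_second = FRAMES_PER_SECOND * (first + second)

print(first, second, round(second / first, 2), ops_per_second)
```

Under these assumed sizes the second MLP has more than twice as many parameters as the first, and because it is evaluated at the same frame rate, it accounts for most of the processing per second of speech.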