Phoneme recognition is used in automatic speech recognition (see e.g. R. Gruhn et al., “A statistical lexicon for non-native speech recognition”, in Proc. of Interspeech, 2004, pp. 1497-1500), speaker recognition (E. F. M. F. Badran and H. Selim, “Speaker recognition using artificial neural networks based on vowel phonemes”, in Proc. WCCC-ICSP, 2000, vol. 2, pp. 796-802) and language identification (M. A. Zissman, “Language identification using phoneme recognition and phonotactic language modelling”, in Proc. ICASSP, 1995, vol. 5, pp. 3503-3506).
A common approach to phoneme recognition combines artificial neural networks (ANN) and Hidden Markov Models (HMM) (e.g. H. Bourlard and N. Morgan, “Connectionist Speech Recognition: A Hybrid Approach”, Kluwer Academic Publishers, 1994). The artificial neural networks can be trained to discriminatively classify phonemes.
Context information plays an important role in improving the performance of phoneme recognition, as the characteristics of a phoneme can be spread over a long temporal context (e.g. H. H. Yang et al., “Relevancy of time-frequency features for phonetic classification measured by mutual information”, in Proc. ICASSP, 1999, pp. 225-228).
The context is conventionally increased by concatenating short-term features. However, the amount of temporal information that can be given to an artificial neural network is limited by the quantity of training data. In particular, when the context is extended, more information is given to the artificial neural network, which then requires more training data to determine its parameters robustly.
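The concatenation of short-term features described above can be illustrated with a minimal sketch. The function name `stack_context`, the edge-padding choice at utterance boundaries, and the example dimensions (39 features per frame, 4 context frames on each side) are assumptions made for illustration only; they are not taken from the cited works.

```python
import numpy as np

def stack_context(features, left, right):
    """Concatenate each frame with its `left` preceding and `right`
    following frames.

    features: array of shape (T, D), one D-dimensional short-term
              feature vector per frame.
    Returns an array of shape (T, (left + right + 1) * D).
    Boundary frames are padded by repeating the first/last frame
    (an assumption; other padding schemes are possible).
    """
    T, _ = features.shape
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # Window i holds, for every frame t, the frame at offset (i - left).
    windows = [padded[i:i + T] for i in range(left + right + 1)]
    return np.concatenate(windows, axis=1)

# Hypothetical example: 100 frames of 39-dimensional features.
frames = np.random.rand(100, 39)
stacked = stack_context(frames, left=4, right=4)
print(stacked.shape)
```

With these assumed values the input dimension of the network grows from 39 to 9 × 39 = 351, which indicates why a wider context increases the number of parameters to be estimated from the training data.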
One method to overcome this problem comprises splitting the context in time and dividing the classification task among several artificial neural networks, whose outputs are then combined. In J. Pinto et al., “Exploiting contextual information for improved phoneme recognition”, in Proc. ICASSP, 2008, pp. 4449-4452, for example, a phoneme is equally divided into three slides modelling states. A separate artificial neural network is trained for each slide, employing the label given in the central frame of the corresponding slide. However, this approach does not perform significantly better than a single artificial neural network trained to discriminatively classify the states of all phonemes.
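As a rough illustration of the equal division described above, the following sketch splits a phoneme segment, given as a frame range, into three parts of near-equal length and returns the central frame of each part. This is only one possible reading of the scheme; the function name `split_into_three`, the half-open frame indexing, and the rounding rules are assumptions and do not reproduce the implementation of the cited paper.

```python
def split_into_three(start, end):
    """Split the frame range [start, end) into three near-equal parts.

    Returns a list of (part_start, part_end, central_frame) tuples,
    with part_end exclusive. Assumes the segment has at least three
    frames; shorter segments would yield empty parts.
    """
    length = end - start
    bounds = [start + (length * k) // 3 for k in range(4)]
    return [(s, e, (s + e - 1) // 2)
            for s, e in zip(bounds[:-1], bounds[1:])]

# Hypothetical phoneme segment covering frames 10 to 18 inclusive.
for part_start, part_end, center in split_into_three(10, 19):
    print(part_start, part_end, center)
```

Under this reading, each of the three parts would supply the training label for its own artificial neural network, taken at the central frame of that part.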