Modern speech recognition systems are based on principles of statistical pattern recognition and typically employ acoustic models and language models to decode an input sequence of observations (also referred to as acoustic events or acoustic signals) representing input speech (e.g., a sentence or string of words) in order to determine the most probable sentence or word sequence given that sequence of observations. In other words, the function of a modern speech recognizer is to search through a vast space of candidate sentences and to choose the sentence or word sequence that has the highest probability of generating the input sequence of observations or acoustic events. In general, most modern speech recognition systems employ acoustic models that are based on continuous density hidden Markov models (CDHMMs).
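The search described above can be illustrated with a minimal sketch. It assumes that each candidate word sequence already has an acoustic log-probability and a language-model log-probability; the function name `decode` and the toy scores are hypothetical and are not taken from any particular recognizer. A practical decoder would search a pruned network rather than enumerate candidates.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate word sequence with the highest combined log score.

    The score is log P(O | W) + log P(W), i.e. the acoustic model score plus
    the language model score for the word sequence W.
    """
    best, best_score = None, -math.inf
    for words in candidates:
        score = acoustic_logprob(words) + lm_logprob(words)
        if score > best_score:
            best, best_score = words, score
    return best, best_score

# Toy scores, invented for illustration only (hypothetical values).
ACOUSTIC = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
}
LM = {
    ("recognize", "speech"): -4.0,
    ("wreck", "a", "nice", "beach"): -9.0,
}

best, score = decode(list(ACOUSTIC), ACOUSTIC.get, LM.get)
```

In this toy case the second candidate has the slightly better acoustic score, but the language model score makes the first candidate the overall winner.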
Most state-of-the-art HMM-based speech recognition systems employ the hierarchical structure shown in FIG. 1 to model events at different levels. Because speech is statistically stationary over a sufficiently short period of time (between 5 and 100 msec), windows of input speech are encoded as feature vectors at the acoustic level. At the phonetic level, segments of acoustic features associated with the same phonetic unit (e.g., a phoneme) are modeled by a hidden Markov model (HMM). At the word level, a lattice is constructed for each word by concatenating the phonetic HMMs according to the word's pronunciation in a dictionary. At the sentence level, a search network of word nodes is dynamically built and pruned according to the currently active paths and an N-gram language model. With this bottom-up structure, knowledge about acoustics, phonetics, words and syntax can be built into a recognition system to improve its performance.
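The two lowest levels of this hierarchy can be sketched as follows. The 25 msec window and 10 msec shift are assumed values chosen from within the 5 to 100 msec range stated above, and the lexicon entry, the phone symbols and the three-states-per-phone topology are hypothetical. The sketch only lists state labels; it does not train or evaluate any HMM.

```python
def frame_signal(samples, sample_rate, window_ms=25.0, shift_ms=10.0):
    """Cut a sample sequence into overlapping short-time windows.

    window_ms and shift_ms are assumptions for illustration.
    """
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * shift_ms / 1000)
    return [samples[i:i + win] for i in range(0, len(samples) - win + 1, hop)]

# Hypothetical pronunciation dictionary entry.
LEXICON = {"speech": ["s", "p", "iy", "ch"]}

def word_state_sequence(word, lexicon, states_per_phone=3):
    """Concatenate per-phone HMM state labels according to the dictionary."""
    return [f"{phone}_{s}"
            for phone in lexicon[word]
            for s in range(states_per_phone)]

frames = frame_signal([0.0] * 16000, 16000)   # one second at 16 kHz
states = word_state_sequence("speech", LEXICON)
```

Each window in `frames` would then be converted to a feature vector, and the state sequence in `states` stands in for the word-level model obtained by concatenating the phonetic HMMs.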
Chinese speech recognition systems are generally based upon the same bottom-up structure as that used for English and other languages. To attain a high level of recognition accuracy and system performance, certain characteristics of spoken Chinese (e.g., Mandarin, Cantonese, etc.) must be considered and utilized in the design of Chinese continuous speech recognition systems. Chinese is a tonal syllabic language in which each syllable is assigned a tone. For example, each syllable in Mandarin Chinese may be assigned one of the following tones: a high and level tone (also referred to as the first tone herein), a rising tone (also referred to as the second tone herein), a low and up tone (also referred to as the third tone herein), a falling tone (also referred to as the fourth tone herein), and a neutral or light tone (also referred to as the fifth tone herein). Certain syllables do not have the fifth tone, which is why the number of tones is given as four or five. Tonality plays a significant role in distinguishing meaning in the Chinese language. Syllables having the same phonetic structure but different tones usually convey different meanings. Therefore, tone is an essential part of Chinese speech recognition.
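As a concrete illustration, the Mandarin syllable "ma" means "mother" with the first tone, "hemp" with the second, "horse" with the third and "scold" with the fourth. The sketch below splits a toned syllable written in numbered pinyin into its initial, final and tone. The numbered-pinyin input convention and the function name `split_toned_syllable` are assumptions made for illustration, and the treatment of syllables without a listed initial as zero-initial is a simplification.

```python
# Mandarin initials; two-letter initials come first so "zh" wins over "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

def split_toned_syllable(syllable):
    """Split numbered pinyin such as 'zhong1' into (initial, final, tone)."""
    base, tone = syllable[:-1], int(syllable[-1])
    if not 1 <= tone <= 5:
        raise ValueError(f"tone must be 1..5, got {tone}")
    for initial in INITIALS:
        if base.startswith(initial):
            return initial, base[len(initial):], tone
    return "", base, tone  # no listed initial: treated as zero-initial

# Same base syllable, different tone, different meaning.
MEANINGS = {"ma1": "mother", "ma2": "hemp", "ma3": "horse", "ma4": "scold"}

parsed = {s: split_toned_syllable(s) for s in MEANINGS}
```

All four entries share the initial "m" and the final "a" and differ only in the tone field, which is the information a recognizer without tone modeling cannot use.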
Tone recognition has been a focal point of Chinese speech recognition for decades. One commonly used method is to recognize the base syllables (initials and finals) and the tones separately. The base syllables are recognized by a conventional HMM-based method, for example one used for English. The tone of a syllable is recognized by classifying the pitch contour of that syllable using discriminative rules. The recognition of a toned syllable is then a combination of the recognition of the base syllable and the recognition of its tone. This method, while feasible for isolated-syllable speech recognition, is not applicable to Chinese continuous speech recognition for several reasons. First, in continuous speech recognition, the boundaries of the syllables are not well defined. The boundaries are determined only at the end of the entire recognition process, so it is very difficult to provide syllable boundary information in the early stages of acoustic recognition. Second, the actual tone contour of a syllable with one of the five tones depends on the phonetic context. Rules for determining tones from the pitch contours, if they can be formulated at all, would be very complicated.
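The rule-based approach for an isolated syllable can be sketched as below. The rules and the `flat_ratio` threshold are hypothetical and deliberately crude; they are not taken from any cited system. The sketch assumes that the pitch contour passed in covers exactly one syllable, which is the boundary information that is unavailable early in continuous recognition, and it ignores the contextual variation of tone contours noted above.

```python
def classify_tone(f0_hz, flat_ratio=0.08):
    """Toy rule-based tone classifier for one isolated syllable.

    f0_hz is the pitch contour (Hz) of a single syllable. Returns a tone
    number 1..4, or None when no rule fires. flat_ratio is an assumed
    tolerance expressed as a fraction of the mean pitch.
    """
    if len(f0_hz) < 3:
        raise ValueError("need at least three pitch values")
    start, end, low = f0_hz[0], f0_hz[-1], min(f0_hz)
    tol = flat_ratio * (sum(f0_hz) / len(f0_hz))
    if max(f0_hz) - low <= tol:
        return 1                      # high and level
    if low < start - tol and low < end - tol:
        return 3                      # dips, then comes back up
    if end - start > tol:
        return 2                      # rising
    if start - end > tol:
        return 4                      # falling
    return None                       # contour matches none of the rules

tones = [
    classify_tone([220, 221, 219, 220]),
    classify_tone([180, 190, 205, 225]),
    classify_tone([200, 170, 160, 175, 205]),
    classify_tone([240, 215, 185, 160]),
]
```

Even this small rule set depends on a clean, correctly segmented contour, which illustrates why the approach does not carry over to continuous speech.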
In recent years, various efforts have been directed at integrating tone into Chinese continuous speech recognition systems. These systems have achieved performance improvements by treating pitch as one of the acoustic parameters, in the same way as cepstra or energy. However, viewed as a whole, these systems use tone only at the acoustic level. In other words, tone knowledge at the other levels of the speech recognition process has not been considered.
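Treating pitch as one more acoustic parameter amounts to extending each frame's feature vector, as in the following minimal sketch. The function name `append_pitch`, the use of log pitch and the feature ordering are assumptions for illustration only.

```python
def append_pitch(cepstra, energy, log_f0):
    """Build per-frame feature vectors [cepstra..., energy, log_f0].

    cepstra is a list of per-frame coefficient lists; energy and log_f0
    hold one value per frame.
    """
    if not (len(cepstra) == len(energy) == len(log_f0)):
        raise ValueError("all streams must have the same number of frames")
    return [list(c) + [e, p] for c, e, p in zip(cepstra, energy, log_f0)]

features = append_pitch(
    [[1.0, 2.0], [3.0, 4.0]],   # two frames of toy cepstra
    [0.5, 0.6],                 # toy energy values
    [5.0, 5.1],                 # toy log-pitch values
)
```

Nothing above the acoustic level changes in this arrangement: the phonetic, word and sentence levels still operate without any explicit knowledge of tone.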