1. Field of the Invention.
The present invention relates generally to speech recognition systems. In particular, the present invention relates to a system and method for recognizing continuous Mandarin Chinese speech. The present invention further relates to a system and method for recognizing continuous speech of a tonal language using an integrated tone classifier.
2. Description of the Background Art.
Speech recognition systems for Mandarin Chinese and other tonal languages encounter unique problems not encountered by speech recognition systems for non-tonal languages such as Romance or Germanic languages. Mandarin Chinese is a tonal syllabic language where each syllable is assigned a tone. In Mandarin Chinese, there are 4 lexical tones (high and level, rising, falling-rising, and falling) and 1 neutral tone. Tone is characterized by the fundamental frequency contour, or pitch contour, of the audio signal. The pitch is equivalent to the fundamental frequency, and the pitch contour is equivalent to the fundamental frequency contour. Exemplary wave forms for electrical signals representing the tones of Mandarin Chinese are shown in FIGS. 1A, 1B, 1C, 1D, and 1E. The tone and syllable together define the meaning of the syllable. Syllables with the same phonetic structure, but different tones, usually have significantly different meanings. Thus, to recognize accurately an audio signal of Mandarin Chinese speech, a speech recognition system must recognize both the syllable and the tone of the syllable.
There are many prior art systems, similar to systems for non-tonal languages, that effectively analyze and identify isolated syllables of Mandarin Chinese speech. These systems have been quite successful in accurately recognizing isolated syllables. Such prior art systems generally first identify the syllable and second perform a tonal analysis. The systems then combine the results of the two steps to recognize the input.
Prior art systems that recognize continuous Mandarin Chinese speech have not been nearly as successful as systems that recognize isolated syllables. Continuous speech recognition systems must recognize both the syllable and tone of each of a plurality of syllables strung together in a continuous input. Existing continuous speech recognition systems first divide or segment an input into a sequence of fixed segments with hypothetical time alignment. The step of segmenting the input is particularly critical since errors in segmentation will propagate through and affect the recognition of both syllable and tone. There are, however, no segmentation techniques that correctly segment continuous Mandarin Chinese speech such that this approach yields enough accuracy to be satisfactory. Once these prior art systems have segmented the input, they generally use an isolated syllable recognizer and separate tone recognizer to identify each tonal syllable based on the hypothetical segments. This analysis is obviously dependent upon the segmentation step. These systems have an additional problem in that they use short-term tonal analysis which does not provide sufficient frequency resolution to identify correctly the behavior of the pitch (the fundamental frequency) contour. Moreover, the tone of a syllable may move through 3 octaves or more from one syllable to the next. In order to overcome the deficiencies of short-term tonal analysis and the difficult behavior of the tone, long-term tonal analysis is needed to model accurately the pitch contour. Long-term tonal analysis, however, is very sensitive to segmentation error. Furthermore, long-term tonal analysis is also very time consuming. Time consumption is particularly important when a speech recognition system is being used for real-time speech applications.
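The frequency-resolution limitation of short-term analysis noted above can be illustrated with a short calculation. The sketch below assumes, for illustration only, that the analysis is DFT-based, in which case adjacent frequency bins are spaced at the reciprocal of the window duration. The function name `dft_bin_spacing_hz` and the 20 ms and 200 ms window durations are hypothetical choices, not values taken from any prior art system described here.

```python
def dft_bin_spacing_hz(window_seconds: float) -> float:
    """Return the spacing in Hz between adjacent DFT bins.

    For N samples taken at sampling rate fs, the bin spacing is fs / N,
    which equals 1 / (window duration in seconds).
    """
    if window_seconds <= 0:
        raise ValueError("window duration must be positive")
    return 1.0 / window_seconds


# Hypothetical window durations, chosen only to show the trend.
windows = [
    ("short-term frame", 0.020),
    ("long-term window", 0.200),
]

for label, duration in windows:
    spacing = dft_bin_spacing_hz(duration)
    print(f"{label}: {duration * 1000:.0f} ms -> {spacing:.1f} Hz bin spacing")
```

Under these assumptions the longer window resolves frequency ten times more finely, which is consistent with the need for long-term analysis to follow the pitch contour. The longer window also spans more of the input, so a misplaced segment boundary affects more of the analyzed signal.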
In an attempt to reduce the effects of segmentation error on the ultimate recognition results, prior art continuous speech recognition systems provide multiple possible identifications of an input. Such systems determine multiple candidates for an input utterance and generate output signals of the N-best candidates or recognitions. A recognition is conventionally referred to as a theory. These systems often generate an initial confidence score with each possible recognition. Each initial confidence score is an indication of how accurately the theory matches the input. Generally, the recognition with the highest initial confidence score is accepted as the correct recognition. These prior art speech recognition systems have not utilized long-term tonal analysis because of the computational expense. Long-term tonal analysis is even more time consuming when it must be performed for each of the N-best theories.
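The N-best selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular prior art system: the `Theory` class, the functions `n_best` and `accept_best`, the pinyin-with-tone-number labels, and the confidence values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Theory:
    """One candidate recognition of an input utterance (hypothetical type)."""
    syllables: Tuple[str, ...]   # tonal syllables, e.g. pinyin with a tone number
    confidence: float            # initial confidence score; higher is better


def n_best(theories: List[Theory], n: int) -> List[Theory]:
    """Return the n theories with the highest initial confidence scores."""
    return sorted(theories, key=lambda t: t.confidence, reverse=True)[:n]


def accept_best(theories: List[Theory]) -> Theory:
    """Accept the theory with the highest initial confidence score."""
    return max(theories, key=lambda t: t.confidence)


# Illustrative candidates with made-up scores.
candidates = [
    Theory(("ma1", "ma5"), 0.62),
    Theory(("ma3", "ma5"), 0.58),
    Theory(("ma4", "ma5"), 0.21),
    Theory(("na1", "ma5"), 0.09),
]

top_two = n_best(candidates, 2)
chosen = accept_best(candidates)
print([t.syllables for t in top_two])
print(chosen.syllables)
```

In this sketch the two leading candidates differ only in the tone of the first syllable, which is the kind of ambiguity that a tonal analysis applied to each of the N-best theories would have to resolve, at a cost that grows with N.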
Continuous speech recognition systems that produce multiple possible recognitions still encounter the problem of incorporating long-term tonal analysis into an N-best recognizer. Thus, there continues to be a need for performing long-term tonal analysis, with minimal degradation due to segmentation error, for continuous speech of a tonal language.