Tone and intonation play an important role in speech recognition and natural language understanding. In many languages, intonation, i.e., the variation of pitch while speaking, can be used for emphasis, to pose a question, or to convey surprise. For example, in standard American English, rising pitch at the end of a phrase often indicates that the speaker is asking a question rather than making a declaration (“He bought a car?” vs. “He bought a car”). Unlike English and some Western languages, tone languages such as Chinese use pitch to distinguish words.
In tone languages, syllables or words which have the exact same sequence of phonemes often map to different lexical entries when they have different tone patterns (i.e., pitch contours). For example, the words “mother” (mā), “hemp” (má), “horse” (mǎ), and “curse” (mà) are all pronounced “ma” in Mandarin, but each of them has a different tone pattern.
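By way of illustration only, the mapping described above can be sketched as a lookup keyed on both the phoneme sequence and the tone. The names `TONE_LEXICON` and `lookup` are hypothetical, and the numbering of the four tones as 1 through 4 is an assumption made for the sketch; only the four example words come from the text.

```python
from typing import Optional

# Toy lexicon keyed by (toneless syllable, tone number).
# The tone numbers 1-4 are an assumed labeling for this sketch.
TONE_LEXICON = {
    ("ma", 1): "mother",  # mā
    ("ma", 2): "hemp",    # má
    ("ma", 3): "horse",   # mǎ
    ("ma", 4): "curse",   # mà
}


def lookup(syllable: str, tone: int) -> Optional[str]:
    """Return the lexical entry for a syllable/tone pair, or None if unknown."""
    return TONE_LEXICON.get((syllable, tone))


# The phoneme sequence alone is ambiguous; adding the tone resolves it.
candidates = [word for (syl, _), word in TONE_LEXICON.items() if syl == "ma"]
```

Here `candidates` holds all four words, since the syllable alone cannot select among them, whereas `lookup("ma", 3)` selects a single entry.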
For these reasons, a detailed description of tone and intonation would be beneficial for many spoken language processing systems; e.g., to disambiguate words in automatic speech recognition, to detect different speech acts in dialog systems, or to generate more natural-sounding speech in speech synthesis systems. Hence, the focus here is on recognition of lexical tones in Mandarin and of fine-grained intonation types, namely pitch accents and boundary tones, in English.
In the literature, the detection of pitch accents and boundary tones in English has been explored extensively; however, the classification of pitch accent and boundary tone types was not investigated much until recent years. Earlier prior art spoken language processing techniques combined the pitch accent detection and classification tasks by creating a four-way classification problem with unaccented, high, low, and downstepped accent categories. More recent prior art techniques have focused solely on the classification of pitch accent and boundary tone categories, without addressing detection.
In contrast to fine-grained intonation classification in English, lexical tone recognition in Mandarin has attracted the attention of researchers for many years, and the approaches can be grouped under two major categories, namely embedded and explicit tone modeling. In embedded tone modeling, tone-related features are appended to the spectral features at each frame and tones are recognized as part of the existing system, whereas in explicit tone modeling, tones are modeled and recognized independently, usually using supra-segmental features.
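By way of illustration only, the difference between the two categories can be sketched in terms of how the input features are assembled. The function names, the use of a single pitch value per frame, the particular summary statistics (mean pitch, slope, and duration), and the 10 ms frame shift are all assumptions made for the sketch and are not taken from any particular prior art system.

```python
from typing import List, Sequence


def embedded_features(
    spectral: Sequence[Sequence[float]], pitch: Sequence[float]
) -> List[List[float]]:
    """Embedded modeling: append the pitch value to each spectral frame vector."""
    if len(spectral) != len(pitch):
        raise ValueError("spectral and pitch tracks must have the same frame count")
    return [list(frame) + [f0] for frame, f0 in zip(spectral, pitch)]


def explicit_features(
    pitch: Sequence[float], start: int, end: int, frame_shift_s: float = 0.01
) -> List[float]:
    """Explicit modeling: summarize pitch over one syllable span [start, end)."""
    segment = list(pitch[start:end])
    if not segment:
        raise ValueError("empty syllable span")
    mean_f0 = sum(segment) / len(segment)
    # Average per-frame change in pitch between the first and last frame.
    slope = (segment[-1] - segment[0]) / max(len(segment) - 1, 1)
    duration_s = len(segment) * frame_shift_s
    return [mean_f0, slope, duration_s]


# Three frames of two-dimensional spectral features with a rising pitch track.
spectral = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pitch = [100.0, 110.0, 120.0]
per_frame = embedded_features(spectral, pitch)    # one augmented vector per frame
per_syllable = explicit_features(pitch, 0, 3)     # one vector for the whole span
```

In the embedded case, the existing frame-level recognizer consumes `per_frame` directly, whereas in the explicit case a separate tone model would consume `per_syllable`.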
In traditional methods, including the aforementioned work, prosodic features, such as pitch, duration, and energy features, were used for tone and intonation modeling. However, such traditional methods have not provided adequate tone recognition performance.
It is within this context that embodiments of the present invention arise.