The present invention relates generally to the field of speech processing systems such as speech recognizers and text-to-speech converters. More specifically, the present invention relates to the design of modeling units, or modeling unit sets, used in such systems.
Selecting the most suitable units, i.e., modeling units, to represent salient acoustic and phonetic information for a language is an important issue in designing a workable speech processing system such as a speech recognizer or text-to-speech converter. Important criteria for selecting appropriate modeling units include how accurately the modeling units can represent words, particularly in different word contexts; how trainable the resulting model is, i.e., whether the parameters of the units can be estimated reliably given enough data; and how generalizable the resulting model is, i.e., whether new words can be easily derived from the predefined unit inventory.
In addition to the general criteria above, there are several levels of units to consider: phones, syllables and words. Their performance in terms of the above criteria differs considerably. Word-based units can be a good choice for domain-specific applications, such as a speech recognizer designed for digits. However, for large vocabulary continuous speech recognition (LVCSR), phone-based units are better since they are more trainable and generalizable.
Many speech processing systems now use context-dependent phones, such as tri-phones, in conjunction with a state-sharing technique, e.g., Hidden Markov Modeling. The resulting systems have yielded good performance, particularly for western languages such as English. This is due in part to the smaller phone set of the western languages (e.g., English comprises only about 50 phones). When modeled as context-dependent tri-phones, such a phone set would theoretically entail 50^3 (i.e., 125,000) different tri-phones; in practice, such systems use fewer and are considered both trainable and generalizable.
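The theoretical tri-phone count follows from simple arithmetic: each phone is conditioned on one left and one right neighbor, so the inventory size is raised to the third power. The following minimal sketch illustrates this under the assumption of a 50-phone inventory, as stated above; the function name and its `context_width` parameter are hypothetical and chosen only for illustration.

```python
def count_context_dependent_units(inventory_size: int, context_width: int = 1) -> int:
    """Upper bound on the number of context-dependent units.

    Each base unit is conditioned on `context_width` neighbors on each
    side, so a unit plus its context spans 2 * context_width + 1 positions.
    """
    return inventory_size ** (2 * context_width + 1)


# Assumption from the text: English has roughly 50 phones.
n_phones = 50
n_triphones = count_context_dependent_units(n_phones)
print(n_triphones)  # 125000
```

This figure is only an upper bound; many of these combinations never occur in the language, and state sharing reduces the number of distinct parameters further.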
Although phone-based systems, such as systems based on Hidden Markov Modeling of tri-phones, have been shown to work well with western languages like English, speech processing systems for tonal languages like Chinese have generally used syllables as the basis of the modeled unit. Compared with most western languages, a tonal language such as Mandarin Chinese has several distinctive characteristics. First, the number of words is unlimited, while the numbers of characters and syllables are fixed. Specifically, one Chinese character corresponds to one syllable. In total, there are about 420 base syllables and more than 1200 tonal syllables.
Since Chinese is a tonal language, each syllable usually has five tone types, from tone 1 to tone 5, e.g., {/ma1/ /ma2/ /ma3/ /ma4/ /ma5/}. Among the five tones, the first four are normal tones, which have the shapes High Level, Rising, Low Level and Falling. The fifth tone is a neutralization of the other four. Although the phones are the same, the actual acoustic realizations differ because of the different tone types.
In addition to the one-to-one mapping between character and syllable, a defined structure exists inside the syllable. Specifically, each base syllable can be represented in the following form:
(C) + (G) V (V, N)
According to Chinese phonology, the part before “+” is called the initial, which mainly consists of consonants. There are 22 initials in Chinese, one of which is a zero initial, representing the case in which the initial is absent. The part after “+” is called the final. There are about 38 finals in Mandarin Chinese. Here (G), V and (V, N) are called the head (glide), body (main vowel) and tail (coda) of the final, respectively. Units in brackets are optional in constructing valid syllables.
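The initial/final structure described above can be illustrated with a short sketch that splits a syllable into its initial, final and tone. This is only an illustration under several assumptions not found in the text: syllables are written in pinyin with a trailing tone digit, the 21 consonant initials are given in their pinyin spelling, and syllables written with a leading "y" or "w" are treated as zero-initial syllables with the final left in its written form. The function and constant names are hypothetical, and no attempt is made to split the final further into head, body and tail.

```python
# The 21 consonant initials in pinyin spelling; the 22nd is the zero initial.
# Two-letter initials come first so that "zh" is matched before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")
ZERO_INITIAL = ""


def split_syllable(tonal_syllable: str):
    """Split a pinyin syllable such as 'zhang1' into (initial, final, tone).

    The tone is None when the input carries no trailing tone digit.
    """
    tone = None
    base = tonal_syllable
    if base and base[-1].isdigit():
        tone = int(base[-1])
        base = base[:-1]
    for initial in INITIALS:
        if base.startswith(initial):
            final = base[len(initial):]
            if final:  # a valid syllable always has a final
                return initial, final, tone
    # No consonant initial matched: zero-initial syllable.
    return ZERO_INITIAL, base, tone


for syllable in ("ma3", "zhang1", "an4"):
    print(syllable, split_syllable(syllable))
```

For example, "zhang1" yields the initial "zh", the final "ang" and tone 1, while "an4" yields the zero initial, the final "an" and tone 4.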
As mentioned above, syllables have generally formed the basis of the modeled unit in a tonal language such as Mandarin Chinese. Such a system has generally not been used for western languages because thousands of possible syllables exist. For Mandarin Chinese, by contrast, such a representation is very accurate and the number of units is acceptable. However, the number of context-dependent tri-syllables is very large, and tonal syllables make the situation even worse. Therefore, most current modeling strategies for Mandarin Chinese are based on decomposition of the syllable. In these strategies, syllables are usually decomposed into initial and final parts, while tone information is modeled separately or together with the final parts. Nevertheless, shortcomings still exist with these systems, and an improved modeling unit set is desired.
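To make the scale of the problem concrete, the sketch below compares rough inventory sizes using only the approximate figures given above (about 420 base syllables, about 1200 tonal syllables, 22 initials, about 38 finals and five tones). The tri-syllable counts and the tonal-final count are theoretical upper bounds assumed for illustration, since not every combination occurs in the language; the variable names are hypothetical.

```python
# Approximate figures taken from the text above.
BASE_SYLLABLES = 420
TONAL_SYLLABLES = 1200
INITIALS = 22   # including the zero initial
FINALS = 38
TONES = 5

inventories = {
    # Context-dependent syllables: one left and one right neighbor.
    "base tri-syllables": BASE_SYLLABLES ** 3,
    "tonal tri-syllables": TONAL_SYLLABLES ** 3,
    # Decomposed units, with tone modeled separately from the final.
    "initials + finals": INITIALS + FINALS,
    # Decomposed units, with tone attached to the final (upper bound).
    "initials + tonal finals": INITIALS + FINALS * TONES,
}

for name, size in inventories.items():
    print(f"{name}: {size:,}")
```

Under these assumptions, base tri-syllables number about 74 million and tonal tri-syllables about 1.7 billion, whereas the decomposed inventories contain only 60 and at most 212 context-independent units, respectively.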