1. Field of the Invention
The present invention pertains to pattern recognition. More particularly, this invention relates to tone-sensitive acoustic modeling for speech recognition.
2. Background
In acoustic modeling, Markov models (MMs) are often used. When a MM system is built, each unit (e.g., word, syllable, phrase, etc.) in the recognizable vocabulary is defined as a sequence of sounds, or fragments of speech, that resembles the pronunciation of the unit. A MM is created for each fragment of speech. The MMs for the sounds are then concatenated to form a sequence of MMs that depicts an acoustical definition of the unit in the vocabulary. For example, in FIG. 1A a phonetic word 100 for the word "CAT" is shown as a sequence of three phonetic Markov models, 101-103. One of the phonetic Markov models represents the phoneme "K" (101), having two transition arcs 101A and 101B. A second of the phonetic Markov models represents the phoneme "AH" (102), having transition arcs 102A and 102B. The third of the phonetic Markov models represents the phoneme "T" (103), having transition arcs 103A and 103B.
Each of the three Markov models shown in FIG. 1A has a beginning state and an ending state. The "K" model 101 begins in state 104 and ends in state 105. The "AH" model 102 begins in state 105 and ends in state 106. The "T" model 103 begins in state 106 and ends in state 107. During recognition, an utterance is compared with the sequence of phonetic Markov models, starting from the leftmost state, such as state 104, and progressing according to the arrows through the intermediate states to the rightmost state, such as state 107, where the model 100 terminates in a manner well-known in the art. The transition time from the leftmost state 104 to the rightmost state 107 reflects the duration of the word. Therefore, to transition from the leftmost state 104 to the rightmost state 107, time must be spent in the "K" state, the "AH" state, and the "T" state to result in a conclusion that the utterance is the word "CAT". Thus, a MM for a word is comprised of a sequence of models corresponding to the different sounds made during the pronunciation of the word.
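The concatenation described above can be illustrated with a minimal Python sketch. It is not taken from the figures: the names PhoneModel, concatenate and traverses are hypothetical, the per-frame phoneme labels are assumed to be given, and probabilities are omitted entirely so that only the left-to-right structure of the word model is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneModel:
    """One phonetic Markov model with a beginning and an ending state."""
    label: str
    start: int
    end: int


def concatenate(labels, first_state=104):
    """Chain phoneme models so each one begins where the previous one ends."""
    models = []
    state = first_state
    for label in labels:
        models.append(PhoneModel(label, state, state + 1))
        state += 1
    return models


def traverses(models, frame_labels):
    """Return True if the frames pass through every model in order,
    spending at least one frame in each (assumed, simplified criterion)."""
    pos = 0
    for model in models:
        begin = pos
        while pos < len(frame_labels) and frame_labels[pos] == model.label:
            pos += 1
        if pos == begin:  # no time was spent in this model
            return False
    return pos == len(frame_labels)


cat = concatenate(["K", "AH", "T"])
print([(m.label, m.start, m.end) for m in cat])
print(traverses(cat, ["K", "K", "AH", "T"]))
print(traverses(cat, ["K", "T"]))
```

With the default first state, the three models span states 104 through 107, matching the numbering of FIG. 1A. The greedy matching in traverses is adequate only when adjacent phonemes carry different labels, which holds for this example.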
Construction of MMs for other units, such as syllables or phrases, is analogous to the above discussion. That is, a MM analogous to the model 100 could be generated for any desired unit, such as syllables or phrases.
Each of the three Markov models shown in FIG. 1A represents a phoneme ("K", "AH", and "T"). These phonemes are often made up of multiple states. For example, the phoneme "K" (101) may actually be comprised of three different states, as shown by phonetic model 109 in FIG. 1B. These three states, 114, 115 and 116, represent the phoneme states K.sub.1 (110), K.sub.2 (111) and K.sub.3 (112), respectively. Combined together, the phoneme states K.sub.1 (110), K.sub.2 (111) and K.sub.3 (112) represent the phoneme "K". Multiple arcs are shown connecting the three states 114-116, analogous to the arcs connecting phonemes 101-103 in FIG. 1A.
In order to build a Markov model such as those described in FIGS. 1A and 1B, a pronunciation dictionary is often used to indicate the component sounds. A wide variety of dictionaries exist and may be used. The source of information in these dictionaries is usually a phonetician. The component sounds attributed to a particular unit as depicted in the dictionary are based on the expertise and senses of the phonetician. Since phoneticians are human, pronunciations may differ from one phonetician to the next, or errors may exist in the dictionary. Furthermore, because phonetic models such as model 109 of FIG. 1B are based on the expertise of the phonetician, they may not actually represent the unit sought to be depicted. For example, a phonetician may believe the phoneme "K" should transition through the states shown in FIG. 1B. However, when an input utterance is aligned to the Markov models, it may be discovered that the actual utterance instead transitions through the states 124, 125 and 126, representing the fenones K.sub.1, K.sub.2, and dH.sub.3, as shown in FIG. 1C. Such an alignment is not possible using the phonetic model 109 of FIG. 1B because, from state 115, the model transitions to either state 115 or state 116. The option to transition to a state for dH.sub.3 does not exist.
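The alignment failure described above can be checked mechanically. The following Python sketch is illustrative only: the labels "K1", "K2", "K3" and "dH3" stand for K.sub.1, K.sub.2, K.sub.3 and dH.sub.3, the names PHONETIC_109 and can_align are hypothetical, and the arcs leaving the first and last states are assumed (the text states only the arcs leaving state 115).

```python
# Allowed successors for each sub-unit of phonetic model 109 (FIG. 1B).
# Only the "K2" row (state 115) is stated in the text; the other rows
# are assumptions made for this illustration.
PHONETIC_109 = {
    "K1": {"K1", "K2"},  # state 114 (assumed arcs)
    "K2": {"K2", "K3"},  # state 115: stay, or move on to state 116
    "K3": {"K3"},        # state 116 (assumed self-loop)
}


def can_align(transitions, start, labels):
    """Return True if the label sequence follows only permitted arcs."""
    if not labels or labels[0] != start:
        return False
    for prev, nxt in zip(labels, labels[1:]):
        if nxt not in transitions.get(prev, set()):
            return False
    return True


print(can_align(PHONETIC_109, "K1", ["K1", "K2", "K3"]))
print(can_align(PHONETIC_109, "K1", ["K1", "K2", "dH3"]))
```

The first sequence follows the path the phonetician anticipated. The second fails at the step from "K2" to "dH3", because no such arc exists in the assumed transition table for model 109.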
Fenonic models can be used to resolve this problem. Fenonic models are generally created by having a speaker read training data into the system. The training data is aligned to generate fenonic models based on the actual acoustic data obtained from the speaker, rather than on the expertise of a phonetician. This approach is often referred to as a data-driven approach. The use of fenonic models in a data-driven approach allows the system to build Markov models which actually represent the data being depicted, such as the fenonic model 120 shown in FIG. 1C.
Fenonic models can be used to generate larger phonetic models (or syllable models, or word models, etc.). This is done by combining the fenonic models. For example, the "K" phoneme (101) of FIG. 1A may be replaced by the fenonic model 120 shown in FIG. 1C.
One concern raised in acoustic modeling systems is the existence of certain languages which are tone-dependent, or tone-sensitive. Many languages, such as English, are tone-insensitive. That is, words in English generally have the same meaning regardless of the tone with which they are spoken.
However, other languages, such as Mandarin, are tone-sensitive. For example, the symbol "ma" in Mandarin has two distinct meanings dependent on whether it is spoken in a monotone or with a rising pitch.
Thus, it would be advantageous to provide a system which accurately models tone-sensitive languages. The present invention provides such a system.
One prior art method of modeling tone-sensitive languages utilizes a two-step process. In the first process, the method attempts to determine what syllable or word (or similar unit) was spoken. In the second process, the method attempts to determine the pitch of the syllable or word (or similar unit) that was spoken. The results of these two processes are then combined in an attempt to recognize what was spoken. Thus, two distinct recognition processes are utilized: the first attempts to recognize the syllable or word, while the second attempts to recognize the tone of that syllable or word.
This type of two-process method has several disadvantages. First, additional system time and resources are required due to the additional time involved in running two separate processes. In addition, multiple-process systems suffer in performance as they find a local optimum solution for each process and then combine them, as opposed to finding a unified global optimum result.
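The difference between combining two locally optimal results and finding a single global optimum can be seen in a small numerical illustration. The Python sketch below is not drawn from any prior art system: the joint scores are invented solely for this example, and the function names two_stage and joint_decode are hypothetical.

```python
# Invented joint scores for (syllable, tone) hypotheses; they sum to 1.0.
JOINT = {
    ("ma", "level"): 0.10, ("ma", "rising"): 0.30,
    ("na", "level"): 0.35, ("na", "rising"): 0.00,
    ("la", "level"): 0.25, ("la", "rising"): 0.00,
}


def marginal(scores, index):
    """Sum the joint scores over the other component of each hypothesis."""
    totals = {}
    for key, value in scores.items():
        totals[key[index]] = totals.get(key[index], 0.0) + value
    return totals


def two_stage(scores):
    """Pick the best syllable and the best tone separately, then combine."""
    syllables = marginal(scores, 0)
    tones = marginal(scores, 1)
    return max(syllables, key=syllables.get), max(tones, key=tones.get)


def joint_decode(scores):
    """Pick the single best (syllable, tone) pair."""
    return max(scores, key=scores.get)


print(two_stage(JOINT))
print(joint_decode(JOINT))
```

With these invented scores, the separate processes select "ma" and "level" individually, although that pair has a joint score of only 0.10, while the unified search selects the pair ("na", "level") with a joint score of 0.35.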
Another prior art method of modeling tone-sensitive languages utilizes the same system for tone-dependent languages as used for tone-independent languages. Proponents of this type of method often claim that a specialized tone-dependent system is not necessary, arguing that tone-independent systems recognize tone-sensitive languages efficiently without modification. However, such systems generally produce an unacceptably high error rate.
The present invention provides a solution to the problems of the prior art.