1. Technical Field
The present invention is directed to the field of speech recognition. More specifically, the invention provides a speaker-independent speech recognition system and method for tonal languages in which a spectral score is combined, sequentially, with a tonal score to arrive at a best prediction for a spoken syllable.
2. Description of the Related Art
Recently, there have been many advancements in speech recognition systems. Most of these systems, however, are developed for western languages such as English, which are non-tonal, as distinguished from many eastern languages such as Chinese, which are tonal. In a tonal language, the tone of the speech is related to its meaning, and therefore it is insufficient to simply analyze the spectral content of the spoken syllable(s), as can be done in analyzing non-tonal languages. A tonal language typically has four to nine tones. For example, Mandarin Chinese has four tones, which are classified as “high,” “rising,” “dip,” and “falling.” Explicit recognition of these tones is difficult, however, since different speakers have different speaking characteristics. In languages such as Chinese, tones are characterized by features such as the fundamental frequency (F0) values and corresponding contour shapes. These values and shapes are difficult to capture and properly analyze for speaker-independent recognition because the absolute value of F0 varies greatly between speakers. For example, the high tone of a low-pitch speaker can be the same as, or similar to, the low tone of a high-pitch speaker.
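The overlap described above can be illustrated with a minimal sketch. The F0 values, the speaker names, and the midpoint-threshold classifier below are all hypothetical and chosen only for illustration; they are not measurements and do not represent any particular prior system. The sketch merely shows why a decision rule based on absolute F0, tuned to one speaker, may mislabel the tones of another.

```python
# Hypothetical mean F0 values in Hz for two speakers (illustrative only).
# Note that the low-pitch speaker's "high" tone equals the
# high-pitch speaker's "low" tone (150 Hz).
SPEAKERS = {
    "low_pitch_speaker": {"high": 150.0, "low": 90.0},
    "high_pitch_speaker": {"high": 280.0, "low": 150.0},
}


def threshold_for(speaker):
    """Midpoint between one speaker's high-tone and low-tone F0.

    Stands in for a threshold obtained by "training" on that speaker.
    """
    tones = SPEAKERS[speaker]
    return (tones["high"] + tones["low"]) / 2.0


def classify_by_absolute_f0(f0_hz, threshold_hz):
    """Label a syllable 'high' or 'low' using only its absolute F0."""
    return "high" if f0_hz >= threshold_hz else "low"


def evaluate(train_speaker, test_speaker):
    """Return, per tone, whether the test speaker's tone is labeled correctly
    when the threshold was derived from the training speaker."""
    threshold = threshold_for(train_speaker)
    return {
        tone: classify_by_absolute_f0(f0, threshold) == tone
        for tone, f0 in SPEAKERS[test_speaker].items()
    }


if __name__ == "__main__":
    for train in SPEAKERS:
        for test in SPEAKERS:
            print(f"trained on {train}, tested on {test}: {evaluate(train, test)}")
```

Under these assumed values, the threshold works when trained and tested on the same speaker, but one tone is mislabeled whenever the training and test speakers differ.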
Several known speech recognition systems for tonal languages are described in CN 1122936, U.S. Pat. No. 5,787,230, CN 1107981, CN 1127898, U.S. Pat. No. 5,680,510, U.S. Pat. No. 5,220,639, WO 97/40491, WO 96/10248, and U.S. Pat. No. 5,694,520. Many of these systems, however, rely on the absolute value of the syllable's fundamental frequency (F0) in order to ascertain the proper tone, and thus fail to properly discriminate between speakers having differing tonal characteristics. These systems typically must be “trained” for a particular speaker prior to proper operation. In addition, each of these systems utilizes a parallel processing architecture that prohibits an integrated analysis of the spectral and tonal information, thus further limiting their usefulness in a speaker-independent application.