Languages such as Mandarin are tonal languages, in which each syllable is pronounced with one of a number of (e.g., five) tones. A tone is a pattern of voice pitch variation, and it often carries important discriminative information. In common ASR systems, however, the acoustic features usually do not capture tone, and the pitch information is discarded. This is a particular loss for speech recognition of tonal languages, especially in small-vocabulary tasks such as Chinese digit string recognition. Moreover, such ASR systems cannot distinguish between words that differ only in tone, i.e., words that are otherwise homophonic.
To improve the performance of ASR systems for tonal languages such as Mandarin, pitch features are extracted and combined with conventional acoustic features such as MFCC. Pitch extraction for ASR poses a special problem: in order to output a continuous feature stream, feature values must also be assigned to unvoiced frames (e.g., unvoiced consonants), which carry no pitch information at all. Common methods assign random values to the unvoiced frames as their pitch features. However, using random values directly causes abnormal likelihoods in decoding and consequently degrades recognition performance.
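The common approach described above can be illustrated with a minimal sketch. It is not taken from any particular ASR system: the function names, the convention that an F0 value of 0 marks an unvoiced frame, and the 50 to 500 Hz range for the random fill are all assumptions made for illustration.

```python
import random


def fill_unvoiced_with_random(f0, low=50.0, high=500.0, seed=None):
    """Return a copy of f0 in which unvoiced frames get random values.

    Assumption: a frame is treated as unvoiced when its F0 value is <= 0.
    The [low, high] range in Hz is an illustrative choice.
    """
    rng = random.Random(seed)
    return [v if v > 0 else rng.uniform(low, high) for v in f0]


def append_pitch(mfcc_frames, pitch):
    """Append one pitch value to each frame's conventional feature vector."""
    if len(mfcc_frames) != len(pitch):
        raise ValueError("MFCC and pitch streams must have the same frame count")
    return [list(frame) + [p] for frame, p in zip(mfcc_frames, pitch)]


# Toy input: 5 frames with 3 MFCC coefficients each.
# Frames 0 and 4 are unvoiced (F0 == 0); frames 1 to 3 are voiced.
f0 = [0.0, 210.0, 215.0, 220.0, 0.0]
mfcc = [[0.1 * i, 0.2 * i, 0.3 * i] for i in range(5)]

pitch = fill_unvoiced_with_random(f0, seed=0)
features = append_pitch(mfcc, pitch)  # 5 frames, 4 dimensions each
```

In this sketch the pitch dimension of the voiced frames follows a smooth contour, while the values in the unvoiced frames are unrelated to their neighbours, which is the kind of discontinuity that can lead to the abnormal likelihoods mentioned above.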
Besides, some intermediate parameters computed during the extraction of pitch features are also useful for improving recognition performance, but they are ignored in practical applications.