The present invention relates to speech processing and, more particularly, to a statistical method and apparatus for performing pitch extraction in speech recognition, synthesis and regeneration.
It is known that pitch extraction has been an essential part of speech signal processing for decades. Typically, pitch extraction is used in three fields: speech regeneration, text-to-speech synthesis and speech recognition. In speech regeneration, pitch is an essential element in regenerating pleasant-sounding speech. In text-to-speech synthesis, pitch is currently generated by discrete rules, for instance, rules specifying which syllables are high pitch or low pitch. It is also known that pitch may be synthesized from text by statistical methods that use the pitch data from real speech to create a database containing correlations between pitch contour and text. In speech recognition, particularly for tonal languages such as those belonging to the Sino-Tibetan stock of languages, pitch is a necessity. For non-tonal languages, a good pitch parameter may improve recognition accuracy and speed.
In spite of intensive study over the last decades, a totally reliable pitch extraction method is still lacking. This deficiency is substantially due to at least two reasons. The first is the conventional definition of pitch, while the second is the deterministic nature of the conventional methods of pitch extraction. Traditionally, pitch is defined as the fundamental frequency of the voiced sections of a speech signal. This very definition causes problems. First, the distinction between voiced and unvoiced (i.e., whispered) sections of speech is not black and white (i.e., discrete). There are always transitions between a typical voiced section and a typical unvoiced section. That is, there are different degrees of clarity between such sections. Second, since speech is not a periodic phenomenon, the concept of fundamental frequency is not valid in the original sense.
Based on the above definition, the usual method of pitch determination is: (a) separate the silence from the speech signal; (b) separate the voiced part and the unvoiced part of the speech signal (note that the determination between voiced and unvoiced sections of the speech is usually treated as a yes-or-no question, that is, a frame of speech must be labelled either as voiced or unvoiced, and pitch extraction procedures are applied only to the voiced frames); and (c) extract the pitch from a section of the voiced speech signal as if it were a periodic phenomenon. Three methods are typically used for pitch extraction: autocorrelation, cepstrum, and subharmonic summation. See, for example, Wolfgang J. Hess, "Pitch and Voicing Determination", in "Advances in Speech Signal Processing", edited by Sadaoki Furui and M. Mohan Sondhi, Marcel Dekker, Inc., New York, 1991, pp. 3-48; Wolfgang J. Hess, "Pitch Determination of Speech Signals: Algorithms and Devices", Springer-Verlag, Berlin, 1983; and L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals". For example, using the autocorrelation method, the most prominent peak is identified with pitch. However, in many cases, the peak corresponds to a frequency other than the expected pitch. In fact, if a signal contains sinusoidal components of fundamental frequency $f_0$ and its harmonics $n f_0$, where $n$ is an integer, then the peaks in the autocorrelation function can be at all of the following times: $\tau = m / (n f_0)$, where $m$ is another integer. Other methods have the same problem. Thus, the pitch values obtained by conventional methods are not substantially accurate. For the three speech applications mentioned above (i.e., speech recognition, synthesis and regeneration), a continuous and accurate pitch curve is required. However, none of the conventional methods described above satisfactorily meets such requirements.
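The ambiguity of the autocorrelation peak can be illustrated with a minimal sketch. It is not taken from any of the cited references; the sample rate, frame length, amplitudes and function names are all assumptions chosen for illustration. The synthetic frame contains a weak 100 Hz fundamental and a strong second harmonic, and a naive picker that takes the most prominent autocorrelation peak is applied to it.

```python
import math

def autocorrelation(x, max_lag):
    """Biased (unnormalized) autocorrelation r[lag] for lag = 0..max_lag."""
    n = len(x)
    return [sum(x[t] * x[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def local_peaks(r, min_lag):
    """Lags at which r has a positive local maximum."""
    return [k for k in range(max(min_lag, 1), len(r) - 1)
            if r[k] > 0 and r[k - 1] < r[k] >= r[k + 1]]

# Hypothetical test frame (all values are illustrative assumptions).
SAMPLE_RATE = 8000          # samples per second
F0 = 100.0                  # true fundamental frequency in Hz
FRAME_LEN = 240             # 30 ms frame
MIN_LAG, MAX_LAG = 20, 160  # search range, 400 Hz down to 50 Hz

signal = [0.2 * math.cos(2 * math.pi * F0 * t / SAMPLE_RATE)        # weak fundamental
          + 1.0 * math.cos(2 * math.pi * 2 * F0 * t / SAMPLE_RATE)  # strong 2nd harmonic
          for t in range(FRAME_LEN)]

r = autocorrelation(signal, MAX_LAG)

# Several candidate peaks appear, spaced by the period of the harmonic.
peaks = local_peaks(r, MIN_LAG)

# Naive rule: identify the most prominent peak with the pitch period.
best_lag = max(range(MIN_LAG, MAX_LAG + 1), key=lambda k: r[k])
naive_pitch_hz = SAMPLE_RATE / best_lag

print("candidate peak lags (samples):", peaks)
print("naive pitch estimate (Hz):", round(naive_pitch_hz, 1))
```

In this sketch the candidate peaks fall near 40, 80 and 120 samples (5, 10 and 15 ms), and the most prominent one is near 40 samples, so the naive estimate is close to 200 Hz even though the fundamental is 100 Hz. The outcome depends on the assumed amplitudes and on the short frame, which weights small lags more heavily in the biased estimate; other choices can move the winning peak.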
It is observed that even in whispered speech, the tones of tonal languages, e.g., Mandarin Chinese, can be recognized by the human ear. For non-tonal languages, the prosodic pitch contour can also be expressed and perceived. Also, it is to be appreciated that for speech recognition, the derivative of pitch is as important as the absolute value of pitch. At the boundaries between voiced speech and unvoiced speech (assuming the boundaries are defined properly), continuity must be preserved. Thus, it would be highly desirable and advantageous to define a pitch which exists at all times, not only during voiced portions of speech.