1. Field of Invention
The present invention relates to a method for speech quality degradation estimation and a method for degradation measures calculation and apparatuses thereof. More particularly, the present invention relates to a method for speech quality degradation estimation applied to pitch-synchronous prosody modification and a method for degradation measures calculation and apparatuses thereof.
2. Description of Related Art
Text to speech synthesis technology has been under development for a long time, and one of the most important factors in making synthesized speech sound natural is the ability of the system to synthesize speech with rich prosody. Presently, the major technology for modifying speech prosody is Time Domain Pitch Synchronous Overlap-and-Add (TD-PSOLA). TD-PSOLA can modify the original prosody of speech, for example, by changing the first tone of Chinese to the fourth tone, and can produce synthesized speech of very good quality when the degree of modification is kept within a certain range. However, if the prosody of the source speech is very different from the target prosody, TD-PSOLA may reduce the quality of the synthesized speech. In conventional technology, this problem is usually handled by restricting the prosody modification to a fixed acceptable range, but there is no method for automatically predicting the quality of the synthesized speech based on the source speech and the target prosody. If a speech quality prediction mechanism is added to estimate the synthesized speech quality, the prosodies of different speech units can be modified appropriately within their tolerable speech quality ranges, so that synthesized speech of high quality and high fidelity can be produced.
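By way of illustration only, the conventional fixed-range restriction described above may be sketched as follows. The function name `clamp_pitch_ratio` and the limits of 0.5 and 2.0 are hypothetical values chosen for the example and are not taken from any particular system; the sketch merely shows that the same limits are applied to every speech unit, without regard to how well that unit tolerates modification.

```python
def clamp_pitch_ratio(source_f0_hz, target_f0_hz, min_ratio=0.5, max_ratio=2.0):
    """Return the pitch-scale factor actually applied to a speech unit.

    The requested factor is target/source; it is clipped to a fixed range.
    min_ratio and max_ratio are illustrative assumptions, not published limits.
    """
    if source_f0_hz <= 0 or target_f0_hz <= 0:
        raise ValueError("pitch values must be positive")
    requested = target_f0_hz / source_f0_hz
    # The same fixed limits are used for every unit, regardless of its content.
    return max(min_ratio, min(max_ratio, requested))


if __name__ == "__main__":
    # A request to triple the pitch is clipped to the upper limit.
    print(clamp_pitch_ratio(100.0, 300.0))
```

Because the limits are fixed, a unit that could tolerate a larger modification is restricted unnecessarily, while a unit that degrades early may still be modified too far.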
From another point of view, the major existing text to speech synthesis technology is corpus-based speech synthesis, in which suitable speech units are chosen from a previously gathered speech database based on the target speech, and these speech units are concatenated to synthesize speech of high quality. To synthesize high quality speech, the database should be large enough to contain all kinds of tones and prosodies, such as excitement, sadness, and calmness; thus, the required memory space is very large. If a speech quality estimation mechanism is added for determining which target speech units can be synthesized by modifying another speech unit with a prosody modification method, then those target speech units can be deleted from the original corpus. Because the speech quality of the synthesized target speech units can be kept within an acceptable range through the speech quality estimation mechanism, the corpus size can be reduced without quality degradation.
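The corpus reduction idea described above may be sketched, purely as an example, by the following greedy procedure. The function `prune_corpus`, the representation of a speech unit as a (label, pitch) pair, and the placeholder estimator `toy_degradation` are all assumptions made for illustration; in particular, the placeholder estimator is not the estimation method of the present invention.

```python
from math import inf, log2


def toy_degradation(source_unit, target_unit):
    """Placeholder estimator (an assumption): pitch distance in octaves.

    Units are (label, mean_f0_hz) pairs. Units with different labels cannot
    be converted into one another, so the degradation is treated as infinite.
    """
    src_label, src_f0 = source_unit
    tgt_label, tgt_f0 = target_unit
    if src_label != tgt_label:
        return inf
    return abs(log2(tgt_f0 / src_f0))


def prune_corpus(units, estimate_degradation, max_degradation):
    """Keep a unit only if no already-kept unit can be modified into it
    with an estimated degradation of at most max_degradation."""
    kept = []
    for unit in units:
        covered = any(
            estimate_degradation(k, unit) <= max_degradation for k in kept
        )
        if not covered:
            kept.append(unit)
    return kept


if __name__ == "__main__":
    corpus = [("a", 100.0), ("a", 110.0), ("a", 250.0), ("b", 100.0)]
    print(prune_corpus(corpus, toy_degradation, 0.5))
```

In this sketch the result depends on the order in which the units are visited; a practical system might instead select the retained units so as to cover as many target units as possible.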
Thus, a method of estimating the quality of prosody-modified speech is required. To be broadly applicable, this method has to be objective and automatic; that is, no human intervention should be required during prediction or estimation. To be applicable to real-time text to speech synthesis, the method preferably should not need to synthesize the target speech in order to predict its quality. However, none of the existing technologies is satisfactory. First, in the current text to speech synthesis field, there is no objective method for estimating the speech quality of a speech unit modified by a prosody modification method; only the continuities at the concatenation points of speech units can be estimated. In the speech coding and transmission field, neither the Perceptual Speech Quality Measure (PSQM) nor the Perceptual Evaluation of Speech Quality (PESQ) recommended by the International Telecommunication Union (ITU) is suitable for estimating the quality of speech modified by a prosody modification method, because both methods estimate the differences between spectra, whereas the spectrum of the modified speech always changes regardless of the quality of the synthesized speech.
U.S. Pat. No. 5,664,050 discloses a speech quality degradation estimation method. According to this method, a speech recognition system is first set up, and a test utterance produced by a speaker is input into the speech recognition system to obtain a reference score. The synthesized speech is then input into the system to obtain another score. The closer the two scores are, the better the quality of the synthesized speech is considered to be. One disadvantage of this method is that the target speech waveform has to be synthesized. There is also a problem with its speech quality estimation standard, because scores from recognition models may not correspond to speech quality: a low score for the synthesized speech only means that the acoustic distance between the model and the synthesized speech is large, and does not necessarily mean that the speech quality is poor.
The most recent conventional technology is disclosed in a paper by E. Klabbers and J. P. H. van Santen, Center for Spoken Language Understanding, OGI, Eurospeech'03 (hereinafter “OGI”). The steps in the paper include: first, calculating objective quality measures based on the distance between the pitch contours of the source speech and the target speech, and then inputting the objective quality measures into a regression model to calculate objective speech quality scores. Although this method allows objective estimation without speech synthesis, it does not consider how the prosody modification method actually modifies the speech waveform. Instead, a pitch sequence of fixed length is interpolated on the pitch contour of the source speech and on that of the target speech, and only a point to point distance between the two sequences is calculated. Thus, the objective speech quality scores thereof still cannot be used to predict the speech quality accurately.
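The general form of such a contour-distance approach may be sketched as follows, by way of example only. The use of linear interpolation, the root-mean-square distance, the default of ten interpolation points, and the regression coefficients are all assumptions made for illustration; the exact measures and model parameters of the OGI paper are not reproduced here.

```python
from math import sqrt


def resample_contour(pitch_hz, num_points):
    """Linearly interpolate a pitch contour onto num_points evenly spaced positions."""
    if num_points < 2 or len(pitch_hz) < 2:
        raise ValueError("need at least two input and two output points")
    last = len(pitch_hz) - 1
    resampled = []
    for i in range(num_points):
        pos = i * last / (num_points - 1)   # fractional index into the contour
        lo = min(int(pos), last - 1)        # left neighbour, kept in range
        frac = pos - lo
        resampled.append(pitch_hz[lo] * (1.0 - frac) + pitch_hz[lo + 1] * frac)
    return resampled


def contour_distance(source_hz, target_hz, num_points=10):
    """Point to point RMS distance (in Hz) between fixed-length pitch sequences."""
    src = resample_contour(source_hz, num_points)
    tgt = resample_contour(target_hz, num_points)
    return sqrt(sum((s - t) ** 2 for s, t in zip(src, tgt)) / num_points)


def predicted_score(distance, intercept=5.0, slope=-0.02):
    """Linear regression with hypothetical coefficients (not from the paper)."""
    return intercept + slope * distance


if __name__ == "__main__":
    d = contour_distance([100.0, 120.0, 110.0], [150.0, 180.0, 160.0, 140.0])
    print(d, predicted_score(d))
```

As the sketch makes apparent, the source and target contours are compared only after both have been reduced to the same fixed number of points, so the number of pitch periods that the prosody modification method would actually repeat or discard in the waveform plays no part in the score.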