This invention relates to the field of synthesized speech. In particular, the invention relates to statistical enhancement of synthesized speech output from a statistical text-to-speech (TTS) synthesis system.
Synthesized speech is artificially produced human speech generated by computer software or hardware. A TTS system converts language text into a speech signal or waveform suitable for digital-to-analog conversion and playback.
One form of TTS system uses concatenative synthesis, in which pieces of recorded speech are selected from a database and concatenated to form the speech signal conveying the input text. Typically, the stored speech pieces represent phonetic units, e.g. sub-phones, phones, or diphones, appearing in a certain phonetic-linguistic context.
Another class of speech synthesis, referred to as “statistical TTS”, creates the synthesized speech signal by statistical modeling of the human voice. Existing statistical TTS systems are based on hidden Markov models (HMMs) with Gaussian mixture emission probability distributions, so “HMM TTS” and “statistical TTS” may sometimes be used synonymously. However, in principle a statistical TTS system may employ other types of models. Hence the description of the present invention addresses statistical TTS in general, with HMM TTS considered a particular example.
In an HMM-based system the frequency spectrum (vocal tract), fundamental frequency (vocal source), and duration (prosody) of speech may be modeled simultaneously by HMMs. Speech waveforms may be generated from HMMs based on the maximum likelihood criterion.
HMM-based TTS systems have gained popularity in industry and in the speech research community due to certain advantages of this approach over the concatenative synthesis paradigm. However, it is commonly acknowledged that HMM TTS systems produce speech of dimmed quality, lacking the crispness and liveliness that are present in natural speech and preserved to a large extent in concatenative TTS output. In general, the dimmed quality of HMM-based systems is attributed to spectral shape smearing, and in particular to formant widening, as a result of statistical modeling that involves averaging a vast number (e.g. thousands) of feature vectors representing speech frames.
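The widening effect of averaging can be illustrated with a toy numerical sketch. The following is not part of any TTS system: it assumes, purely for illustration, that each speech frame has a single Gaussian-shaped formant peak whose center frequency varies slightly from frame to frame, and that the model simply averages the frame envelopes. All function names and numeric values (1000 Hz center, 80 Hz jitter, 60 Hz bandwidth) are hypothetical.

```python
import math
import random


def formant_peak(freqs, center_hz, bandwidth_hz):
    """Toy spectral envelope: a single Gaussian-shaped formant peak."""
    return [math.exp(-0.5 * ((f - center_hz) / bandwidth_hz) ** 2) for f in freqs]


def half_max_width(freqs, envelope):
    """Width in Hz of the region where the envelope is at least half its peak."""
    half = max(envelope) / 2.0
    above = [f for f, v in zip(freqs, envelope) if v >= half]
    return above[-1] - above[0]


def average_envelopes(envelopes):
    """Point-wise mean of several envelopes sampled on the same frequency grid."""
    n = len(envelopes)
    return [sum(column) / n for column in zip(*envelopes)]


random.seed(0)  # fixed seed so the illustration is repeatable
freqs = [float(f) for f in range(0, 2001, 5)]  # 0..2000 Hz in 5 Hz steps

# Hypothetical frames: same peak shape, center frequency jittered per frame.
centers = [random.gauss(1000.0, 80.0) for _ in range(2000)]
frames = [formant_peak(freqs, c, 60.0) for c in centers]
mean_env = average_envelopes(frames)

single_width = half_max_width(freqs, frames[0])
mean_width = half_max_width(freqs, mean_env)
print("single-frame width (Hz):", single_width)
print("averaged width (Hz):    ", mean_width)
print("averaged peak height:   ", round(max(mean_env), 3))
```

Under these assumptions the averaged envelope has a lower and wider peak than any individual frame, which is the smearing described above. Real statistical TTS systems average feature vectors (for example, cepstral parameters) rather than raw envelopes, so this sketch conveys only the qualitative effect.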
The formant smearing effect has been known for many years in the field of speech coding, although in HMM TTS this effect has a stronger negative impact on the perceptual quality of the output. Some speech enhancement techniques (also known as postfiltering) have been developed for speech codecs in order to compensate for quantization noise and sharpen the formants at the decoding phase. Some TTS systems follow this approach and employ a post-processing enhancement step aimed at partial compensation of the spectral smearing effect.
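As a rough illustration of what sharpening a formant means, the sketch below narrows a smeared peak by raising the peak-normalized envelope to a power greater than one, which is equivalent to scaling the log envelope. This is a simplified stand-in chosen for brevity and is not the postfilter of any particular codec or TTS system. The function names, the exponent `gamma`, and the test envelope are all assumptions made for this example.

```python
import math


def sharpen(envelope, gamma=2.0):
    """Narrow the peaks of an envelope while keeping the peak height.

    The envelope is normalized by its maximum, raised to the power gamma
    (gamma > 1 sharpens), and rescaled to the original maximum.
    """
    peak = max(envelope)
    return [peak * (v / peak) ** gamma for v in envelope]


def half_max_width(freqs, envelope):
    """Width in Hz of the region where the envelope is at least half its peak."""
    half = max(envelope) / 2.0
    above = [f for f, v in zip(freqs, envelope) if v >= half]
    return above[-1] - above[0]


freqs = [float(f) for f in range(0, 2001, 5)]  # 0..2000 Hz in 5 Hz steps

# Hypothetical smeared formant: a wide Gaussian-shaped peak at 1000 Hz.
smeared = [0.6 * math.exp(-0.5 * ((f - 1000.0) / 100.0) ** 2) for f in freqs]
enhanced = sharpen(smeared, gamma=2.0)

smeared_width = half_max_width(freqs, smeared)
enhanced_width = half_max_width(freqs, enhanced)
print("smeared width (Hz): ", smeared_width)
print("enhanced width (Hz):", enhanced_width)
```

The exponent trades sharpness against distortion: larger values of `gamma` narrow the peak further but also deepen the spectral valleys, which is one reason such post-processing can compensate for smearing only partially.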