1. Technical Field
The present disclosure relates to speech synthesis and more specifically to detecting and correcting abnormal stress patterns in synthetic speech.
2. Introduction
Spoken English and numerous other spoken languages include stress patterns which “sound” natural to native speakers. In some instances, stress patterns can disambiguate otherwise confusable words, such as 'ad-dict (an addicted person) and ad-'dict (to make someone dependent on something). Non-native speakers often pronounce the correct sequence of sounds or phones, but with the wrong stress pattern, making their speech difficult for native speakers to recognize. English has a strong-weak alternating rhythm, and each word has its own specific stress pattern; non-native speakers are often unaware of these word-specific patterns and therefore stress the wrong syllables. Similarly, a text-to-speech (TTS) synthesis system sometimes produces incorrect stress patterns, which makes the TTS system sound like a non-native speaker. An incorrect stress pattern is not only disruptive by itself, but also degrades the intelligibility and naturalness of the synthesized speech.
Previous work related to stress in speech synthesis has concentrated on stress assignment, that is, predicting the correct stress patterns from given text. Traditional parametric speech synthesis produces a stream of parameters from rules or from statistics based on a training corpus. Unit selection synthesis, which can produce higher-quality speech by concatenating natural speech segments with less signal processing, brings an unexpected complication. Acoustic units chosen from various locations throughout a recorded corpus and concatenated in novel combinations may convey the wrong lexical stress pattern even though the TTS frontend predicted the correct pattern. Accordingly, what is needed is improved handling of stress in speech synthesis.