Text-to-speech systems generally include two parts; the first typically takes text as input and generates phonetic and prosodic sequences as output, and the second, the synthesis step, typically takes the phonetic and prosodic sequences as input and generates audio as output. Several efforts have historically been made in connection with the second part, but room for improvement continually exists.
Speech synthesis today is mainly done by one of two methods, either formant synthesis or concatenative speech synthesis. Formant systems are small, but require considerable tuning to achieve acceptable quality, and cannot be automatically matched to a reference voice. Concatenative systems can be automatically trained to match a reference voice, but must be quite large to provide acceptable quality, and require a complex dynamic programming process to generate the audio. A need has therefore been recognized in connection with providing an arrangement that is small, fast, and can be easily trained to match a reference voice.
U.S. Pat. No. 5,230,037 (“Phonetic Hidden Markov Model Speech Synthesizer”; Giustiniani et al.) relates to a system for speech synthesis that uses sequences of feature vectors chosen from a model set as the basis for synthesizing speech. The feature vectors, however, are computed by simple averaging over all instances for each model vector. This has the disadvantage of “smearing” the spectra, resulting in distorted audio upon generation.
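The "smearing" effect of simple averaging can be illustrated with a minimal sketch. The two five-bin spectra below are hypothetical values chosen for illustration, not data from Giustiniani et al., and the function name `average_vectors` is likewise an assumption. The two instances carry the same sharp peak, shifted by one bin; their element-wise mean has a lower and broader peak than either instance.

```python
def average_vectors(vectors):
    """Element-wise mean of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Hypothetical magnitude spectra for two instances of the same sound.
# Each has a sharp peak of height 4.0, located one bin apart.
instance_a = [0.0, 1.0, 4.0, 1.0, 0.0]
instance_b = [0.0, 0.0, 1.0, 4.0, 1.0]

averaged = average_vectors([instance_a, instance_b])

# The averaged peak is lower than either original peak and spans two bins.
print(averaged)
```

In this toy case the peak height falls from 4.0 to 2.5 and is spread across two adjacent bins, which is the kind of spectral flattening the paragraph above describes.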
Systems for altering voice characteristics, such as U.S. Pat. No. 4,624,012 (“Method and Apparatus for Converting Voice Characteristics of Synthesized Speech”; Lin et al.) and U.S. Pat. No. 5,113,449 (“Method and Apparatus for Altering Voice Characteristics of Synthesized Speech”; Blanton et al.), rely on modifications of the sampled audio to produce a voice that sounds different, but the types of differences are limited, and the resulting voice cannot be directed to have particular desired characteristics. The system for voice transformation of U.S. Pat. No. 5,847,303 (“Voice Processor with Adaptive Configuration by Parameter”; Matsumoto) addresses subject matter similar to the Lin et al. and Blanton et al. patents, but uses a set of global parameters estimated from a target speaker to perform the transformation. As in those patents, however, the changes are not specific to particular sounds, and so are limited.
Some systems for voice transformation use the spectral envelope of the source speaker together with the excitation signal component of the target individual to generate the target signal, for example, U.S. Pat. No. 5,165,008 (“Speech Synthesis Using Perceptual Linear Prediction Parameters”; Hermansky et al.) and U.S. Pat. No. 6,336,092 (“Targeted Vocal Transformation”; Gibson et al.), which, like Matsumoto (U.S. Pat. No. 5,847,303), supra, discuss a limited global transformation.
In another system, spectral equalization is performed based on parallel utterances by the source and target speaker (U.S. Pat. No. 5,750,912, “Formant Converting Apparatus Modifying Singing Voice to Emulate Model Voice”; Matsumoto); this approach, however, does not accommodate novel utterances.
Other systems use sets of model vectors taken from individual instances of training data, for example, as discussed in U.S. Pat. No. 5,307,442 (“Method and Apparatus for Speaker Individuality Conversion”; Abe et al.), U.S. Pat. No. 5,327,521 (“Speech Transformation System”; Savic et al.) and U.S. Pat. No. 6,463,412 (“High Performance Voice Transformation Apparatus and Method”; Baumgartner et al.). As a result, the model vectors are subject to noise and variations in the reference speakers' performance, thereby degrading the smoothness of the generated audio.
Some voice coding systems also use model vectors taken from individual instances of training data, for example U.S. Pat. No. 5,696,879 (“Method and Apparatus for Improved Voice Transmission”; Cline et al.) and U.S. Pat. No. 5,933,805 (“Retaining Prosody during Speech Analysis for Later Playback”; Boss et al.); the same limitations as with Abe et al., Savic et al., and Baumgartner et al. (all supra) are thus apparent.
One method of voice morphing, as discussed in U.S. Pat. No. 5,749,073 (“System for Automatically Morphing Audio Information”; Slaney), uses a dynamic time warp to align parallel utterances, which are then interpolated using either cross-fading or dynamic frequency warping. Cross-fading, however, does not blend the voices, but only overlaps them. Dynamic frequency warping does blend the voices, but the process is complex.
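The general idea of aligning parallel utterances with a dynamic time warp and then cross-fading the aligned frames can be sketched as follows. This is a textbook formulation under simplifying assumptions (one scalar feature per frame, absolute difference as the local cost), not the implementation of the Slaney patent; the names `dtw_path` and `cross_fade` and the sample sequences are hypothetical.

```python
def dtw_path(source, target):
    """Return a minimum-cost alignment path between two 1-D sequences."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: cheapest cumulative cost of aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(source[i - 1] - target[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    # Backtrack from the final cell to recover which frames were paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return path


def cross_fade(source, target, path, weight):
    """Weighted sum of aligned frames; weight=0 gives source, 1 gives target."""
    return [(1 - weight) * source[i] + weight * target[j] for i, j in path]


# Hypothetical per-frame features; the target holds its second frame longer.
source = [0.0, 1.0, 2.0, 1.0]
target = [0.0, 1.0, 1.0, 2.0, 1.0]

path = dtw_path(source, target)
mixed = cross_fade(source, target, path, 0.5)
print(path)
print(mixed)
```

Note that the cross-fade here is only a weighted sum of the two aligned signals, consistent with the observation above that cross-fading overlaps the voices rather than blending their spectral characteristics.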
In view of the foregoing, a need has been recognized in connection with improving upon the shortcomings and disadvantages of prior efforts.