This invention generally relates to a method and apparatus for converting the voice characteristics of synthesized speech so as to obtain, from a single source of synthesized speech, modified synthesized speech having simulated voice characteristics pertaining to the apparent age and/or sex of the speaker, such that audible synthesized speech may be produced with voice sounds different from those of the audible synthesized speech that would be generated from the original source.
In a general sense, speech analysis researchers have understood that it is possible to modify the acoustical characteristics of a speech signal so as to change the apparent sexual quality of the speech signal. To this end, the article "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave"--Atal and Hanauer, The Journal of the Acoustical Society of America, Vol. 50, No. 2 (Part 2), pp. 637-650 (April 1971) describes the simulation of a female voice from a speech signal obtained from a male voice, wherein selected acoustical characteristics of the original speech signal were altered, e.g. the pitch, the formant frequencies, and their bandwidths.
In another, more detailed approach, the publication "Speech Sounds and Features"--Fant, published by The MIT Press, Cambridge, Mass., pp. 84-93 (1973) sets forth a derived relationship, called k factors or "sex factors", between female and male formants, and determines that these k factors are a function of the particular class of vowels. Each of these two early approaches requires a speech synthesis system capable of employing formant speech data and cannot accept speech encoding schemes based on a speech synthesis technique other than formant synthesis.
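By way of illustration only, the relationship between female and male formants may be sketched as follows. The sketch assumes the percentage form of the k factor commonly attributed to Fant, k = 100 (F_female / F_male - 1); that definition, the function names and the formant values are assumptions made for the example and are not taken from the cited publication as quoted here.

```python
def k_factor(f_female_hz, f_male_hz):
    """Percentage k factor relating a female formant to the male formant.

    Assumed definition: k = 100 * (F_female / F_male - 1).
    """
    return 100.0 * (f_female_hz / f_male_hz - 1.0)


def scale_formant(f_male_hz, k_percent):
    """Rescale a male formant frequency by a given k factor (in percent)."""
    return f_male_hz * (1.0 + k_percent / 100.0)


if __name__ == "__main__":
    # Hypothetical formant values in Hz, chosen only to show the arithmetic.
    male_f1, female_f1 = 500.0, 600.0
    k1 = k_factor(female_f1, male_f1)
    print("k factor (percent):", k1)
    print("rescaled male F1 (Hz):", scale_formant(male_f1, k1))
```

Because the k factors are stated to depend on the class of vowel, a practical rescaling of this kind would need a separate factor for each formant and vowel class rather than a single uniform factor.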
While the conversion of voice characteristics of synthesized speech to produce other voice sounds having simulated voice characteristics pertaining to the apparent age and/or sex of the speaker differing from the voice characteristics of the original synthesized speech offers versatility in speech synthesis systems, heretofore only limited implementation of this general approach has occurred in speech synthesis systems.
A voice modification system relying upon actual human voice sounds, as contrasted to synthesized speech, and changing the original voice sounds to produce other voice sounds which may be distinctly different from the original voice sounds is disclosed and claimed in U.S. Pat. No. 4,241,235 to McCanney, issued Dec. 23, 1980. In this voice modification system, the voice signal source is a microphone or a connection to any source of live or recorded voice sounds or voice sound signals. Such a system is limited in its application to usage where direct modification of spoken speech or recorded speech would be acceptable and where the total speech content is of relatively short duration so as not to entail significant storage requirements if recorded.
One technique of speech synthesis which has received increasing attention in recent years is linear predictive coding (LPC). Linear predictive coding offers a good trade-off between the quality and data rate required in the analysis and synthesis of speech, while also providing an acceptable degree of flexibility in the independent control of acoustical parameters. Speech synthesis systems having linear predictive coding speech synthesizers and operable either by the analysis-synthesis method or by the speech synthesis-by-rule method have been developed heretofore. However, these known speech synthesis systems relying upon linear predictive coding as a speech synthesis technique are difficult to adapt to rescaling or other voice conversion techniques in the absence of formant speech parameters. The conversion from linear predictive coding speech parameters to formant speech parameters to facilitate voice conversion involves solving a nonlinear equation, which is computationally intensive.
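By way of illustration only, the conversion from linear predictive coding parameters to formant parameters is commonly understood to require finding the complex roots of the linear prediction polynomial. The following Python sketch assumes that interpretation, which is not spelled out in the text above. The function names, the sampling rate and the resonance values are hypothetical. Each root at radius r and angle theta is mapped to a frequency theta·fs/(2π) and a bandwidth -(fs/π)·ln r.

```python
import numpy as np


def lpc_poly_from_resonances(formants_hz, bandwidths_hz, fs):
    """Build an inverse-filter polynomial A(z) from known resonances.

    Used here only to produce test coefficients with a known answer.
    """
    poly = np.array([1.0])
    for f, bw in zip(formants_hz, bandwidths_hz):
        r = np.exp(-np.pi * bw / fs)      # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs      # pole angle from frequency
        # One conjugate pole pair: 1 - 2 r cos(theta) z^-1 + r^2 z^-2
        section = [1.0, -2.0 * r * np.cos(theta), r * r]
        poly = np.convolve(poly, section)
    return poly


def formants_from_lpc(a_poly, fs):
    """Estimate formant frequencies and bandwidths from A(z) coefficients.

    a_poly holds [1, c1, ..., cp], the coefficients of A(z) in powers of z^-1.
    """
    roots = np.roots(a_poly)              # numerical polynomial root finding
    roots = roots[np.imag(roots) > 0]     # keep one root of each conjugate pair
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    order = np.argsort(freqs)
    return freqs[order], bws[order]


if __name__ == "__main__":
    fs = 8000.0                           # hypothetical sampling rate in Hz
    a = lpc_poly_from_resonances([500.0, 1500.0, 2500.0],
                                 [60.0, 90.0, 120.0], fs)
    f, b = formants_from_lpc(a, fs)
    print("formants (Hz):", np.round(f, 2))
    print("bandwidths (Hz):", np.round(b, 2))
```

In a synthesizer this root finding would have to be repeated for every frame of speech data, which gives some sense of why the conversion is regarded as computationally intensive.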
Text-to-speech systems relying upon speech synthesis have the potential of providing synthesized speech with a virtually unlimited vocabulary as derived from a prestored component sounds library which may consist of allophones or phonemes, for example. Typically, the component sounds library comprises a read-only memory storing digital speech data representative of the voice components from which words, phrases and sentences may be formed, the digital speech data being derived from a male adult voice. A factor in the selection of a male voice for this purpose is that the male adult voice in the usual instance offers a low pitch profile which seems to be best suited to speech analysis software and speech synthesizers currently employed. A text-to-speech system relying upon synthesized speech from a male voice could be rendered more flexible and true-to-life by providing audible synthesized speech with varying voice characteristics depending upon the identity of the characters in the text (i.e., whether male or female, child, teenager, adult or whimsical character, such as a "talking" dog, etc.). Storage limitations in the read-only memory serving as the component sounds library render it impractical to provide separate sets of digital speech data corresponding to each of the voice characteristics for the respective "speaking" characters in the text material being converted to speech by speech synthesis techniques.