This invention generally relates to a method and apparatus for altering the voice characteristics of synthesized speech, so that modified synthesized speech having any one of a plurality of voice sounds may be obtained from a single applied source of synthesized speech. Audible synthesized speech generated from the original source may thereby have a significantly different voice quality, affecting the apparent age and/or sex attributed to the supposed person speaking. In particular, a plurality of voice sounds of apparently non-human origin and of fanciful or whimsical quality, such as speaking animals, birds, monsters, etc., are producible from a single source of synthesized speech by effecting a simulated adjustment in the sampling period of the digital speech data from the source, thereby altering the vocal tract model of the digital speech data to a preselected degree without affecting the pitch period and the speech rate implicit in the original source of synthesized speech.
Speech analysis researchers have generally appreciated the possibility of changing the acoustical characteristics of a speech signal so as to alter the apparent voice characteristics associated with that signal. In this respect, the article "Speech Analysis and Synthesis by Linear Prediction of the Speech Wave" by Atal and Hanauer, The Journal of the Acoustical Society of America, Vol. 50, No. 2 (Part 2), pp. 637-650 (April 1971), describes the simulation of a female voice from a speech signal obtained from a male voice, wherein selected acoustical characteristics of the original speech model were altered, e.g. the pitch, the formant frequencies, and their bandwidths.
Fant, in the publication "Speech Sounds and Features", published by The MIT Press, Cambridge, Mass., pp. 84-93 (1973), describes a derived relationship between female and male formants, called k factors or "sex factors", and suggests that these k factors are a function of the particular class of vowels.
In addition, U.S. Pat. No. 4,241,235 to McCanney, issued Dec. 23, 1980, discloses a voice modification system which relies upon actual human voice sounds, as contrasted with synthesized speech, wherein the original voice sounds are changed to produce other voice sounds distinctly different from the original voice sounds. In this voice modification system, the voice signal source is a microphone or a connection to any source of live or recorded voice sounds or voice sound signals. This type of voice modification system is limited in application to situations where direct modification of spoken or recorded speech would be acceptable and where the total speech content is of relatively short duration, so as not to impose significant storage requirements if recorded.
One technique of speech synthesis which has received increasing attention in recent years is linear predictive coding (LPC). It has been found that linear predictive coding offers a good trade-off between the quality and data rate required in the analysis and synthesis of speech, while also providing an acceptable degree of flexibility in the independent control of acoustical parameters.
Text-to-speech systems relying upon speech synthesis have the potential of providing synthesized speech with a virtually unlimited vocabulary derived from a prestored component sounds library, which may consist of allophones or phonemes, for example. Typically, the component sounds library comprises a read-only memory containing digital speech data, derived from an adult male voice, representative of the voice components from which words, phrases and sentences may be formed. A factor in the selection of a male voice for this purpose is that the adult male voice usually offers a low pitch profile, which seems to be best suited to the speech analysis software and speech synthesizers currently employed. A text-to-speech system relying upon synthesized speech from a male voice could be rendered more flexible, without requiring any increase in memory storage, by altering the voice characteristics of the original source of synthesized speech to produce a plurality of voice sounds of different speech character depending upon the identity of the characters in the text.
In this respect, copending U.S. patent application Ser. No. 375,434 filed May 6, 1982, now U.S. Pat. No. 4,624,012 issued Nov. 18, 1986, discloses a method and apparatus for converting the voice characteristics of synthesized speech obtained from a single applied source of synthesized speech. The technique disclosed in the latter U.S. application, now U.S. Pat. No. 4,624,012, relies upon separating the pitch period, the vocal tract model, and the speech rate contained in the source of synthesized speech into respective speech parameters. The values of the pitch and the speech data rate are then varied in a preselected manner as determined by a selected change in the sampling rate, while the vocal tract model is retained in its original form. The changed speech data parameters are then recombined with the original vocal tract model to create a modified synthesized speech data format having voice characteristics different from those of the synthesized speech from the source. Thus, the technique described in the aforesaid U.S. application Ser. No. 375,434 filed May 6, 1982, now U.S. Pat. No. 4,624,012, in its preferred form involves an actual change of the sampling rate. The modified sampling rate is employed with the original pitch period data and the original speech rate data to develop a modified pitch period and a modified speech rate, which are recombined with the original vocal tract speech parameters to produce the modified speech data format. From this modified format, a speech synthesizer and an audio means may generate audible synthesized human speech having voice characteristics different from those of the synthesized human speech which would have been obtained from the original source of synthesized speech.
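The interaction between an actual change in sampling rate and the pitch period and speech rate can be illustrated with simple arithmetic. The Python sketch below is illustrative only and is not taken from U.S. Pat. No. 4,624,012. It assumes that the pitch period and frame length are stored as sample counts and that the vocal tract filter coefficients are left untouched, so that the spectral envelope scales in proportion to the sampling rate. The names `Frame`, `effect_of_rate_change` and `rescale_for_new_rate`, and all numerical values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Hypothetical speech frame; all field names are illustrative assumptions."""
    pitch_period_samples: int   # samples between excitation pulses
    frame_length_samples: int   # samples synthesized per parameter frame
    formant_hz: float           # one formant frequency at the ORIGINAL sampling rate


def effect_of_rate_change(frame: Frame, fs_old: float, fs_new: float) -> dict:
    """Physical values heard when the same frame is played out at fs_new."""
    ratio = fs_new / fs_old
    return {
        "pitch_hz": fs_new / frame.pitch_period_samples,
        "frame_seconds": frame.frame_length_samples / fs_new,
        # Unchanged filter coefficients: the spectral envelope scales with the rate.
        "formant_hz": frame.formant_hz * ratio,
    }


def rescale_for_new_rate(frame: Frame, fs_old: float, fs_new: float) -> Frame:
    """Derive a modified pitch period and frame length from the originals and the
    new sampling rate, so that pitch in Hz and frame duration are preserved."""
    ratio = fs_new / fs_old
    return Frame(
        pitch_period_samples=round(frame.pitch_period_samples * ratio),
        frame_length_samples=round(frame.frame_length_samples * ratio),
        formant_hz=frame.formant_hz,  # vocal tract parameters kept in original form
    )


original = Frame(pitch_period_samples=80, frame_length_samples=200, formant_hz=500.0)

# Played at 10 kHz instead of 8 kHz, pitch, speech rate and formants all shift.
shifted = effect_of_rate_change(original, 8000.0, 10000.0)

# With a rescaled pitch period and frame length, only the formants shift.
rescaled = rescale_for_new_rate(original, 8000.0, 10000.0)
compensated = effect_of_rate_change(rescaled, 8000.0, 10000.0)
```

In this sketch the pitch period and frame length are rescaled so as to restore their original physical values. Choosing other preselected values would alter the pitch or the speech rate independently of the shift in the formants.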