There are many text-to-speech devices in the prior art. As can be verified in the prior art literature, it has been generally accepted that, since the energy of typical human speech is distributed over a frequency spectrum extending up to 5,000 hertz, a sampling rate of 10,000 samples per second (twice the upper frequency value of the accepted human speech frequency spectrum) provides sufficient points, or ordinate lengths, to generate an accurate analog waveform representing a spoken version of the text. Such sampling does in fact provide an analog waveform representing the spoken version of the text, but if the imitated speaker is a female with a relatively high-pitched voice, the imitation speech generated by prior art devices is of poor quality.
It is well understood in the speech simulation art that the sounds developed by the opening and closing of the human vocal cords (called voiced sounds, as distinguished from aspiration sounds and frication sounds) have a fundamental frequency in the range of 50 cps to 400 cps. The speech of a typical female, having a somewhat high-pitched voice, in all probability emanates, at least in part, from vocal cords opening and closing at a frequency somewhere between 160 cps and 400 cps. In considering the simulation of female speech, I have found that if a digitized glottal waveform, developed to provide a major component of an imitation of female speech, is sampled at the traditional rate of 10,000 samples per second for ultimate transformation into an analog signal, the resulting female speech is of poor quality. I have further found that if the digitized glottal waveform is generated so as to provide enough information (temporal accuracy in the specification of the fundamental frequency) to support 40,000 samples per second, such a waveform provides the basis for improving the quality of the female speech being generated. Since the digital signal processor used to generate the digitized glottal waveform is limited in its ability to perform digital filtering at sample rates above 10,000 samples per second, the digitized glottal waveform (having sufficient information to provide 40,000 samples per second) must be down sampled to the rate of 10,000 samples per second. In order to preserve some of the advantages of the increased information, the present system low pass filters the waveform before sampling at the lower rate, which removes high frequency signal components and provides a desirable averaging operation. Accordingly, the system provides the resulting waveform at 10,000 samples per second to be combined by software with waveforms from other sound sources.
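The gain in temporal accuracy at the higher rate can be illustrated with a little arithmetic. The sketch below is not taken from the present system; it assumes, purely for illustration, that a pitch period must be a whole number of samples, and the function name `achievable_f0` is hypothetical. Under that assumption it lists the fundamental frequencies reachable around 400 cps at each sample rate.

```python
def achievable_f0(sample_rate_hz, target_f0_hz):
    """Return the fundamental frequencies (in cps) just below, at, and just
    above the target when the pitch period is restricted to a whole number
    of samples. The whole-sample restriction is an illustrative assumption."""
    period = round(sample_rate_hz / target_f0_hz)  # nearest whole-sample period
    # A longer period gives a lower frequency, a shorter period a higher one.
    return [sample_rate_hz / p for p in (period + 1, period, period - 1)]


if __name__ == "__main__":
    for rate in (10_000, 40_000):
        below, nearest, above = achievable_f0(rate, 400.0)
        step_percent = 100.0 * (above - nearest) / nearest
        print(f"{rate} samples/s: {below:.1f}, {nearest:.1f}, {above:.1f} cps "
              f"(next step up is {step_percent:.2f}% higher)")
```

At 10,000 samples per second a 400 cps period spans 25 samples, so the neighboring whole-sample periods (24 and 26 samples) land roughly 4 percent away in frequency. At 40,000 samples per second the same period spans 100 samples and the neighboring steps are roughly 1 percent away, which is the sense in which the higher rate specifies a high fundamental frequency more accurately.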
The down sampled waveform has nonetheless been the basis for very much improved quality of the generated female speech and slightly improved quality of the generated male speech.
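The low pass filtering and down sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the filter of the present system: a unit impulse per pitch period stands in for the glottal waveform, a four-sample moving average stands in for the low pass filter, and the names `pulse_train`, `moving_average` and `downsample` are hypothetical.

```python
HIGH_RATE = 40_000               # samples/s at which the waveform is specified
LOW_RATE = 10_000                # samples/s at which the processor can filter
FACTOR = HIGH_RATE // LOW_RATE   # down sampling factor (4)


def pulse_train(f0_hz, n_samples, rate=HIGH_RATE):
    """Hypothetical stand-in for a glottal source: one unit impulse at the
    start of each pitch period, with the period rounded to whole samples."""
    period = round(rate / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def moving_average(signal, width=FACTOR):
    """Crude low pass filter (an assumption, not the patent's filter): each
    output is the sum of the current sample and the width-1 samples before
    it, divided by width. Samples before the start are treated as zero."""
    out = []
    running = 0.0
    for n, x in enumerate(signal):
        running += x
        if n >= width:
            running -= signal[n - width]  # drop the sample leaving the window
        out.append(running / width)
    return out


def downsample(signal, factor=FACTOR):
    """Low pass filter first, then keep one sample out of every `factor`.
    Starting at index factor-1 makes each kept sample the average of one
    disjoint block of `factor` input samples."""
    smoothed = moving_average(signal, factor)
    return smoothed[factor - 1::factor]


if __name__ == "__main__":
    high = pulse_train(390.0, 4000)   # 0.1 s of a 390 cps source at 40,000 samples/s
    low = downsample(high)            # the same 0.1 s at 10,000 samples/s
    pulse_positions = [i for i, v in enumerate(low) if v > 0.0]
    print(len(low), pulse_positions[:5])
```

In this sketch the 390 cps source has a period of 103 samples at the high rate. After down sampling, the pulses fall 25 or 26 samples apart, so the average spacing of 25.75 low-rate samples still reflects the period chosen at 40,000 samples per second, whereas a source generated directly at 10,000 samples per second under the same whole-sample assumption would be fixed at 26 samples.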