The present invention relates to an improved electronic device for phonetically synthesizing human speech.
Until recently, development in this area had produced only extremely complicated and costly devices that generated very unnatural sounding speech. This was primarily because the designers of these first-generation synthesizers, with virtually no prior development to build upon, attempted to build a device capable of performing substantially every known function of human speech. Consequently, the systems that resulted were capable of performing few functions satisfactorily.
Typical of the design approach of these early speech synthesizers was the treatment accorded the transitional periods between phonemes. In recognition of the importance of the transitional periods in human speech, some systems devoted substantial effort to the production of various transitional waveforms to simulate the actual human articulation between steady-state phoneme conditions. However, the highly complex circuitry required to analyze, control and integrate the production of these waveforms into smooth flowing phonetic speech made the systems highly impractical for commercial use. The complexity of these systems prompted subsequent research efforts to simplify the original systems.
The relatively recent developments in this area have essentially conceded the fact that the precise duplication of the human speech system is an unattainable goal, and have instead sought to design an approximation of the human speech system that will produce acceptable sounding speech. Without discounting the importance of interphoneme transitions, the principal result of this development has been the change from the highly complex system of interphoneme transitions previously discussed, to a simplified approach that employs relatively slow-acting filters that smooth the abrupt variations in the control parameters that determine the steady-state conditions of individual phonemes.
Accordingly, it is the primary object of the present invention to provide an improved speech synthesizer that not only is relatively uncomplicated and inexpensive, but is also capable of producing remarkably natural sounding speech. In addition, the present system is designed to be readily adaptable to a wide range of commercial uses.
Furthermore, it is another object of the present invention to provide a speech synthesizer that will produce very natural sounding speech without the aid of an experienced programmer. This makes the present system particularly adapted for use in connection with a digital computer as a text-to-audio converter.
The preferred embodiment of the present invention comprises a system that is adapted to convert digitized data, such as the output from a computer or other digital device, into electronically synthesized human speech by producing and integrating together the phonemes and allophones of speech. The basic digital command word which drives the present voice system preferably comprises twelve bits. Seven of these bits are allocated to phoneme selection to define a particular phoneme, pause or control function, thus providing a maximum of 2^7 or 128 different commands. The increased capacity over that required to produce the basic phoneme sounds allows the present system to reproduce a greater variety of allophones, which represent basic phonemes that are slightly altered to integrate more appropriately into the variability of speech. For example, the "ae" phoneme in the word "happen" is different from that in the word "bat". Similarly, the beginning "k" phoneme in the word "kick" is different from the "k" phoneme in the word "quit". In addition, the increased capacity permits the present system to devote various commands to the production of phonemes unique to certain foreign languages, thus providing the system with the capability of producing high quality foreign speech as well.
Three of the twelve data bits in the input command word are used for inflection control. This provides 2^3 or eight different inflection levels per phoneme, which gives the system the ability to reproduce the smooth and subtle movements in pitch characteristic of human speech. The remaining two data bits in each input command word are used to vary the rate of phoneme production, thereby providing four possible time intervals for each phoneme produced, allowing phonemes to be more contextually precise in time duration.
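The field widths described above (seven phoneme bits, three inflection bits, two rate bits) can be illustrated with a minimal sketch. The text states the widths but not the order of the fields within the word, so the bit positions used here are an assumption chosen only for illustration, and the names `CommandWord`, `encode_command` and `decode_command` are hypothetical.

```python
from typing import NamedTuple

# Field widths stated in the text: 7 + 3 + 2 = 12 bits.
PHONEME_BITS = 7
INFLECTION_BITS = 3
RATE_BITS = 2

# ASSUMPTION: the bit positions below are illustrative only. The text gives
# the field widths but not where each field sits inside the word.
PHONEME_SHIFT = 0
INFLECTION_SHIFT = PHONEME_BITS
RATE_SHIFT = PHONEME_BITS + INFLECTION_BITS


class CommandWord(NamedTuple):
    phoneme: int     # 0..127, selects a phoneme, pause or control function
    inflection: int  # 0..7, one of eight inflection levels
    rate: int        # 0..3, one of four time intervals


def _mask(bits: int) -> int:
    """Return a mask with the lowest `bits` bits set."""
    return (1 << bits) - 1


def encode_command(phoneme: int, inflection: int, rate: int) -> int:
    """Pack the three fields into one 12-bit integer."""
    fields = (
        ("phoneme", phoneme, PHONEME_BITS),
        ("inflection", inflection, INFLECTION_BITS),
        ("rate", rate, RATE_BITS),
    )
    for name, value, bits in fields:
        if not 0 <= value <= _mask(bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (
        (phoneme << PHONEME_SHIFT)
        | (inflection << INFLECTION_SHIFT)
        | (rate << RATE_SHIFT)
    )


def decode_command(word: int) -> CommandWord:
    """Split a 12-bit integer back into its three fields."""
    if not 0 <= word < (1 << 12):
        raise ValueError("command word must fit in 12 bits")
    return CommandWord(
        phoneme=(word >> PHONEME_SHIFT) & _mask(PHONEME_BITS),
        inflection=(word >> INFLECTION_SHIFT) & _mask(INFLECTION_BITS),
        rate=(word >> RATE_SHIFT) & _mask(RATE_BITS),
    )
```

Under this assumed layout, `decode_command(encode_command(65, 5, 2))` returns the same three field values that were packed.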
The seven bits that define the particular phoneme are provided to an input control circuit which produces a plurality of predetermined control signal parameters that electronically define the phoneme selected. The control signals produced by the input control circuit are preferably in the form of serialized binary-weighted square wave signals whose average values are equivalent to the analog control signals they represent. By producing digital representations of analog signals, the present system avoids the necessity of employing complicated electronic circuitry required to accurately control analog signals.
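One way to picture a serialized binary-weighted square wave is as a repeating frame in which each bit of a parameter value is held for a time proportional to its binary weight, so that the average level of the frame tracks the value. The sketch below is a software illustration of that idea, not the circuit itself; the 4-bit resolution and the most-significant-bit-first ordering are assumptions, and the function names are hypothetical.

```python
from typing import List


def binary_weighted_frame(value: int, bits: int = 4) -> List[int]:
    """Serialize `value` into one frame of 0/1 time slots.

    Bit k of the value is held for 2**k slots, so a frame has 2**bits - 1
    slots in total. ASSUMPTION: 4-bit resolution and MSB-first ordering are
    illustrative; the text does not specify either.
    """
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given number of bits")
    frame: List[int] = []
    for k in reversed(range(bits)):
        level = (value >> k) & 1
        frame.extend([level] * (1 << k))
    return frame


def frame_average(frame: List[int]) -> float:
    """Average level of a frame, standing in for a low-pass filter output."""
    return sum(frame) / len(frame)
```

For a 4-bit value, the frame holds 15 slots and the number of high slots equals the value, so the average is the value divided by 15.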
The control signal parameters from the input control circuit are first passed through a series of relatively slow-acting transition filters which smooth the abrupt amplitude variations in the signals. From there, the control signals are provided to various dynamic articulation control circuits which combine and process the parameters to produce excitation control and vocal tract control signals analogous to the muscle commands from the brain to the vocal tract, glottis, tongue and mouth in the human speech mechanism.
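The smoothing action of a slow-acting transition filter can be sketched with a first-order low-pass filter applied to a control parameter that jumps between two phoneme values. This is an illustrative stand-in only: the filter order, the coefficient `alpha` and the function name are assumptions, since the text does not give the filter characteristics.

```python
from typing import List, Sequence


def transition_filter(samples: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Smooth a control signal with a first-order low-pass filter.

    ASSUMPTION: a one-pole filter with coefficient `alpha` is used here only
    to illustrate the smoothing of abrupt steps. A smaller alpha gives a
    slower transition.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out: List[float] = []
    y = samples[0] if samples else 0.0
    for x in samples:
        y += alpha * (x - y)  # move a fraction of the way toward the target
        out.append(y)
    return out
```

Fed a parameter that steps from 0.0 to 1.0, the output rises gradually toward the new value rather than jumping, which is the behavior the transition filters are described as providing.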
The system further includes vocal and fricative excitation sources which receive the excitation control signals that determine the various signal characteristics of the basic voiced and unvoiced signal quantities in human speech. The vocal excitation source produces a glottal waveform that mimics the glottis as it vibrates in the human vocal tract. The fricative source simulates the sound of air passing through a restricted opening as occurs in the pronunciation of the phonemes "s", "f" and "h".
The vocal and fricative excitation signals, as well as the vocal tract control signals, are supplied to a series of cascaded resonant filters which simulates the multiple resonant cavities in the human vocal tract. The control signals adjust the characteristic resonances of the filters to produce an audio signal having the desired frequency spectrum.
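The effect of cascading resonant filters can be illustrated numerically. The sketch below models each resonance as a two-pole digital resonator and multiplies the individual gains to obtain the gain of the cascade. This is a digital approximation chosen for illustration; the described system uses electronic filters, and the resonance frequencies, bandwidths, sample rate and function names here are all assumptions.

```python
import cmath
import math
from typing import Sequence, Tuple


def resonator_gain(freq_hz: float, center_hz: float, bandwidth_hz: float,
                   sample_rate_hz: float) -> float:
    """Magnitude response of H(z) = 1 / (1 - 2r*cos(t)*z^-1 + r^2*z^-2).

    The poles sit at radius r and angle t, where t is set by the center
    frequency and r by the bandwidth.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate_hz)
    theta = 2.0 * math.pi * center_hz / sample_rate_hz
    z_inv = cmath.exp(-1j * 2.0 * math.pi * freq_hz / sample_rate_hz)
    denom = 1.0 - 2.0 * r * math.cos(theta) * z_inv + (r * r) * z_inv * z_inv
    return 1.0 / abs(denom)


def cascade_gain(freq_hz: float,
                 resonances: Sequence[Tuple[float, float]],
                 sample_rate_hz: float = 10000.0) -> float:
    """Gain of a cascade: the product of the individual resonator gains.

    `resonances` is a sequence of (center_hz, bandwidth_hz) pairs.
    ASSUMPTION: all numeric values passed in are hypothetical examples.
    """
    gain = 1.0
    for center_hz, bandwidth_hz in resonances:
        gain *= resonator_gain(freq_hz, center_hz, bandwidth_hz, sample_rate_hz)
    return gain
```

With hypothetical resonances at 500, 1500 and 2500 Hz, the cascade passes 500 Hz with far more gain than 1000 Hz. Moving the first resonance to 700 Hz moves the spectral peak with it, which is the sense in which control signals shape the output spectrum.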
The two rate bits in the original input command word are converted to a duty cycle rate control signal that is provided to the phoneme clock which defines the time interval of the particular phoneme generated. The three remaining inflection bits in the input command word are used to generate an analog inflection control signal that is provided to the vocal excitation source to determine the "pitch" or frequency of the glottal waveform.
The preferred form of the present invention also includes circuitry that automatically alters the inflection levels of various phonemes in accordance with certain parameter control signals. As a result, the voice generated by the present system is less monotonic and more natural sounding than those of previous systems, especially when manual programming of inflection is impractical or not used.
In addition, the present invention utilizes a novel glottal waveform that more accurately simulates the actions of the human glottis. The new glottal waveform comprises a truncated sawtooth waveform which produces both odd and even harmonics. Also included in the glottal waveform is the addition of a high frequency formant that increases the spectral energy of the waveform at high frequencies. The increased energy at high frequencies improves the relative spectral amplitude of the lower formants as well.
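The harmonic content of such a waveform can be checked numerically. The sketch below computes the first few harmonic magnitudes of one period with a direct discrete Fourier sum, and compares a truncated sawtooth with a symmetric square wave, which contains only odd harmonics. The exact shape of the truncation is not given in this passage, so the shape used here (a ramp that ends part-way through the period, followed by a flat zero portion) is an assumption, as are the sample counts and function names.

```python
import cmath
import math
from typing import List


def harmonic_magnitudes(period: List[float], count: int) -> List[float]:
    """Magnitudes of harmonics 1..count of one period, by direct DFT sum."""
    n = len(period)
    mags: List[float] = []
    for k in range(1, count + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(period))
        mags.append(abs(acc) / n)
    return mags


def truncated_sawtooth(n: int = 256, ramp: int = 160) -> List[float]:
    """One period: a linear ramp for `ramp` samples, then zero.

    ASSUMPTION: this is one plausible reading of "truncated sawtooth",
    used only for illustration.
    """
    return [i / ramp if i < ramp else 0.0 for i in range(n)]


def square_wave(n: int = 256) -> List[float]:
    """One period of a 50 percent duty cycle square wave, for comparison."""
    return [1.0 if i < n // 2 else 0.0 for i in range(n)]
```

For the assumed truncated sawtooth, harmonics 1 through 4 are all clearly non-zero, whereas the second and fourth harmonics of the square wave vanish.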
The vocal tract of the present invention has also been improved by adding movement to the fourth order resonant filter in the vocal tract. This is particularly significant because it is accomplished without requiring the generation of additional control parameters that would increase the complexity of the system. Rather, the fourth resonant filter is made variable under the control of the same control signal that determines the location of the third resonant pole.
The present invention additionally incorporates into the vocal tract the suppression of vocal resonances to simulate the reduced impedance that is reflected in the human vocal tract when the glottis is opened. In particular, the present system includes a circuit that is adapted to produce a variable pulse-width square wave signal whose duty cycle is proportional to the magnitude of the glottal waveform. The glottal suppression duty cycle signal is then provided to a series of analog control gates connected across the bandpass sections of the first three resonant filters in the vocal tract. The effect is to dampen the resonances while the glottis is open by increasing the bandwidths of the resonant filters as the magnitude of the glottal waveform increases.
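The relationship described above, in which a larger glottal magnitude gives a larger duty cycle and therefore a wider resonance bandwidth, can be sketched as two small functions. The linear mapping from duty cycle to bandwidth and the bandwidth figures are hypothetical values chosen for illustration; the text states only that the bandwidth increases with the magnitude of the glottal waveform.

```python
def suppression_duty_cycle(glottal_sample: float, peak: float = 1.0) -> float:
    """Duty cycle (0..1) proportional to the glottal waveform magnitude."""
    if peak <= 0.0:
        raise ValueError("peak must be positive")
    return min(abs(glottal_sample) / peak, 1.0)


def effective_bandwidth_hz(duty_cycle: float,
                           closed_bw_hz: float = 60.0,
                           open_bw_hz: float = 200.0) -> float:
    """Bandwidth of a resonance for a given suppression duty cycle.

    ASSUMPTION: the linear interpolation and the 60 Hz / 200 Hz end points
    are hypothetical. They only illustrate that a higher duty cycle widens
    the bandwidth and so damps the resonance.
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be in [0, 1]")
    return closed_bw_hz + duty_cycle * (open_bw_hz - closed_bw_hz)
```

With these assumed end points, a glottal sample at half of the peak magnitude gives a duty cycle of 0.5 and a bandwidth midway between the closed and open values.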
Finally, the present invention includes a flag command decode and control circuit which provides the programmer with the ability to vary the overall volume and speech rate of the audio output. The circuit is also capable of introducing into the speech pattern a silent phoneme which is articulated in the same manner as a voiced phoneme to add to the naturalness of the speech generated. As will subsequently be described in greater detail, the silent phoneme is intended primarily for use in combination with certain phonemes which sound more natural if their articulation pattern is formed prior to, or maintained for a brief period after, the application of excitation energy to the vocal tract.
The flag circuit is designed to be activated by a specific 7-bit phoneme code that distinguishes the flag command from other phoneme commands. The remaining five bits in the flag command word are then used to select the sound level and speech rate desired, and to indicate whether the succeeding phoneme period is to be silent. In addition, the flag command phoneme is adapted to consume a very brief time interval so that the normal phonetic makeup of a message is not noticeably altered. This is accomplished by latching the desired flag information and commanding the synthesizer to immediately proceed to the next phoneme.
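Decoding of a flag command can be sketched as follows. This passage does not give the flag phoneme code or say how the five remaining bits are divided among sound level, speech rate and the silent indicator, so every constant below is hypothetical: the flag code 0x7F, the placement of the phoneme field in the low seven bits, and the 2/2/1 split of the remaining bits are assumptions made only to produce a concrete example.

```python
from typing import NamedTuple, Optional

# HYPOTHETICAL: the actual 7-bit flag code is not given in this passage.
FLAG_PHONEME_CODE = 0x7F


class FlagSettings(NamedTuple):
    volume: int        # assumed 2-bit sound level, 0..3
    rate: int          # assumed 2-bit speech rate, 0..3
    silent_next: bool  # assumed 1-bit marker for a silent succeeding phoneme


def decode_flag(word: int) -> Optional[FlagSettings]:
    """Return the latched flag settings, or None for an ordinary phoneme.

    ASSUMPTION: the phoneme field occupies the low 7 bits and the other five
    bits are split 2/2/1. Both choices are illustrative only.
    """
    if not 0 <= word < (1 << 12):
        raise ValueError("command word must fit in 12 bits")
    if (word & 0x7F) != FLAG_PHONEME_CODE:
        return None  # not a flag command
    payload = (word >> 7) & 0x1F
    return FlagSettings(
        volume=payload & 0x3,
        rate=(payload >> 2) & 0x3,
        silent_next=bool((payload >> 4) & 0x1),
    )
```

In a fuller model, a non-`None` result would be latched and the synthesizer would move on to the next command word at once, consistent with the brief time interval described above.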
In reading the following detailed description of the preferred embodiment, however, it is to be understood that the practice of the present invention is not limited to the exact system described herein. Rather, the concepts of the present invention are equally applicable to other basic speech systems without departing significantly from the teachings of the present invention.