The present invention relates to speech synthesizers and in particular to an improved phoneme-based speech synthesizer that is capable of producing high quality speech, yet is inexpensive to manufacture and requires a low data bit rate.
Since human speech is an analog process, it is not surprising that most speech synthesizers heretofore developed have been analog synthesizers. While successful high quality analog synthesizers have been designed, it has generally been recognized that a digital system capable of producing comparable speech quality would be preferred because of the reliability, size and cost advantages associated with digital circuitry. Towards this end, a voice compression technique called linear predictive coding (LPC) has been recently developed which utilizes a digital filter to model the human vocal tract. While this approach to speech synthesis appears to have promise, LPC systems are typically quite complex and require relatively high input data rates to produce quality speech. Consequently, the above noted advantages of such a digital synthesizer are compromised.
The present speech synthesizer utilizes a novel, highly simplified analog vocal tract which requires only four control parameters to produce high quality speech, and drives the vocal tract with a completely digital control system. The result is a speech synthesizer that is highly simplified, exceptionally cost-effective, and yet is capable of producing a level of speech quality that equals or exceeds that of the most sophisticated designs presently available. Moreover, since the present speech synthesizer is a phoneme-based synthesizer, the input data rate required to drive the system is very low.
The vocal tract used in the present system comprises an analog delay line (ADL) which accurately simulates the characteristics of the human vocal tract. Unlike conventional speech synthesizers which employ vocal tracts comprising a plurality of cascaded or parallel connected resonant filters, the present ADL vocal tract comprises a single interactive bilateral filter network.
In general, the design of the ADL vocal tract is based upon the electronic model of the human vocal tract which simulates the effects of changing vocal tract geometry. It has long been known that the acoustical characteristics of the human vocal tract are varied by changes in the cross-sectional area of the vocal tract at different points along its length. In this respect, the human vocal tract exhibits the acoustical characteristics of an acoustic tube whose cross-sectional dimensions are small relative to the wavelengths of the frequencies generated. An acoustical system of this type can be represented electrically by a plurality of T-sections whose series element is an inductance and whose shunt element is a capacitance. Each stage thus represents a given length of the acoustic system as determined by the number of stages utilized. Accordingly, it will be appreciated that the effective cross-sectional area of each section can be electrically adjusted by varying the impedance of the components.
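By way of illustration only, the correspondence between tube geometry and the electrical T-section can be sketched as follows. The sketch assumes the standard lossless acoustic-tube analogy, in which a section of length l and cross-sectional area A has series inductance ρl/A and shunt capacitance Al/(ρc²); this formula is not stated in the text above. The constants RHO and C_SOUND, the function names, and any example areas are hypothetical.

```python
import math

RHO = 1.2        # air density in kg/m^3 (approximate value, assumed)
C_SOUND = 350.0  # speed of sound in m/s (approximate value, assumed)

def section_lc(area_m2, length_m):
    """Return (L, C) for one T-section modeling a short lossless tube."""
    inductance = RHO * length_m / area_m2                     # acoustic mass
    capacitance = area_m2 * length_m / (RHO * C_SOUND ** 2)   # acoustic compliance
    return inductance, capacitance

def ladder(areas_m2, total_length_m):
    """Split a tube into equal-length sections, one (L, C) pair per area."""
    length = total_length_m / len(areas_m2)  # more stages -> shorter sections
    return [section_lc(area, length) for area in areas_m2]

def characteristic_impedance(inductance, capacitance):
    """sqrt(L/C); for the formulas above this equals RHO * C_SOUND / area."""
    return math.sqrt(inductance / capacitance)
```

Under these assumptions, widening a section lowers its inductance and raises its capacitance, so its characteristic impedance falls in inverse proportion to the area. This is the sense in which changing the component values changes the effective cross-sectional area of a section.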
However, while the merits of the theoretical electrical model have been recognized, practical implementation of the model has presented significant design problems. Specifically, attempts at designing an electrically controllable inductive/capacitive network have resulted in systems of extreme complexity. Moreover, due to the inherent imperfections associated with the many circuit approximations required, many of the desired characteristics of the theoretical electrical model are lost. Consequently, the quality of the speech produced thereby is compromised. Thus, despite its initial promise, a practical ADL vocal tract has yet to be produced.
The speech synthesizer of the present invention provides a novel approach to the implementation of an ADL vocal tract. Specifically, rather than attempting to provide directly electrically variable inductances and capacitances, the present invention utilizes time domain equivalents of these components. In the time domain, inductances and capacitances are 180° out of phase. Therefore, if the time domain reference is rotated 90°, it will be appreciated that an inductance can be represented by a resistance and a capacitance can be represented by a negative resistance. Although there is in reality no such thing as a negative resistance, there has recently been developed a circuit that simulates the characteristics of a negative resistance. This circuit is called a frequency dependent negative resistance or "FDNR". Thus, by utilizing ordinary resistors as the series components and FDNRs as the shunt components, the present invention provides an ADL modeled vocal tract comprised of components which can be practically tuned. In the preferred embodiment described hereinafter, the vocal tract comprises five "LC" sections with four tuning elements. Thus, only four control signals are required. However, as will be appreciated by those skilled in the art, the vocal tract can be readily modified to include additional stages and additional tuning elements if desired.
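The substitution of resistors and FDNRs for inductances and capacitances can be checked numerically. The following is a minimal sketch, not the circuit of the invention: it interprets the 90° rotation as dividing every impedance in the network by s = jω (an impedance-scaling reading that is an assumption here), so that an inductance sL becomes a resistance of value L, a capacitance 1/(sC) becomes an FDNR of impedance 1/(s²C), and a resistive termination R becomes a capacitance. The ladder topology, the function names, and the component values used to exercise it are hypothetical.

```python
import math

def ladder_gain(series_z, shunt_z, load_z, sections):
    """Voltage gain of identical series/shunt sections driving load_z."""
    z = load_z
    gain = 1.0
    for _ in range(sections):                # work backward from the load
        z_par = shunt_z * z / (shunt_z + z)  # shunt element in parallel with downstream
        gain *= z_par / (series_z + z_par)   # voltage divider for this section
        z = series_z + z_par                 # input impedance seen by the previous section
    return gain

def lc_gain(freq_hz, L, C, R, sections=5):
    """Gain of the LC ladder terminated in a resistance R."""
    s = 1j * 2 * math.pi * freq_hz
    return ladder_gain(s * L, 1 / (s * C), R, sections)

def fdnr_gain(freq_hz, L, C, R, sections=5):
    """Gain of the same ladder with every impedance divided by s."""
    s = 1j * 2 * math.pi * freq_hz
    return ladder_gain(L, 1 / (s * s * C), R / s, sections)

def fdnr_impedance(freq_hz, D):
    """Impedance of an FDNR, 1/(s^2 D): real and negative at every frequency."""
    s = 1j * 2 * math.pi * freq_hz
    return 1 / (s * s * D)
```

Because a voltage gain is a ratio of impedances, scaling every impedance by the same factor leaves it unchanged, so lc_gain and fdnr_gain agree at every frequency. The shunt element in the scaled network has a purely real, negative impedance whose magnitude falls with the square of frequency, which is the behavior the name "frequency dependent negative resistance" describes.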
Except for the vocal oscillator circuit, the balance of the present speech synthesizer is comprised entirely of digital circuitry. Thus, unlike prior art analog speech synthesizers, the present system is remarkably small in size and exceptionally inexpensive to manufacture. The speech synthesizer of the present invention is driven by a 12-bit digital input command word. Six of the bits in the input command word identify the particular phoneme to be produced, two of the bits establish the inflection level, and the remaining four input bits determine the speech rate of the audio output. The six phoneme select bits are provided to a read-only-memory (ROM) circuit that is adapted to produce a plurality of parameter control signals which electronically define the particular phoneme identified. The control signals produced can be divided into three groups: the reflection coefficient parameters, the excitation parameters, and the timing parameters. The timing parameters, along with the four speech rate input bits are provided to a timing network which controls individual phoneme timing, transition timing, and overall speech rate.
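The division of the 12-bit input command word into its three fields can be sketched as follows. The text above gives only the field widths (six, two, and four bits); the ordering of the fields within the word, the function name, and the dictionary keys are assumptions made for illustration.

```python
def decode_command(word):
    """Split a 12-bit command word into its three fields.

    Assumed layout (hypothetical): bits 11-6 phoneme select,
    bits 5-4 inflection level, bits 3-0 speech rate.
    """
    if not 0 <= word < (1 << 12):
        raise ValueError("command word must fit in 12 bits")
    return {
        "phoneme": (word >> 6) & 0x3F,    # six bits: up to 64 phonemes
        "inflection": (word >> 4) & 0x3,  # two bits: four inflection levels
        "rate": word & 0xF,               # four bits: sixteen speech rates
    }
```

Under this assumed layout, only the six phoneme bits would address the ROM, while the four rate bits would go to the timing network together with the timing parameters read from the ROM.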
The reflection coefficient parameters, which electronically tune the vocal tract, and the excitation parameters, which control the injection of voiced and fricative excitation energy into the vocal tract as well as control the spectral shape of the speech output waveform, are provided through novel digital transition circuitry which serves to smooth the abrupt variations that occur in the values of the control signals from phoneme-to-phoneme. The transition functions in the preferred embodiment are generated by a pair of random access memory (RAM) units under the control of the timing network. More particularly, the control signal parameters from the input ROMs are generated over a predetermined time period referred to as a time "frame". Each time frame is then divided by the timing network into four binary weighted bit intervals, each comprising a predefined number of time slots. For each of the reflection coefficient and excitation control signal parameters, there is dedicated in the RAM units the appropriate number of memory address locations corresponding to the total number of time slots in the four bit intervals of the time frame. When the value of a control signal parameter changes, the new value is "written" substantially simultaneously into four memory locations at a time, corresponding to one time slot in each bit interval, at a rate determined by the timing network. Accordingly, it will be seen that when a new control signal is produced by the ROM input units indicating the beginning of the next phoneme, the appropriate address locations in the RAM transition units are gradually updated to the new value. In this manner, the control signal value produced at the output of the RAM units also changes gradually from its previous value to the new value. Thus, it will be appreciated that the present speech synthesizer accomplishes the smooth dynamic variations between phonemes that characterize human speech through the exclusive use of digital circuitry.
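The gradual updating of the transition memory can be modeled in software as follows. This is a simplified sketch under several assumptions not stated in the text above: each control parameter is taken to be a 4-bit value, each RAM location is taken to hold one bit, every bit interval is taken to contain the same number of time slots, and "binary weighted" is taken to mean that bit interval k contributes to the averaged output in proportion to 2^k. All function names are hypothetical.

```python
def make_ram(value, slots, bits=4):
    """One row per bit interval, one entry per time slot, all holding `value`."""
    return [[(value >> k) & 1] * slots for k in range(bits)]

def write_slot(ram, value, slot):
    """Write the new value into one time slot of every bit interval at once."""
    for k, row in enumerate(ram):
        row[slot] = (value >> k) & 1

def frame_average(ram):
    """Average output over one frame, weighting bit interval k by 2**k."""
    slots = len(ram[0])
    return sum((1 << k) * sum(row) for k, row in enumerate(ram)) / slots

def transition(old, new, slots=8):
    """Frame-averaged output as the RAM is updated one time slot per step."""
    ram = make_ram(old, slots)
    outputs = [frame_average(ram)]
    for slot in range(slots):
        write_slot(ram, new, slot)  # four locations change per step
        outputs.append(frame_average(ram))
    return outputs
```

In this model, each write replaces one time slot of the old value with the new one, so the frame-averaged output moves from the old value to the new value in equal steps. The rate at which write_slot is called plays the part of the transition rate set by the timing network.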
Additional objects and advantages of the present invention will become apparent from a reading of the following detailed description of the preferred embodiments which makes reference to the following set of drawings in which: