Devices that synthesize speech must be capable of producing all the sounds of the language of interest. There are 34 such sounds or phonemes in the General American Dialect, exclusive of diphthongs, affricates and minor variants. Examples of two such phonemes, the sounds /n/ and /s/, are given in FIGS. 1 and 2, in which the amplitude of the speech signal is presented as a function of time. These two waveforms differ in that the phoneme /n/ has a quasi-periodic structure with a period of about 10 milliseconds, while the phoneme /s/ has no such structure. This is because the phoneme /n/ is produced through excitation of the vocal chords while /s/ is generated by passage of air through the larynx without excitation of the vocal chords. Thus, phonemes may be either voiced (i.e., produced by excitation of the vocal chords) or unvoiced (no such excitation) and the waveform of voiced phonemes is quasi-periodic. This period, called the pitch period, is such that male voices generally have a long pitch period (low pitch frequency) while females voices generally have higher pitch frequencies.
In addition to the above voiced-unvoiced distinction, phonemes may be classified in other ways, as summarized in Table 1, for the phonemes of the General American Dialect. The vowels, voiced fricatives, voiced stops, nasal consonants, glides, and semivowels are all voiced while the unvoiced fricatives and unvoiced stop consonants are not voiced. The fricatives are produced by an incoherent noise excitation of the vocal tract by causing turbulent air to flow past a point of constriction. To produce stop consonants a complete closure of the vocal tract is formed at some point and the lungs build up pressure which is suddenly released by opening the vocal tract.
TABLE 1 ______________________________________ Phonemes Of The General American Dialect ______________________________________ Vowels /i/ as in "three" /I/ as in "it" /e/ as in "hate" /ae/ as in "at" /a/ as in "father" / / as in "all" /o/ as in "obey" /v/ as in "foot" /u/ as in "boot" / / as in "up" / / as in "bird" Unvoiced Fricative Consonants /f/ as in "for" /.theta./ as in "thin" /s/ as in "see" /S/ as in "she" /h/ as in "he" Voiced Fricative Consonants /v/ as in "vote" /.delta./ as in "then" /z/ as in "zoo" / / as in "azure" Unvoiced Stop Consonants /p/ as in " play" /t/ as in "to" /k/ as in "key" Voiced Stop Consonants /b/ as in "be" /d/ as in "day" /g/ as in "go" Nasal Consonants /m/ as in "me" /n/ as in "no" /.eta./ as in "sing" Glides and Semivowels /w/ as in "we" /j/ as in "you" /r/ as in "read" /l/ as in "let" ______________________________________
Phonemes may be characterized in other ways than by plots of their time history as was done in FIGS. 1 and 2. For example, a segment of the time history may be Fourier analyzed to produce a power spectrum, that is, a plot of signal amplitude versus frequency. Such a power spectrum for the phoneme /u/ as in "to" is presented in FIG. 3. The meaning of such a graph is that the waveform produced by superimposing many sine waves of different frequencies, each of which has the amplitude denoted in FIG. 3 at its frequency, would have the temporal structure of the initial waveform.
From the power spectrum of FIG. 3 it is seen that certain frequencies or frequency bands have larger amplitudes than do others. The lowest such band, near a frequency of 100 Hertz, is associated with the pitch of the male voice that produced this sound. The higher frequency peaks, near 300, 1000, and 2300 Hertz, provide the information that distinguishes this phoneme from all others. These frequencies, called the first, second, and third format frequencies, are therefore the variables that change with the orientation of the lips, tongue, nasal passage, etc., to produce a string of connected phonemes representing human speech.
The previous state of the art in speech synthesis is well described in a recent book (Flanagan, Speech Analysis, Synthesis, and Preception, Springer-Verlag, 1972). Two of the major goals of this work have been the understanding of speech generation and recognition processes, and the development of synthesizers having extremely large vocabularies. Through this work it has been learned that the single most important requirement of an intelligible speech synthesizer is that it produce the proper formant frequencies of the phonemes being generated. Thus, current and recent synthesizers operate by generating the formant frequencies in the following way. Depending on the phoneme of interest, either voiced or unvoiced excitation is produced by electronic means. The voiced excitation is characterized by a power spectrum having a low frequency cutoff at the pitch frequency and a power that decreases with increasing frequency above the pitch frequency. Unvoiced excitation is characterized by a broad-band white noise spectrum. One or the other of these waveforms is then passed through a series of filters or other electronic circuitry that causes certain selected frequencies (the formant frequencies of interest) to be amplified. The resulting power spectrum of voiced phonemes is like that of FIG. 3 and, when played into a speaker, produces the audible representation of the phoneme of interest. Such devices are generally called vocoders, many varieties of which may be purchased commercially. Other vocoders are disclosed in U.S. Pat. Nos. 3,102,165 and 3,318,002.
In such devices the formant frequency information required to generate a string of phonemes in order to produce connected speech is generally stored in a full-sized computer that also controls the volume, the duration, voiced and unvoiced distinctions, etc. Thus, while existing vocoders are able to generate very large vocabularies, they require a full sized computer and are not capable of being miniaturized to dimensions less than 0.25 inches, as is the synthesizer described in the present invention.
One of the important results of speech research in connection with vocoders has been the realization that phonemes cannot generally be strung together like beads on a string to produce intelligible speech (Flanagan, 1972). This is because the speech producing organs (mouth, tongue, throat, etc.) change their configurations relatively slowly, in the time range of tens to hundreds of milliseconds, during the transition from one phoneme to the next. Thus, the formant frequencies of ordinary speech change continuously during transitions and synthetic speech that does not have this property is poor in intelligibility. Many techniques for blending one phoneme into another have been developed, examples of which are disclosed in recent U.S. Pat. Nos. 3,575,555 and 3,588,353. Computer controlled vocoders are able to excel in producing large vocabularies because of the quality of their control of such blending processes.