1. Field of the Invention
This invention pertains to electronic speech-producing systems and more particularly to systems that receive parameter encoding information, such as allophonic code, which is decoded, stressed and synthesized in an LPC speech synthesizer to provide an unlimited vocabulary.
2. Description of the Prior Art
The prior art techniques fall generally into two categories: waveform encoding and parameter encoding. Waveform encoding includes uncompressed digital data, or pulse code modulation (PCM), delta modulation (DM), continuous variable slope delta modulation (CVSD) and a technique developed by Mozer (see U.S. Pat. No. 4,214,125). Parameter encoding includes the channel vocoder, formant synthesis, and linear predictive coding (LPC).
PCM involves converting a speech signal into digital information using an A/D converter. The digital information is stored in memory and played back through a D/A converter followed by a low-pass filter, an amplifier and a speaker. The advantage of this approach is its simplicity. Both A/D converters and D/A converters are available and relatively inexpensive. The problem is the amount of data storage required. Assuming a maximum frequency of 4 kHz, which implies a sampling rate of at least 8,000 samples per second, and further assuming that each speech sample is represented by 8 to 12 bits, one second of speech requires 64K to 96K bits of memory.
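The storage figures above can be checked with a short calculation. The following is a minimal sketch only; it assumes sampling at exactly twice the maximum frequency (the Nyquist rate), which the text implies but does not state, and the function name is hypothetical.

```python
def pcm_bits_per_second(max_freq_hz, bits_per_sample):
    """Bits needed to store one second of uncompressed PCM speech.

    Assumption: the signal is sampled at exactly twice its maximum
    frequency (the Nyquist rate).
    """
    sample_rate_hz = 2 * max_freq_hz
    return sample_rate_hz * bits_per_sample


# A 4 kHz maximum frequency gives 8,000 samples per second.
low = pcm_bits_per_second(4000, 8)    # 8-bit samples
high = pcm_bits_per_second(4000, 12)  # 12-bit samples
print(low, high)
```

With 8-bit and 12-bit samples the result is 64,000 and 96,000 bits per second respectively, matching the 64K to 96K range given above.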
DM is a technique for compressing the speech data by assuming that the analog speech signal is either increasing or decreasing in amplitude. The speech signal is sampled at a rate of approximately 64,000 times per second. Each sample is then compared to the estimated value of the previous sample. If the current sample is greater than that estimate, the slope of the signal generated by the model is positive; if not, the slope is negative. The magnitude of the slope is chosen such that it is at least as large as the maximum expected slope of the signal.
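The comparison step described above may be illustrated by the following minimal sketch of a one-bit delta modulator and its matching demodulator. The function names, the fixed `step` parameter and the initial estimate of zero are assumptions made for illustration and are not taken from any particular prior art device.

```python
def delta_modulate(samples, step):
    """Encode samples as one bit each: 1 = slope up, 0 = slope down."""
    estimate = 0.0  # assumed starting estimate
    bits = []
    for x in samples:
        # Compare the current sample with the running estimate.
        bit = 1 if x > estimate else 0
        # Move the estimate by one fixed step in the chosen direction.
        estimate += step if bit else -step
        bits.append(bit)
    return bits


def delta_demodulate(bits, step):
    """Rebuild the staircase approximation from the bit stream."""
    estimate = 0.0
    out = []
    for bit in bits:
        estimate += step if bit else -step
        out.append(estimate)
    return out


bits = delta_modulate([1.0, 2.0, 3.0, 2.0, 1.0, 0.0], step=1.0)
print(bits)                         # [1, 1, 1, 0, 0, 0]
print(delta_demodulate(bits, 1.0))  # [1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
```

Because only one bit is kept per sample, the data rate equals the sampling rate, which is consistent with a rate of approximately 64,000 samples per second yielding about 64K bits per second.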
CVSD is an extension of DM in which the slope of the generated signal is allowed to vary. The data rate in DM is typically on the order of 64K bits per second, and in CVSD it is approximately 16K to 32K bits per second.
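One way a variable slope can be realized is sketched below. The adaptation rule shown (double the step when the last three bits agree, otherwise halve it, within fixed limits) is an assumption chosen for illustration; the text above says only that the slope is allowed to vary, and actual CVSD devices may use a different rule. All names and default values are hypothetical.

```python
def _adapt_step(history, step, min_step, max_step, run):
    """Grow the step after `run` identical bits; otherwise shrink it."""
    recent = history[-run:]
    if len(recent) == run and len(set(recent)) == 1:
        return min(step * 2.0, max_step)
    return max(step / 2.0, min_step)


def cvsd_encode(samples, min_step=1.0, max_step=8.0, run=3):
    """One bit per sample, with a step size that follows the bit history."""
    estimate, step = 0.0, min_step
    bits = []
    for x in samples:
        bit = 1 if x > estimate else 0
        bits.append(bit)
        step = _adapt_step(bits, step, min_step, max_step, run)
        estimate += step if bit else -step
    return bits


def cvsd_decode(bits, min_step=1.0, max_step=8.0, run=3):
    """Mirror the encoder: the step is derived from the bits alone."""
    estimate, step = 0.0, min_step
    out = []
    for i, bit in enumerate(bits):
        step = _adapt_step(bits[: i + 1], step, min_step, max_step, run)
        estimate += step if bit else -step
        out.append(estimate)
    return out


bits = cvsd_encode([10.0] * 6)
print(bits)               # [1, 1, 1, 1, 1, 0]
print(cvsd_decode(bits))  # [1.0, 2.0, 4.0, 8.0, 16.0, 12.0]
```

Because the step size is recomputed from the transmitted bits, no side information is needed, and the larger steps allow a steep signal to be followed at a lower sampling rate than fixed-step DM would require.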
The Mozer technique takes advantage of the periodicity of the voiced speech waveform and of the perceptual insensitivity to the phase information of the speech signal. Compressing the information in the speech waveform requires phase-angle adjustment to obtain a time-symmetrical pitch waveform, which makes one-half of the waveform redundant; half-period zeroing to eliminate relatively low-power segments of the waveform; digital compression using DM; and repetition of pitch periods to eliminate redundant (or similar) speech segments. The data rate of this technique is approximately 2.4K bits per second.
In parameter encoding schemes, speech characteristics other than the original speech waveform are used in the analysis and synthesis. These characteristics are used to control the synthesis model to create an output speech signal which is similar to the original. The commonly used techniques attempt to describe the spectral response, the spectral peaks or the vocal tract.
The channel vocoder has a bank of band-pass filters which are designed so that the frequency range of the speech signal can be divided into relatively narrow frequency ranges. After the signal has been divided into the narrow bands, the energy is detected and stored for each band. The production of the speech signal is accomplished by a bank of narrow band frequency generators, which correspond to the frequencies of the band-pass filters, controlled by pitch information extracted from the original speech signal. The signal amplitude of each of the frequency generators is determined by the energy of the original speech signal detected during the analysis. The data rate of the channel vocoder is typically on the order of 2.4K bits per second.
In formant synthesis, the short-time frequency spectrum is analyzed to the extent that the spectral shape is recreated using the formant center frequencies, their bandwidths and the pitch period as the inputs. The formants are the peaks in a frequency spectrum envelope. The data rate for formant synthesis is typically 500 bits per second.
Linear predictive coding (LPC) can best be described as a mathematical model of the human vocal tract. The parameters used to control the model represent the amount of energy delivered by the lungs (amplitude), the vibration of the vocal cords (pitch period and the voiced/unvoiced decision), and the shape of the vocal tract (reflection coefficients). In the prior art, LPC synthesis has been accomplished through computer simulation techniques. More recently, LPC synthesizers have been fabricated as semiconductor integrated circuit chips, such as that described and claimed in U.S. Pat. No. 4,209,836, entitled "Speech Synthesis Integrated Circuit Device" and assigned to the assignee of this invention.
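The roles of the LPC parameters named above may be illustrated by the following minimal sketch of an all-pole lattice filter driven by a simple excitation. It is not the circuit of U.S. Pat. No. 4,209,836. The function names, the sign convention of the lattice, the unit-impulse voiced excitation and the seeded plus-or-minus-one noise for unvoiced excitation are all assumptions made for illustration.

```python
import random


def make_excitation(n, pitch_period, voiced, seed=0):
    """Voiced: one impulse per pitch period. Unvoiced: +/-1 noise."""
    if voiced:
        return [1.0 if i % pitch_period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def lpc_lattice_synthesize(excitation, reflection, gain=1.0):
    """Pass the excitation through an all-pole lattice filter.

    reflection[i-1] is the coefficient k_i of stage i (assumed
    sign convention: f_{i-1} = f_i - k_i * b_{i-1}[n-1]).
    """
    order = len(reflection)
    b = [0.0] * (order + 1)  # b[i] holds the delayed backward signal
    out = []
    for e in excitation:
        f = gain * e  # amplitude scales the excitation
        for i in range(order, 0, -1):
            k = reflection[i - 1]
            f = f - k * b[i - 1]      # forward path
            b[i] = k * f + b[i - 1]   # backward path
        b[0] = f
        out.append(f)
    return out


# A single impulse through a one-stage lattice with k = 0.5.
print(lpc_lattice_synthesize([1.0, 0.0, 0.0], [0.5]))  # [1.0, -0.5, 0.25]
```

In this sketch the gain corresponds to the amplitude parameter, the excitation generator to the pitch period and the voiced/unvoiced decision, and the list of reflection coefficients to the shape of the vocal tract.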
This invention is a combination of a speech construction technique and a speech synthesis technique. The prior art set out above involves synthesis techniques.
With respect to speech construction techniques, the library of available component sounds includes phonemes, allophones, diphones, demisyllables, morphs and combinations of these sounds.
Speech construction techniques involving phonemes are flexible techniques in the prior art. In English, there are 16 vowel phonemes and 24 consonant phonemes making a total of 40. Theoretically, any word or phrase desired should be capable of being constructed from these phonemes. However, when each phoneme is actually pronounced there are many minor variations that may occur between sounds, which may in turn modify the pronunciation of the phoneme. This inaccuracy in representing sounds causes difficulty in understanding the resulting speech produced by the synthesis device.
Another prior art construction technique involves the use of diphones. A diphone is defined as the sound that extends from the middle of one phoneme to the middle of the next phoneme. It is chosen as a component sound to reduce smoothing requirements between adjacent phonemes. However, to encompass all of the coarticulation effects in English, a large inventory of diphones is usually required. The storage requirement is on the order of 250K bytes, with a computer required to handle the construction program.
Demisyllables have been used in the prior art as component sounds for speech construction. A syllable in any language may be divided into an initial demisyllable, a final demisyllable and possible phonetic affixes. The initial demisyllable consists of any initial consonants and the transition into the vowel. The final demisyllable consists of the vowel and any co-final consonants. The phonetic affixes consist of all syllable-final non-core consonants. The prior art system requires a library of 841 initial and final demisyllables and 5 phonetic affixes. The memory requirement is on the order of 50K bytes.
A morph is the smallest unit of sound that has a meaning. In a prior art system, for unrestricted English text, a dictionary of 12,000 morphs was used which required approximately 600K bytes of memory. The speech generated is intelligible and quite natural but the memory requirement is prohibitive.
An allophone is a subset of a phoneme, which is modified by the environment in which it occurs. For example, the aspirated /p/ in "push" and the unaspirated /p/ in "Spain" are different allophones of the phoneme /p/. Thus, allophones are more accurate in representing sounds than phonemes. According to the present invention, 127 allophones are stored in 3,000 bytes of memory. This storage requirement is much less than that of the aforementioned systems using diphones, demisyllables and morphs.