1. Field of the Invention
The present invention relates to a speech synthesizer for synthesizing a speech waveform by superposing impulse response waveforms.
2. Related Background Art
Various types of synthesizing methods for use with a speech synthesizer have been proposed heretofore. Compared with a speech synthesizer using a recording/editing method (PCM, ADPCM, or the like), a speech synthesizer using a parameter editing method, which uses speech feature parameters derived in advance from speech, or a rule synthesizing method is very effective in information compression, so that a wide variety of words and sentences can be synthesized. However, such a synthesizer produces unnatural synthesized speech that is hard to understand.
According to the parameter editing method, various types of feature parameters called PARCOR, LSP, or cepstrum are used as the coefficients of a synthesizer filter which superposes impulse response waveforms to produce a synthesized speech. Generally, an impulse train signal is inputted to the synthesizer filter for synthesizing voiced speech, whereas a white noise signal or M-series signal is inputted for synthesizing unvoiced speech. A synthesizer filter having a minimum phase characteristic is often used.
Female speech, which has heretofore been considered difficult to synthesize well, can be synthesized fairly naturally and beautifully by superposing zero phase impulse response waveforms in the power spectrum envelope (PSE) speech analysis/synthesis method, which aims at high quality speech synthesis on the basis of the same parameter editing method.
This method of superposing zero phase impulse response waveforms is called an "OWA (Over-Wrap-Adding)" method. This OWA method will be briefly described.
Voiced speech, which mainly constitutes vowel sounds, is produced when air expired from the lungs is pulsed intermittently at a constant period by vibration of the vocal cords and then resonates at the tongue, lips, chin, and so on. The period of intermittent vibration of the vocal cords determines the pitch of the sound. A change in the vibration period of the vocal cords with time gives rise to accent and intonation. In contrast, unvoiced speech, which constitutes particular consonants, is produced when air expired from the lungs forms a turbulent flow having an indefinite period as it passes through a narrow space (called an articulatory point) defined by the tongue tip, teeth, lips, and so on within the articulatory organs.
A speech synthesizer generally synthesizes a speech waveform by using synthesizing processes analogous to those for human speech as described above.
FIG. 6 shows an example of the typical structure of a conventional speech synthesizer.
In FIG. 6, a speech signal, generated by an oscillator 1 corresponding to the human vocal cords and an articulatory point, is shaped and modulated by a modulation circuit 2 corresponding to the human articulatory organs, and is converted into speech by a speaker 3 and outputted therefrom.
The oscillator 1 is constructed of a noise generator 1-1 for generating high frequency white noise, a pulse generator 1-2 for generating pulses of a predetermined period which are used for voiced speech, a switch 1-3, and a variable amplifier (multiplier) 1-4. In response to an instruction supplied for each speech section from an external master device (system controller), the switch 1-3 selects either the noise generator 1-1 or the pulse generator 1-2.
In producing voiced speech, a voiced speech indication V causes the switch 1-3 to select the pulse generator 1-2 which generates pulse signals of a predetermined period P determined by an external instruction. The pulse signals are sent via the switch 1-3 to the variable amplifier 1-4 at which they are amplified at gains defined by partial autocorrelation coefficients. Thereafter, they are sent to the modulation circuit 2 using a vocal tract articulatory equivalent filter at which they are shaped and modulated into a voiced synthesized speech waveform and outputted from the speaker 3.
In a similar manner, in producing unvoiced speech, an unvoiced speech indication U causes the switch 1-3 to select the noise generator 1-1 which generates noise signals. The noise signals are sent via the switch 1-3 to the variable amplifier 1-4 at which they are amplified. The amplified noise signals are shaped and modulated by the modulation circuit 2 into an unvoiced synthesized waveform and outputted from the speaker 3.
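The source selection of FIG. 6 described above may be sketched as follows. This is a minimal illustration only, not the circuit itself: the function name `excitation`, the use of sample counts for the period, and the uniform noise distribution are assumptions made for the example, and the modulation circuit 2 is not modeled.

```python
import random


def excitation(voiced, period, gain, length, seed=0):
    """Return `length` source samples scaled by `gain` (hypothetical helper)."""
    rng = random.Random(seed)
    if voiced:
        # Pulse generator 1-2: one unit pulse every `period` samples.
        source = [1.0 if n % period == 0 else 0.0 for n in range(length)]
    else:
        # Noise generator 1-1: zero-mean white noise (uniform, by assumption).
        source = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    # Variable amplifier (multiplier) 1-4.
    return [gain * s for s in source]


# Switch 1-3: the V/U indication selects which generator feeds the amplifier.
voiced_source = excitation(True, period=4, gain=0.5, length=12)
unvoiced_source = excitation(False, period=4, gain=0.5, length=12)
```

In this sketch the period is ignored for unvoiced sections, mirroring the fact that the noise generator has no pitch.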
Various control data supplied to the speech synthesizer circuit, such as the period P and the amplitude A of a signal, have heretofore been determined by analyzing a speech waveform of an actual human voice by means of one of the above-described speech analysis methods and fitting the speech waveform to a model determined by the analysis results.
With the above-described OWA method, synthesized speech has been produced as follows. For a voiced speech section, which has a pitch, a power spectrum envelope (PSE) obtained in accordance with a power spectrum envelope parameter is subjected to an inverse Fourier transform to generate an impulse response waveform, and the impulse response waveforms are superposed at the modulation circuit 2 at time intervals equal to the pitch period. For an unvoiced speech section, which has no pitch, an impulse response waveform is obtained in accordance with the power spectrum envelope of the section. The impulse response waveforms are multiplied by random values having a zero mean, so that their amplitudes change randomly, and are then superposed at an equal time interval (about 0.17 msec). The resulting unvoiced synthesized waveform is multiplied by a coefficient for each section of constant length, so that the power of the synthesized waveform becomes substantially equal to that of the original speech waveform and a characteristic similar to random noise is realized.
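The superposition described above may be sketched as follows. This is a simplified illustration under stated assumptions, not the conventional implementation: all function names are hypothetical, the envelope is given as power values at DFT bins 0 to N/2, intervals are expressed in samples, the random values are drawn from a Gaussian distribution, and the per-section power normalization is omitted.

```python
import math
import random


def zero_phase_response(power_env, half_len):
    """Inverse DFT of a real, symmetric amplitude spectrum.

    power_env[k] is the power at bin k for k = 0..N/2 (N even, N >= 2).
    Returns samples h[-half_len..half_len], symmetric about the center.
    """
    half = len(power_env) - 1
    n_fft = 2 * half
    amp = [math.sqrt(p) for p in power_env]
    h = []
    for n in range(-half_len, half_len + 1):
        acc = amp[0] + amp[half] * math.cos(math.pi * n)
        for k in range(1, half):
            acc += 2.0 * amp[k] * math.cos(2.0 * math.pi * k * n / n_fft)
        h.append(acc / n_fft)
    return h


def superpose(h, positions, gains, length):
    """Add a scaled copy of h centered at each position."""
    out = [0.0] * length
    center = len(h) // 2
    for pos, gain in zip(positions, gains):
        for i, value in enumerate(h):
            n = pos + i - center
            if 0 <= n < length:
                out[n] += gain * value
    return out


def owa_voiced(h, pitch_period, length):
    # Voiced section: superpose at intervals of the pitch period.
    positions = list(range(0, length, pitch_period))
    return superpose(h, positions, [1.0] * len(positions), length)


def owa_unvoiced(h, interval, length, seed=0):
    # Unvoiced section: constant interval, zero-mean random amplitudes.
    rng = random.Random(seed)
    positions = list(range(0, length, interval))
    gains = [rng.gauss(0.0, 1.0) for _ in positions]
    return superpose(h, positions, gains, length)


power_env = [1.0, 0.8, 0.5, 0.2, 0.1]  # illustrative values only
h = zero_phase_response(power_env, half_len=3)
voiced_wave = owa_voiced(h, pitch_period=5, length=20)
unvoiced_wave = owa_unvoiced(h, interval=2, length=20)
```

Because the spectrum is taken as real and symmetric, the resulting impulse response is symmetric about its center, which is the zero phase property referred to above.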
With the conventional OWA method, particularly in synthesizing speech at an unvoiced speech section, the amplitudes of the superposed impulse response waveforms are changed randomly. However, the period for superposing the impulse response waveforms is constant, so that the synthesized speech sounds somewhat like a buzzer. A further disadvantage is that unvoiced speech synthesized by the conventional OWA method does not, strictly speaking, have a random noise characteristic analogous to that of the human voice. This is confirmed by a spectrum analysis of the synthesized waveform at a finer frequency resolution, which indicates the presence of spectrum peaks at integer multiples of the superposition frequency (the inverse of the superposition period).
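One way to see how structure tied to the superposition frequency can arise is to examine the excitation alone. The following sketch does not reproduce the analysis referred to above; it only checks, for assumed illustrative values (an interval of 4 samples and 8 superposed responses), that the magnitude spectrum of a constant-interval, random-amplitude impulse train repeats at every multiple of the superposition frequency.

```python
import cmath
import random


def dft_magnitude(x):
    """Magnitude of the discrete Fourier transform of x (direct O(N^2) sum)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)))
        for f in range(n)
    ]


rng = random.Random(1)
interval = 4  # samples between superpositions (assumed value)
count = 8     # number of superposed responses (assumed value)

# Impulse train with zero-mean random amplitudes at a constant interval.
train = [0.0] * (interval * count)
for k in range(count):
    train[k * interval] = rng.gauss(0.0, 1.0)

mag = dft_magnitude(train)

# The superposition frequency corresponds to len(train) / interval = count bins.
repeats = all(
    abs(mag[f] - mag[f + count]) < 1e-9 for f in range(len(mag) - count)
)
```

Here `repeats` evaluates to true regardless of the random amplitudes, since the repetition follows from the constant interval alone.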