In an information service network that offers information such as stock market conditions, weather forecasts, guidance on various exhibitions and so on in the form of speech, it is desired that the different kinds of information be transmitted as digital signals to the terminal equipment of the network, where the digital signals are converted to speech by a speech synthesizer. In a teaching machine, a vending machine, an announcement apparatus for giving announcements at a meeting, and similar equipment in which only a small number of spoken words are used, a speech synthesizer can be used which employs a semiconductor memory instead of the magnetic recording tape which has been used to date.
In a digital speech synthesizer in which speech signals are converted to digital signals and stored, and the stored digital signals are combined in such a manner as to form speech, a continuous speech signal is chopped at constant time intervals and characteristic parameters of the speech are extracted from the chopped speech waveforms. These parameters are converted to digital signals and stored. The stored parameters are then combined in such a manner as to form speech. Thus, the speech unit of the synthesized sound can be reduced to a monosyllable, which is shorter than a word. This permits a large number of words to be formed without increasing the memory capacity. In addition, such a speech synthesizer has no mechanically movable portions and therefore does not suffer from trouble due to wear or the like, so that its maintenance is easy.
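As an illustration of chopping a continuous speech signal at constant time intervals, the following minimal Python sketch splits a sequence of samples into consecutive frames of equal duration. The function name `split_into_frames`, the non-overlapping layout, and the dropping of a trailing partial frame are assumptions made for this example, not details taken from the apparatus described here.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms):
    """Chop a sample sequence into consecutive, non-overlapping frames
    of constant duration (frame_ms milliseconds each)."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("a frame must contain at least one sample")
    # Assumption: a trailing partial frame is simply dropped.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Example: 50 ms of signal sampled at 8 kHz, chopped into 20 ms frames.
frames = split_into_frames(list(range(400)), 8000, 20)
```

Each frame produced this way would then be analyzed separately to extract its characteristic parameters.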
It is thus preferable that a speech synthesizer synthesizes speech on the basis of the characteristic parameters of speech for easy maintenance and small memory capacity.
Since the spectrum distribution of speech is changed by the natural movement of the voice modifying organs such as the tongue and the lips, the change of the spectrum distribution is gentle, and during a short period of time in the range of 10 to 30 ms it can be considered to be substantially stationary. Thus, the characteristics of the spectrum of speech can be derived precisely from the spectrum of speech during this stationary period of time, which enables the analysis of speech and the synthesis of speech on the basis of the extracted information. For analysis and synthesis of speech, it is necessary to derive the following from the speech spectrum during the short period of time in which the distribution of the speech spectrum can be considered to be stationary: a parameter indicative of the envelope of the spectrum, a parameter indicative of the amplitude of the speech signal, pitch information corresponding to the fundamental vibration frequency of the vocal cords, and discrimination information for indicating a voiced sound or an unvoiced sound.
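To make the amplitude, pitch, and voiced/unvoiced items above concrete, the following Python sketch estimates them for a single short frame using a plain autocorrelation search. It is only one common textbook approach, offered under stated assumptions: the function name `source_parameters`, the 50 to 400 Hz pitch search range, and the voicing threshold of 0.3 are hypothetical choices for this example and do not come from the apparatus described here. The spectral envelope parameter is not covered by this sketch.

```python
import math


def source_parameters(frame, sample_rate_hz,
                      min_hz=50.0, max_hz=400.0, voicing_threshold=0.3):
    """Estimate amplitude (RMS), voicing, and pitch for one frame.

    Assumptions (hypothetical, for illustration only): the pitch search
    range and the voicing threshold on the normalized autocorrelation.
    """
    n = len(frame)
    energy = sum(x * x for x in frame)
    amplitude = math.sqrt(energy / n)
    if energy == 0.0:
        # A silent frame carries no pitch and is treated as unvoiced.
        return {"amplitude": 0.0, "voiced": False, "pitch_hz": None}

    min_lag = max(1, int(sample_rate_hz / max_hz))
    max_lag = min(int(sample_rate_hz / min_hz), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr

    voiced = best_corr >= voicing_threshold
    pitch_hz = sample_rate_hz / best_lag if voiced else None
    return {"amplitude": amplitude, "voiced": voiced, "pitch_hz": pitch_hz}


# Example: a 20 ms frame of a 200 Hz tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 200 * i / 8000) for i in range(160)]
params = source_parameters(tone, 8000)
```

A strongly periodic frame yields a high normalized autocorrelation peak at the pitch period and is classified as voiced, while a frame without such a peak is classified as unvoiced.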
One of the speech analysis and synthesis systems for extracting the characteristic parameters from speech signals and for synthesizing speech signals on the basis of those parameters is the PARCOR method, which uses PARCOR coefficients (partial auto-correlation coefficients), a kind of linear prediction coefficient.
The apparatus utilizing this method produces PARCOR coefficients as the characteristic parameters of speech signals. That is, a speech signal during a short period of time in which the change of the frequency spectrum of the speech signal is gentle and can be considered stationary is sampled at a sampling frequency of, for example, 8 kHz. Of the successive samples, the samples at two nearby points are each predicted by the least squares method from the samples existing between the two points. The predicted values are compared with the actual sample values at the two points, and the correlation between the resulting differences (the PARCOR coefficient) is determined. In the speech synthesizer, a signal generator for generating white noise and a pulse is used as a sound source. The amplitude of the output signal from the sound source is controlled in accordance with the PARCOR coefficients set forth above so that the signal acquires the corresponding correlation. Thus, the frequency spectrum envelope is reproduced to enable the speech synthesis.
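The analysis and synthesis steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the circuit of the apparatus: the PARCOR (reflection) coefficients are obtained here with the Levinson-Durbin recursion on the frame autocorrelation, which is one standard way to compute them, and the synthesis uses an all-pole lattice filter driven by a pulse train or by white noise. All function names are hypothetical, and the sign convention (the first coefficient equals r[1]/r[0]) is one of several in use.

```python
import random


def autocorrelation(frame, max_lag):
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def parcor_coefficients(frame, order):
    """Levinson-Durbin recursion; returns k_1 .. k_order for one frame."""
    r = autocorrelation(frame, order)
    ks = [0.0] * order
    if r[0] <= 0.0:
        return ks                      # silent frame: all coefficients zero
    a = [0.0] * (order + 1)            # predictor coefficients a_1 .. a_m
    err = r[0]                         # prediction error energy
    for m in range(1, order + 1):
        k = (r[m] - sum(a[j] * r[m - j] for j in range(1, m))) / err
        ks[m - 1] = k
        prev = a[:]
        a[m] = k
        for j in range(1, m):
            a[j] = prev[j] - k * prev[m - j]
        err *= 1.0 - k * k
        if err <= 1e-12 * r[0]:
            break                      # perfectly predictable: stop early
    return ks


def make_excitation(n_samples, voiced, pitch_period, amplitude, rng=None):
    """Sound source: pulse train if voiced, white noise if unvoiced."""
    if voiced:
        return [amplitude if i % pitch_period == 0 else 0.0
                for i in range(n_samples)]
    rng = rng or random.Random(0)
    return [amplitude * rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


def lattice_synthesize(excitation, ks):
    """All-pole lattice filter that imposes the spectral envelope."""
    order = len(ks)
    b = [0.0] * (order + 1)            # b[m] holds the delayed backward error
    out = []
    for u in excitation:
        e = u
        for m in range(order, 0, -1):
            e = e + ks[m - 1] * b[m - 1]        # forward error of stage m-1
            b[m] = b[m - 1] - ks[m - 1] * e     # backward error of stage m
        b[0] = e
        out.append(e)
    return out


# Example: shape a pulse train (pitch period of 40 samples) with two coefficients.
speech = lattice_synthesize(make_excitation(160, True, 40, 1.0), [0.9, -0.5])
```

Feeding a synthesized signal back into `parcor_coefficients` recovers approximately the coefficients that shaped it, which is a convenient consistency check for the sign convention.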
This PARCOR type speech analysis and synthesis method can handle the PARCOR coefficients, the pitch information, the amplitude information and the discrimination information for discriminating between voiced sound and unvoiced sound as binary values. These kinds of information can therefore be stored in a semiconductor memory. In addition, the binary information can be transmitted through telephone channels.
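As a sketch of how such parameters might be held as binary values, the Python example below quantizes a coefficient to an 8-bit code and packs one frame of parameters into a single fixed-width word. The bit allocation (1 bit for the voiced/unvoiced flag, 7 bits for pitch, 8 bits for amplitude, and 8 bits for each of ten PARCOR coefficients) is purely hypothetical and is not specified in the text; it was chosen only so that the widths add up to a round frame size. All names are likewise assumptions.

```python
# Hypothetical bit allocation; not taken from the described apparatus.
FIELD_BITS = (("voiced", 1), ("pitch", 7), ("amplitude", 8)) + tuple(
    ("k%d" % i, 8) for i in range(1, 11))


def quantize_parcor(k, bits=8):
    """Map a coefficient in [-1, 1) to an unsigned integer code."""
    levels = 1 << bits
    code = int((k + 1.0) / 2.0 * levels)
    return min(max(code, 0), levels - 1)


def dequantize_parcor(code, bits=8):
    """Map a code back to the center of its quantization interval."""
    levels = 1 << bits
    return (code + 0.5) / levels * 2.0 - 1.0


def pack_frame(fields):
    """Pack the integer fields of one frame into a single binary word."""
    word = 0
    for name, width in FIELD_BITS:
        value = fields[name]
        if not 0 <= value < (1 << width):
            raise ValueError("field %s does not fit in %d bits" % (name, width))
        word = (word << width) | value
    return word


def unpack_frame(word):
    """Recover the integer fields from a packed word."""
    fields = {}
    for name, width in reversed(FIELD_BITS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


# Example: one voiced frame with ten quantized coefficients.
example = {"voiced": 1, "pitch": 40, "amplitude": 200}
example.update({"k%d" % i: quantize_parcor(0.1 * i - 0.5) for i in range(1, 11)})
packed = pack_frame(example)
```

Because every field is an unsigned integer of fixed width, the packed word can be written to a memory or sent over a channel and unpacked without any side information.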
For analysis of speech and extraction of characteristic parameters of speech, the speech is sampled during a short period of time as described above. This short period of time is generally called the analysis frame or simply the frame. From each frame are extracted PARCOR coefficients, pitch information, amplitude information, and discrimination information for discriminating between voiced and unvoiced sounds. The information per frame is transferred in 96 bits, for example. If one frame corresponds to 20 ms, this amount of information is 4800 bits/second, and if one frame corresponds to 10 ms, it is 9600 bits/second.
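The relation between the frame length and the amount of information per second is simple arithmetic, shown here as a short Python sketch. The function name `bit_rate_bps` is an assumption for this example; the figures of 96 bits per frame and frames of 20 and 10 milliseconds are those given above.

```python
def bit_rate_bps(bits_per_frame, frame_ms):
    """Bits per second when one frame of bits_per_frame bits is sent
    every frame_ms milliseconds."""
    return bits_per_frame * 1000 / frame_ms


rate_long_frame = bit_rate_bps(96, 20)   # 50 frames per second
rate_short_frame = bit_rate_bps(96, 10)  # 100 frames per second
```

Halving the frame length doubles the number of frames per second and therefore doubles the amount of information per second.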
The speech synthesizer for synthesizing speech on the basis of speech parameters obtained by analysis of the speech provides synthesized speech the quality of which is determined by the amount of information used in the synthesis. For example, the sound quality obtained when the speech parameters are transmitted at 9600 bits/sec. is clearly better than that obtained at 4800 bits/sec. However, while 9600 bits/sec. is the better choice when the digital telephone network has many idle channels, 4800 bits/sec. uses the channels more efficiently when there are few idle channels, although the sound quality is slightly deteriorated. When the speech information is stored in a semiconductor memory or the like, the amount of information to be selected depends on which of the sound quality and the memory capacity is given priority.
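The trade-off between sound quality and memory capacity can be illustrated with a short Python calculation of how many seconds of speech a memory can hold at each rate. The function name `storable_seconds` and the 65536-bit memory size are hypothetical values chosen only for this example.

```python
def storable_seconds(memory_bits, bit_rate_bps):
    """Duration of speech, in seconds, that fits in a memory of
    memory_bits bits at the given amount of information per second."""
    return memory_bits / bit_rate_bps


MEMORY_BITS = 65536  # hypothetical memory size, for illustration only

seconds_at_9600 = storable_seconds(MEMORY_BITS, 9600)
seconds_at_4800 = storable_seconds(MEMORY_BITS, 4800)
```

For a given memory, the lower rate stores twice the duration of speech, at the cost of the slightly deteriorated sound quality noted above.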
The conventional speech synthesizer can handle only a fixed amount of speech information per unit time and cannot handle a different amount of speech information. For example, a speech synthesizer designed for 9600 bits/sec. cannot process speech information at 4800 bits/sec. Therefore, the amount of information per unit time cannot be changed in accordance with the extent to which the telephone channels are crowded with calls. In addition, a speech synthesizer with a memory must be selected according to which of the sound quality and the memory capacity is given priority.
It is an object of the invention to provide a speech synthesizer capable of synthesizing speech on the basis of speech parameters of the type in which a plurality of different amounts of information per unit time are used.