The present invention relates to a sound synthesizer, and more particularly to a sound synthesizer employing a compact information processor such as a microcomputer or the like. Throughout this specification and the appended claims, the term "sound" is defined as consisting of an assembly of phonemes; it includes sounds in the ordinary sense, such as musical sounds and imitation sounds, as well as imitations of animal sounds as pronounced by human beings.
As an apparatus for producing a speech sound (especially the human speech) by means of an electric circuit, the Formant Vocoder has been known. The term "formant" means a concentration of energy found at a specific frequency band in a sound signal. It is believed that this formant is determined by the resonant characteristics of the vocal tract. The speech signal is analyzed into seven kinds of information, such as several formant frequencies (for example, the first through third formants), their amplitudes, etc. When a resonance circuit is excited on the basis of this information, a spectrum envelope approximating the speech signal can be reproduced. The Formant Vocoder is such a type of speech reproducer. However, at the current state of the art, it is difficult to obtain a satisfactory speech from this type of vocoder. Therefore, two alternatives have been proposed: a speech synthesizer employing the Linear Predictive Coding system (hereinafter abbreviated as LPC), which is based on the vocoder, and a speech synthesizer making use of speech segments in the synthesis of monosyllables.
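As an illustration of the resonance-circuit principle just described, the following Python sketch models a single formant as a two-pole digital resonator. The function names and the example formant frequency, bandwidth, and unity-gain-at-DC normalization are illustrative assumptions, not values taken from this specification.

```python
import math

def resonator_coeffs(f_formant, bandwidth, fs):
    """Two-pole resonator approximating one formant (hypothetical parameters)."""
    r = math.exp(-math.pi * bandwidth / fs)   # pole radius set by the bandwidth
    theta = 2.0 * math.pi * f_formant / fs    # pole angle set by the centre frequency
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    b0 = 1.0 - a1 - a2                        # one common normalization: unity gain at DC
    return b0, a1, a2

def resonate(x, b0, a1, a2):
    """Recursive filter y[n] = b0*x[n] + a1*y[n-1] + a2*y[n-2]."""
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = b0 * s + a1 * y1 + a2 * y2
        y2, y1 = y1, y
        out.append(y)
    return out
```

Exciting such a resonator with a pulse train, one resonator per formant, is the essence of the formant vocoder's reproduction scheme; the ringing of each resonator decays at a rate fixed by the formant bandwidth.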
The former proposed speech synthesizer utilizes the speech band compression technique (an information compression technique). Briefly speaking, it is a system for predicting, from a speech signal at a preceding moment, the speech signal at the next succeeding moment. In general, a speech sound is classified into a voiced sound and an unvoiced sound. As driving signals, a white noise signal and a periodic impulse signal are used: in the case of the voiced sound, the periodic impulse signal serves as the driving signal, while in the case of the unvoiced sound, only the white noise signal is used. These driving signals are amplified and then input to a lattice digital filter, which is operated in each sampling period to synthesize a desired speech signal. The filter coefficients are renewed each time quantized driving information is read out of a memory, once every frame (about 20 ms). Besides the driving signals, the information necessary for speech synthesis, such as pitch information, amplitude information, etc., is stored in the memory. The amount of information contained in one frame depends upon the number of connected filter stages. If 10 stages are present, an information amount of about 48 bits is necessitated. In some frames, a lesser amount of information will suffice. Generally, however, if the period of the frame is assumed to be 20 ms, then for synthesizing a speech signal of only one second, about 2,400 bits of information are necessitated. Accordingly, even if a memory having a density of 64K bits per chip is employed, a speech signal can be synthesized for only about 30 seconds. This serves as an extremely great bar against miniaturization of a speech synthesizer.
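The lattice digital filter described above can be sketched in Python as a minimal all-pole lattice synthesis filter, assuming a set of reflection coefficients k (renewed once per frame in an actual LPC synthesizer) and a precomputed excitation signal. The function name and the ten-stage example are hypothetical.

```python
def lattice_synthesize(excitation, k):
    """All-pole lattice synthesis filter driven by an excitation signal.

    excitation: driving signal samples (impulse train for voiced sound,
                white noise for unvoiced sound).
    k:          reflection coefficients, one per lattice stage; |k| < 1
                guarantees a stable filter.
    """
    M = len(k)
    b = [0.0] * (M + 1)        # backward prediction residuals (filter state)
    out = []
    for e in excitation:
        f = e                  # forward residual enters at the top stage
        for m in range(M, 0, -1):          # run down the lattice stages
            f = f - k[m - 1] * b[m - 1]
            b[m] = b[m - 1] + k[m - 1] * f
        b[0] = f
        out.append(f)
    return out
```

With all reflection coefficients zero the filter is transparent, which makes the 19 multiply-add repetitions per sample mentioned below easy to see: each of the M stages costs roughly two multiply-adds per output sample.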
On the other hand, the amount of arithmetic operation necessitated for the speech synthesis is enormous. For example, the arithmetic circuit requires a multiplier, and since the area occupied by a multiplier is very large, it is not favorable for an integrated circuit arrangement. Moreover, even if a pipeline-type multiplier is employed, 19 repetitions of multiplication and addition/subtraction are required, and these arithmetic operations must be carried out in each sampling cycle. In addition, a delay circuit for preventing overlap of arithmetic operations is also necessary. In this way, a speech synthesizer according to the LPC system is composed of a complex circuit and necessitates hardware having a large area. With regard to computing speed also, if a sampling frequency of 10 kHz is employed, the 19 repetitions of arithmetic operations must be executed within 100 μs. Accordingly, a high-speed logical operation capability comparable to that of a minicomputer is required. In other words, the cost of the synthesizer becomes so high that it is hardly applicable to personal instruments.
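The storage and computing-speed figures quoted in the two preceding paragraphs follow directly from the stated assumptions (20 ms frames of about 48 bits, a 10 kHz sampling frequency, and 19 multiply-add repetitions per sample), as this short Python check shows:

```python
# Assumptions taken from the text; none of these values is new.
fs = 10_000                            # sampling frequency in Hz
sample_budget_us = 1e6 / fs            # time available per sample: 100 microseconds
ops_per_second = fs * 19               # multiply-add repetitions required per second
bits_per_second = 48 * (1000 // 20)    # 48 bits per frame, 50 frames per second
seconds_on_chip = 64 * 1024 / bits_per_second  # playing time from one 64K-bit chip
```

The last figure, roughly 27 seconds, is the "only about 30 seconds" of synthesizable speech per 64K-bit memory chip cited above.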
Still further, in order to improve the quality of the synthesized sound, abrupt changes of parameters must be avoided. Accordingly, an interpolation circuit for interpolating intermediate values between given parameters is also necessary. Furthermore, each item of stored information is available only as a parameter for synthesizing one speech sound. Hence, there occurs the inconvenience that the number of synthesizable speech sounds is limited by the memory capacity. Especially in the case of synthesizing, in addition to human speech, musical sounds such as the sounds of pianos, flutes, violins, etc. and imitation sounds such as the engine sounds of automobiles, aircraft, etc., a memory having a large capacity is required.
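The interpolation circuit called for above can be sketched as simple linear interpolation between successive frames of filter parameters; the function below is a hypothetical illustration of the principle, not the circuit of this specification.

```python
def interpolate_params(frame_a, frame_b, steps):
    """Generate intermediate parameter sets between two frames.

    frame_a, frame_b: lists of synthesis parameters (e.g. filter
                      coefficients) for consecutive frames.
    steps:            number of intermediate sets to emit; the last
                      set equals frame_b, so parameters glide rather
                      than jump at the frame boundary.
    """
    out = []
    for i in range(1, steps + 1):
        t = i / steps
        out.append([a + t * (b - a) for a, b in zip(frame_a, frame_b)])
    return out
```

Feeding the filter one interpolated set per sampling period (or per small group of periods) removes the audible artifacts that an abrupt once-per-frame parameter change would cause.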
On the other hand, in the latter proposed speech synthesizer making use of speech segments, the waveform of a speech signal is divided into parts of short duration (8 ms or 4 ms). Each divided waveform part is called a "speech segment". The speech segment information is edited within a memory. The speech synthesizer reads the necessary speech segment information (representative segments) out of the memory in accordance with the speech signal to be synthesized. Addressing for the read-out operation is executed by key input or by programming. In order for the synthesizer to synthesize a speech signal, time information, amplitude information, sequence information, etc. are required in addition to the representative segments. The synthesizer synthesizes a speech signal on the basis of this information. However, the initial digital value and the final digital value possessed by a selected representative segment generally differ from segment to segment. In other words, the final digital value of the first representative segment and the initial digital value of the subsequent second representative segment are generally not identical. Accordingly, a speech signal having a continuous waveform variation cannot be obtained, and the synthesized speech signal assumes a waveform having a discontinuity at every segment boundary. Consequently, the waveform becomes a speech waveform having a large distortion as compared to the natural speech waveform, and hence a speech signal of good quality cannot be obtained by this prior art system.
Also, besides the above-mentioned methods, various methods have been known in which speech digital information is obtained by analyzing a speech signal with the aid of the delta modulation system (DM), the pulse code modulation system (PCM), the adaptive delta modulation system (ADM), the differential PCM system (DPCM), the adaptive predictive coding system (APC), etc. However, no synthesizer has yet been proposed which is well suited for synthesizing a speech signal on the basis of such analyzed information. As a matter of course, even with the PARCOR system, which makes use of partial autocorrelation coefficients, miniaturization and cost reduction of a speech synthesizer cannot be expected, because the PARCOR system also necessitates a complex filter circuit as well as a large amount of information.