1. Field of the Invention
The present invention relates to a synthesized sound generating apparatus and method which are suitable for inputting and synthesizing voices and instrumental sounds and for outputting synthesized instrumental sounds or the like that carry characteristic information on the voices.
2. Prior Art
Vocoders, which analyze and synthesize voices, are commonly used with music synthesizers because they can onomatopoeically generate instrumental sounds, noise, and the like. The major known vocoders include formant vocoders, linear predictive analysis and synthesis systems (PARCOR analysis and synthesis), cepstrum vocoders (speech synthesis based on homomorphic filtering), and channel vocoders (the so-called Dudley vocoders).
The formant vocoder uses a terminal analog synthesizer to carry out sound synthesis based on parameters for vocal tract characteristics determined from the formants and anti-formants of a spectral envelope, that is, its poles and zeros. The terminal analog synthesizer is comprised of a plurality of resonance circuits and antiresonance circuits arranged in cascade connection for simulating the resonance/antiresonance characteristics of a vocal tract. The linear predictive analysis and synthesis system is an extension of the predictive encoding method, which is the most popular of the speech synthesis methods. The PARCOR analysis and synthesis system is an improved version of the linear predictive analysis and synthesis system. The cepstrum vocoder is a speech synthesis system that uses a logarithmic amplitude characteristic of a filter together with inverse Fourier transformation and deconvolution of a logarithmic spectrum of a sound source.
The channel vocoder uses bandpass filters 10-1 to 10-N for different bands to extract spectral envelope information on an input speech signal, that is, parameters for the vocal tract characteristics, as shown in FIG. 1, for example. On the other hand, a pulse train generator 21 and a noise generator 22 generate two kinds of sound source signals, which are amplitude-modulated using the spectral envelope parameters. This amplitude modulation is carried out by multipliers (modulators) 30-1 to 30-N. Modulated signals output from the multipliers (modulators) 30-1 to 30-N pass through bandpass filters 40-1 to 40-N and are then added together by an adder 50 whereby a synthesized speech signal is generated and output.
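The signal flow of FIG. 1 described above can be illustrated by the following minimal sketch. It is not taken from the cited art; the biquad band-pass design, the one-pole envelope smoother, the value of `q`, and all function names are assumptions chosen only for illustration.

```python
import math

def bandpass_coeffs(center_hz, q, sample_rate):
    """Biquad band-pass coefficients (0 dB peak gain), normalized by a0."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(signal, coeffs):
    """Run a direct-form I biquad over the signal."""
    (b0, b1, b2), (a1, a2) = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(signal, smoothing=0.01):
    """Rectify, then smooth with a one-pole low-pass (assumed detector)."""
    level = 0.0
    out = []
    for x in signal:
        level += smoothing * (abs(x) - level)
        out.append(level)
    return out

def channel_vocoder(speech, source, centers_hz, sample_rate, q=4.0):
    """Impose the band envelopes of `speech` on `source` and sum the bands."""
    n = min(len(speech), len(source))
    mixed = [0.0] * n
    for fc in centers_hz:
        coeffs = bandpass_coeffs(fc, q, sample_rate)
        env = envelope(biquad(speech[:n], coeffs))             # filters 10-1 to 10-N
        modulated = [e * s for e, s in zip(env, source[:n])]   # multipliers 30-1 to 30-N
        band = biquad(modulated, coeffs)                       # filters 40-1 to 40-N
        for i, v in enumerate(band):                           # adder 50
            mixed[i] += v
    return mixed
```

Here `speech` plays the role of the input speech signal and `source` the role of the pulse train or noise; a practical device would use more bands and a band layout matched to the voice range.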
In the example of the channel vocoder disclosed in Japanese Laid-Open Patent Publication (Kokai) No. 05-204397, outputs from the bandpass filters 10-1 to 10-N are rectified and smoothed when passing through short-time average-amplitude detection circuits 60-1 to 60-N. A voiced sound/unvoiced sound detector 71 determines a voiced sound component and an unvoiced sound component of the input speech signal. Upon detecting the voiced sound component, the detector 71 operates a switch 23 so as to select and deliver an output (pulse train) from the pulse train generator 21 to the multipliers 30-1 to 30-N. Upon detecting the unvoiced sound component, the detector 71 operates the switch 23 so as to select and deliver an output (noise) from the noise generator 22 to the multipliers 30-1 to 30-N. At the same time, a pitch detector 72 detects the pitch of the input speech signal and causes it to be reflected in the pulse train output from the pulse train generator 21. Thus, when the voiced sound component is detected, the output from the pulse train generator 21 contains pitch information, which is part of the characteristic information on the input speech signal.
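The source selection described above can be sketched as follows. The cited publication does not state how the detector 71 or the pitch detector 72 operate, so the zero-crossing-rate test, the autocorrelation pitch estimate, the threshold, and the lag range used here are all assumptions made for illustration.

```python
import random

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0.0) != (b < 0.0))
    return crossings / max(len(frame) - 1, 1)

def is_voiced(frame, zcr_threshold=0.25):
    """Assumed stand-in for detector 71: a low crossing rate suggests voicing."""
    return zero_crossing_rate(frame) < zcr_threshold

def detect_pitch_period(frame, min_lag, max_lag):
    """Assumed stand-in for pitch detector 72: lag of the autocorrelation peak."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i - lag] for i in range(lag, len(frame)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def excitation(frame, min_lag=20, max_lag=200, rng=None):
    """Play the role of switch 23: pulse train if voiced, noise otherwise."""
    rng = rng or random.Random(0)
    n = len(frame)
    if is_voiced(frame):
        period = detect_pitch_period(frame, min_lag, max_lag)
        # Pulse train generator 21, spaced at the detected pitch period.
        return [1.0 if i % period == 0 else 0.0 for i in range(n)]
    # Noise generator 22.
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]
```

The returned list is the sound source signal that would be delivered to the multipliers 30-1 to 30-N for one analysis frame.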
In the above-described formant vocoder, however, the formants and anti-formants cannot easily be extracted from the spectral envelope, so the formant vocoder requires a complicated analysis process or manual operation. The linear predictive analysis and synthesis system uses an all-pole model to generate sounds and uses a simple mean square value of the prediction errors as the evaluative reference for determining the coefficients of the model. Thus, this method does not focus on the nature of voices. The cepstrum vocoder requires a large amount of time for spectral processing and Fourier transformation and is thus insufficiently responsive for real-time use.
On the other hand, the channel vocoder directly expresses the parameters for the vocal tract characteristics as physical quantities in the frequency domain and thus takes the nature of voices into consideration. Due to its lack of mathematical rigor, however, the channel vocoder is not suited for digital processing.
There is provided a synthesized sound generating apparatus and method which can achieve responsive and high-quality speech synthesis based on a real-time convolution operation. Coefficients are generated by using dynamic cutting to extract characteristic information from a first signal. A convolution operation in the time domain is performed on a second signal using the generated coefficients to generate a synthesized signal. An interpolation process is performed on the coefficients to prevent a rapid change in level of the generated synthesized signal upon switching of the coefficients.
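The effect of interpolating the coefficients can be illustrated by the following minimal sketch of a time-domain convolution whose coefficient set is switched partway through the signal. The text above does not specify the interpolation scheme; the linear ramp, the parameter names `switch_at` and `ramp_len`, and the function names are assumptions for illustration only.

```python
def interpolate_coeffs(old, new, t):
    """Linear blend of two coefficient sets; t = 0 gives old, t = 1 gives new."""
    return [(1.0 - t) * o + t * n for o, n in zip(old, new)]

def convolve_with_switch(signal, old_coeffs, new_coeffs, switch_at, ramp_len):
    """Time-domain convolution that ramps from old to new coefficients.

    The switch begins at sample index `switch_at` and completes after
    `ramp_len` samples, so the output level changes gradually.
    """
    history = [0.0] * len(old_coeffs)   # most recent input sample first
    out = []
    for i, x in enumerate(signal):
        history = [x] + history[:-1]
        if i < switch_at:
            t = 0.0
        else:
            t = min((i - switch_at + 1) / ramp_len, 1.0)
        coeffs = interpolate_coeffs(old_coeffs, new_coeffs, t)
        out.append(sum(c * h for c, h in zip(coeffs, history)))
    return out

# A constant input with the gain switched from 1 to 0 over four samples.
steps = convolve_with_switch([1.0] * 10, [1.0], [0.0], switch_at=4, ramp_len=4)
# steps == [1.0, 1.0, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0, 0.0, 0.0]
```

Without the ramp, the output in this example would drop from 1.0 to 0.0 in a single sample; with it, no step exceeds 0.25.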