In recent years, the development of speech synthesis techniques has enabled generation of very high-quality synthesized speech.
However, the conventional use of such synthesized speech is still centered on uniform purposes, such as reading off news texts in announcer style.
Meanwhile, speech having distinctive features (synthesized speech highly representing personal speech or synthesized speech having a distinct prosody and voice quality, such as the speech style of a high-school girl or speech with a distinct intonation of the Kansai region in Japan) has started to be distributed as a kind of content. Thus, in pursuit of further amusement in interpersonal communication, a demand for creating distinct speech to be heard by the other party is expected to grow.
Meanwhile, the method for speech synthesis is classified into two major methods. The first method is a waveform concatenation speech synthesis method in which appropriate speech elements are selected, so as to be concatenated, from a speech element database (DB) that is previously provided. The second method is an analysis-synthesis speech synthesis method in which speech is analyzed so as to generate synthesized speech based on analyzed parameters.
In terms of converting the voice quality of the above-mentioned synthesized speech in many different ways, the waveform concatenation speech synthesis method requires preparing as many speech element DBs as there are required voice quality types, and switching between these speech element DBs. Thus, generating synthesized speech having various voice qualities incurs enormous costs.
On the other hand, in the speech analysis-synthesis method, the analyzed speech parameters are transformed. This allows conversion of the voice quality of the synthesized speech. Generally, a model of a voicing source driving a vocal tract (a source-filter model) is used for the analysis. It is difficult, however, to completely separate speech information into voicing source information and vocal tract information. This causes a problem of sound quality degradation as a result of the transformation of incompletely-separated voicing source information (voicing source information including vocal tract information) or incompletely-separated vocal tract information (vocal tract information including voicing source information).
The conventional speech analysis-synthesis method is mainly used for compression coding of speech. In such an application, the incomplete separation described above is not a serious problem. More specifically, it is possible to obtain synthesized speech close to the original speech by re-synthesizing the speech without transforming the parameters. In typical linear predictive coding (LPC), white noise or an impulse train, either of which has a uniform spectrum, is assumed for the voicing source. In addition, an all-pole transfer function, whose numerator is a constant term, is assumed for the vocal tract. In practice, however, the voicing source spectrum is not uniform. In addition, the transfer function of the vocal tract does not have an all-pole shape, due to the influence of the complicated concavo-convex shape of the vocal tract and its branching into the nasal cavity. Therefore, in the LPC analysis-synthesis method, a certain level of sound quality degradation is caused by model inconsistency. It is typically known that the synthesized speech sounds stuffy-nosed or sounds like a buzzer tone.
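As a rough illustration of the all-pole assumption, the following sketch estimates LPC coefficients with the autocorrelation method and the Levinson-Durbin recursion. The function names, the test filter `true_a`, and the frame length are hypothetical choices made for this example only. The input is the impulse response of a known all-pole filter, so the model assumptions hold exactly here, which is not the case for real speech.

```python
def autocorrelation(x, max_lag):
    """Return the autocorrelation values r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for the given autocorrelation values.

    Returns (a, err), where a[0] == 1 and the all-pole model is 1 / A(z)
    with A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a, err


def impulse_response(a, n):
    """First n samples of the impulse response of the all-pole filter 1 / A(z)."""
    h = []
    for t in range(n):
        v = 1.0 if t == 0 else 0.0
        for j in range(1, len(a)):
            if t - j >= 0:
                v -= a[j] * h[t - j]
        h.append(v)
    return h


# Hypothetical stable all-pole "vocal tract" driven by a single impulse.
true_a = [1.0, -1.3, 0.6]
signal = impulse_response(true_a, 400)
est_a, residual_energy = levinson_durbin(autocorrelation(signal, 2), 2)
print([round(c, 4) for c in est_a])
```

When the source is spectrally flat and the filter is truly all-pole, the estimate recovers the filter. The degradation described above arises because neither condition holds for natural speech.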
To reduce such model inconsistency, the following measures are separately taken for the voicing source and the vocal tract.
Specifically, for the voicing source, preemphasis processing is performed on a speech waveform to be analyzed. A typical voicing source spectrum has a tilt of −12 dB/oct., and a tilt of +6 dB/oct. is added when the speech is emitted into the air from the lips. Therefore, the spectral tilt of the voicing source as observed in the speech waveform, which combines these two, is generally considered to be −6 dB/oct. Thus, it is possible to compensate for the voicing-source spectral tilt by adding a tilt of +6 dB/oct. through differentiation of the speech waveform.
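The +6 dB/oct. tilt obtained by differentiation can be checked numerically. The sketch below is a minimal illustration, not part of the referenced techniques: the function names, the 16 kHz sampling rate, and the test frequencies are assumptions made for the example. It implements a first-order difference and measures how much its gain rises over one octave at low frequencies.

```python
import math


def preemphasize(x, coeff=1.0):
    """First-order difference y[n] = x[n] - coeff * x[n-1].

    coeff = 1.0 is a plain differentiator; values slightly below 1
    (for example 0.97) are a common practical choice, assumed here only
    as an optional parameter.
    """
    return [x[0]] + [x[n] - coeff * x[n - 1] for n in range(1, len(x))]


def difference_gain_db(freq_hz, sample_rate_hz, coeff=1.0):
    """Magnitude response of 1 - coeff * z^-1 at freq_hz, in dB."""
    w = 2.0 * math.pi * freq_hz / sample_rate_hz
    re = 1.0 - coeff * math.cos(w)
    im = coeff * math.sin(w)
    return 20.0 * math.log10(math.hypot(re, im))


# Gain increase over one octave (200 Hz to 400 Hz) at an assumed 16 kHz rate.
slope_db_per_octave = (difference_gain_db(400.0, 16000.0)
                       - difference_gain_db(200.0, 16000.0))
print(round(slope_db_per_octave, 2))
```

The measured slope is close to 6 dB per octave well below the Nyquist frequency, which is why differentiation approximately cancels a −6 dB/oct. tilt.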
In addition, a method used for the vocal tract is to extract a component inconsistent with the all-pole model as a prediction residual and fold the extracted prediction residual into the voicing source information, that is, to apply the residual waveform as the driving voicing source for the synthesis. This causes the waveform of the synthesized speech to completely match the original speech. Code excited linear prediction (CELP) is a technique in which the residual waveform is vector-quantized and transmitted as a code number.
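The exact reconstruction from the residual can be illustrated with a short sketch. The coefficients `a` and the sample values `x` below are hypothetical and are deliberately not fitted to each other, and vector quantization of the residual (as in CELP) is left out.

```python
def inverse_filter(x, a):
    """Prediction residual e[n] = sum_j a[j] * x[n-j], with a[0] == 1."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def synthesis_filter(e, a):
    """All-pole synthesis y[n] = e[n] - sum_{j>=1} a[j] * y[n-j]."""
    y = []
    for n in range(len(e)):
        v = e[n]
        for j in range(1, len(a)):
            if n - j >= 0:
                v -= a[j] * y[n - j]
        y.append(v)
    return y


# Hypothetical coefficients, not estimated from x.
a = [1.0, -0.9, 0.4]
x = [0.0, 1.0, 0.5, -0.3, 0.8, -1.2, 0.1, 0.4]

residual = inverse_filter(x, a)      # everything the all-pole model misses
y = synthesis_filter(residual, a)    # residual used as the driving source
print(max(abs(u - v) for u, v in zip(x, y)))
```

The round trip reproduces `x` up to rounding error whatever coefficients are used, because the residual absorbs any modeling error. This is why good re-synthesis quality says little about how cleanly the voicing source and the vocal tract were separated.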
According to the technique, the re-synthesized speech has a satisfactory voice quality even when the voicing source information and the vocal tract information are not completely separated due to inaccuracy of analysis attributed to low consistency of the linear prediction model.
However, in an application where voice quality is converted by varying parameters, it is important to separate the voicing source information and the vocal tract information as accurately as possible. That is, when the separation is incomplete, even a change intended to affect only parameters attributable to the vocal tract (for example, formant center frequency) changes the characteristics of the voicing source at the same time. Therefore, in order to allow the vocal tract and the voicing source to be controlled separately, it is necessary to accurately separate the information regarding the two.
In the speech analysis-synthesis method, one technique for separating the voicing source information and the vocal tract information more accurately is to obtain the vocal tract information that is not sufficiently captured by a single LPC analysis through plural LPC analyses, so as to flatten the spectral information of the voicing source (for example, see Patent Reference 1).
FIG. 1 is a block diagram showing a structure of a conventional speech analyzing apparatus described in Patent Reference 1.
Hereinafter, an operation of the conventional speech analyzing apparatus shown in FIG. 1 shall be described. An input speech signal 1a is inputted to a first spectrum analysis unit 2a and an inverse filtering unit 4a. The first spectrum analysis unit 2a analyzes the input speech signal 1a so as to extract a first spectral envelope parameter, and outputs the extracted first spectral envelope parameter to a first quantization unit 3a. The first quantization unit 3a quantizes the first spectral envelope parameter so as to obtain a first quantized spectral envelope parameter, and outputs the obtained first quantized spectral envelope parameter to the inverse filtering unit 4a. The inverse filtering unit 4a inverse-filters the input speech signal 1a using the first quantized spectral envelope parameter so as to obtain a prediction residual signal, and inputs the obtained prediction residual signal to a second spectrum analysis unit 5a and a voicing source coding unit 7a. The second spectrum analysis unit 5a analyzes the prediction residual signal so as to extract a second spectral envelope parameter, and outputs the extracted second spectral envelope parameter to a second quantization unit 6a. The second quantization unit 6a quantizes the second spectral envelope parameter so as to obtain a second quantized spectral envelope parameter, and outputs the obtained second quantized spectral envelope parameter to the voicing source coding unit 7a and to the outside. The voicing source coding unit 7a extracts a voicing source signal using the prediction residual signal and the second quantized spectral envelope parameter, codes the extracted voicing source signal, and outputs the result as a coded voicing source. The coded voicing source, the first quantized spectral envelope parameter, and the second quantized spectral envelope parameter constitute the coding result.
By thus configuring the speech analyzing apparatus, spectral envelope characteristics that cannot be removed by the first spectrum analysis unit 2a alone are extracted by the second spectrum analysis unit 5a. This allows flattening of the frequency characteristics of the voicing source information outputted from the voicing source coding unit 7a.
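The data flow of FIG. 1 can be sketched roughly as follows. This is only an illustration under several assumptions that Patent Reference 1 as summarized here does not confirm: both spectrum analyses are taken to be LPC analyses, quantization is reduced to uniform rounding of the coefficients, the voicing source extraction in unit 7a is approximated by a second inverse filtering, and the coding step is omitted. All function names, the analysis orders, and the test signal are hypothetical.

```python
import math


def autocorrelation(x, max_lag):
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion; returns [1, a1, ..., a_order]."""
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def quantize(a, step=1.0 / 64):
    """Uniform rounding of the coefficients (a simplification for this sketch)."""
    return [1.0] + [round(c / step) * step for c in a[1:]]


def inverse_filter(x, a):
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def two_stage_analysis(x, order1, order2):
    a1 = quantize(lpc(x, order1))           # units 2a and 3a
    residual = inverse_filter(x, a1)        # unit 4a
    a2 = quantize(lpc(residual, order2))    # units 5a and 6a
    source = inverse_filter(residual, a2)   # assumed stand-in for unit 7a
    return a1, a2, source


# Hypothetical test signal: a sum of three sinusoids.
speech = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) + 0.25 * math.sin(2.3 * n)
          for n in range(200)]
a1, a2, source = two_stage_analysis(speech, 2, 2)
print(a1, a2)
```

The second analysis models whatever spectral envelope remains in the residual after the first stage, so the signal passed on as the voicing source is flatter than the first-stage residual alone.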
In addition, another related technique is embodied as a speech enhancement apparatus which separates the input speech into voicing source information and vocal tract information, enhances the separated voicing source and vocal tract information individually, and generates synthesized speech using the enhanced voicing source information and vocal tract information (for example, see Patent Reference 2).
When separating the input speech, the speech enhancement apparatus calculates an autocorrelation-function value of the input speech of a current frame. The speech enhancement apparatus also calculates an average autocorrelation-function value as a weighted average of the autocorrelation-function value of the input speech of the current frame and the autocorrelation-function value of the input speech of a previous frame. This smooths out rapid changes in the shape of the vocal tract between frames. Thus, it is possible to prevent rapid gain changes at the time of enhancement, which makes abnormal sounds less likely to occur.
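The weighted averaging between frames can be sketched as follows. The function names and the weight of 0.5 are assumptions made for illustration. The summary of Patent Reference 2 above does not specify the weighting, and the rest of the enhancement processing is omitted.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(max_lag + 1)]


def smoothed_autocorrelation(current_frame, previous_r, max_lag, weight=0.5):
    """Weighted average of the current and previous frame autocorrelations.

    weight is the share given to the current frame (assumed value).
    previous_r is None for the first frame, in which case no smoothing is applied.
    """
    r = autocorrelation(current_frame, max_lag)
    if previous_r is None:
        return r
    return [weight * c + (1.0 - weight) * p for c, p in zip(r, previous_r)]


previous_r = [4.0, 2.0]                  # hypothetical values from the previous frame
averaged = smoothed_autocorrelation([1.0, 1.0], previous_r, max_lag=1)
print(averaged)  # [3.0, 1.5]
```

Because the vocal tract parameters are then derived from the averaged values, they change more gradually from frame to frame than parameters derived from each frame in isolation.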
[Patent Reference 1] Japanese Unexamined Patent Application Publication No. 5-257498 (pages 3 to 4, FIG. 1)
[Patent Reference 2] International Application Published under the Patent Cooperation Treaty No. 2004/040555