An apparatus that synthesizes a speech waveform from a phoneme/prosodic sequence (obtained from an input sentence) is called a “text to speech synthesis apparatus”. In general, the text to speech synthesis apparatus includes a language processing unit, a prosody processing unit, and a speech synthesis unit. The language processing unit analyzes the input sentence and determines linguistic information (such as a reading, an accent, and a pause position). The prosody processing unit generates prosodic information from the accent and the pause position: a fundamental frequency pattern (representing voice pitch and intonation change) and a phoneme duration (representing the duration of each phoneme). The speech synthesis unit receives the phoneme sequence and the prosodic information and generates the speech waveform.
One widely used speech synthesis method is speech synthesis based on unit selection. In this method, for each segment obtained by dividing an input text into synthesis units, a speech unit is selected from a speech unit database (storing a large number of speech units) using a cost function (having a target cost and a concatenation cost), and a speech waveform is generated by concatenating the selected speech units. As a result, natural-sounding synthesized speech is obtained.
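The selection described above can be sketched as a dynamic-programming search that minimizes the sum of target costs and concatenation costs over all segments. This is only an illustrative sketch: the function name `select_units`, the representation of a speech unit as a (start pitch, end pitch) pair, and the two toy cost functions are assumptions made for the example and do not come from the cited methods.

```python
def select_units(candidates, target_cost, concat_cost):
    """Pick one candidate per segment so that the total of target costs
    and concatenation costs is minimal (dynamic programming)."""
    # best[i][j] = (cumulative cost, index of best predecessor)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            # Cheapest way to reach candidate u from the previous segment.
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(p, u), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, u), back))
        best.append(row)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Toy data (hypothetical): each unit is (start pitch, end pitch) in Hz.
targets = [100.0, 120.0]
candidates = [
    [(100.0, 100.0), (98.0, 118.0)],
    [(120.0, 120.0), (100.0, 140.0)],
]

def target_cost(i, unit):
    # Distance between the unit's mean pitch and the target pitch.
    return abs((unit[0] + unit[1]) / 2.0 - targets[i])

def concat_cost(prev_unit, unit):
    # Pitch jump at the join between two consecutive units.
    return abs(prev_unit[1] - unit[0])

path, total = select_units(candidates, target_cost, concat_cost)
```

In this toy case both candidates of the second segment have the same target cost, so the concatenation cost alone decides which one is selected.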
Furthermore, as a method for raising the stability of the synthesized speech (avoiding the discontinuities that occur in synthesized speech based on unit selection), a speech synthesis apparatus based on plural unit selection and fusion is disclosed in JP-A No. 2005-164749 (KOKAI).
In the speech synthesis apparatus based on plural unit selection and fusion, for each segment obtained by dividing the input text into synthesis units, a plurality of speech units is selected from the speech unit database, and the plurality of speech units is fused. By concatenating the fused speech units, a speech waveform is generated.
As a fusion method, for example, a method of averaging pitch-cycle waveforms is used. As a result, synthesized speech having high quality (naturalness and stability) is generated.
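As a minimal sketch of fusion by averaging, the following assumes that the pitch-cycle waveforms of the selected speech units have already been aligned and brought to the same length; those steps are not shown. The function name `fuse_pitch_cycles` is hypothetical.

```python
def fuse_pitch_cycles(waveforms):
    """Average several pitch-cycle waveforms sample by sample.

    Assumes the waveforms are already aligned and of equal length.
    """
    length = len(waveforms[0])
    if any(len(w) != length for w in waveforms):
        raise ValueError("pitch-cycle waveforms must have equal length")
    return [sum(samples) / len(waveforms) for samples in zip(*waveforms)]


fused = fuse_pitch_cycles([[0.0, 2.0, 4.0], [2.0, 4.0, 6.0]])
```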
In order to execute speech processing using spectral envelope information of speech data, various spectral parameters (each representing spectral envelope information as a parameter) have been proposed. Examples include the linear prediction coefficient, cepstrum, mel cepstrum, LSP (Line Spectrum Pair), MFCC (mel frequency cepstrum coefficient), a parameter by PSE (Power Spectrum Envelope) analysis (refer to JP-A No. H11-202883 (KOKAI)), a parameter of amplitude of harmonics used for sine wave synthesis such as HNM (Harmonics Plus Noise Model), a parameter by a mel filter bank (refer to “Noise-robust speech recognition using band-dependent weighted likelihood”, Yoshitaka Nishimura, Takahiro Shinozaki, Koji Iwano, Sadaoki Furui, December 2003, SP2003-116, pp. 19-24, IEICE technical report), a spectrum obtained by discrete Fourier transform, and a spectrum by STRAIGHT analysis.
When spectral information is represented by a parameter, the required characteristics of the parameter differ depending on the use. In general, the parameter should not be affected by the fine structure of the spectrum (caused by the influence of harmonics). In order to execute statistical processing, the spectral information of a speech frame (extracted from a speech waveform) should be represented efficiently and with high quality by a constant (and small) number of dimensions. Accordingly, a source filter model is assumed (in which a sound source characteristic and a vocal tract characteristic are separated), and coefficients of the vocal tract filter (such as a linear prediction coefficient or a cepstrum coefficient) are used as a spectral parameter. In the case of vector quantization, LSP is used as a parameter that solves the stability problem of the filter.
Furthermore, in order to reduce the information quantity of the parameter, a parameter (such as mel cepstrum or MFCC) corresponding to a non-linear frequency scale (such as the mel scale or the Bark scale) that takes the hearing characteristic into consideration is often used.
A spectral parameter used for speech synthesis needs three characteristics: “high quality”, “effective”, and “easy execution of processing corresponding to band”.
“High quality” means that, when speech is represented by a spectral parameter and a speech waveform is synthesized from the spectral parameter, the hearing quality does not drop, and that the parameter can be stably extracted without being influenced by the fine structure of the spectrum.
“Effective” means that a spectral envelope can be represented by a small number of dimensions or a small information quantity. In other words, statistical processing can be executed with a small processing quantity. Furthermore, when the spectral envelope is stored in a storage such as a hard disk or a memory, it occupies little capacity.
“Easy execution of processing corresponding to band” means that each dimension of the parameter represents a fixed local frequency band, and an outline of the spectral envelope is obtained by plotting each dimension of the parameter. As a result, band-pass filtering is executed by a simple operation (setting the value of each dimension outside the pass band to “zero”). Furthermore, when parameters are averaged, no special operation such as mapping the parameters onto a frequency axis is necessary. Accordingly, the spectral envelope can be averaged simply by directly averaging the value of each dimension.
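For a parameter whose dimensions each represent a fixed frequency band, the two operations mentioned above reduce to element-wise arithmetic. The sketch below is illustrative only; the function names `keep_band` and `average_envelopes` and the four-dimensional toy vectors are assumptions, and suppression is modeled, following the text, by setting a dimension to zero.

```python
def keep_band(params, low_dim, high_dim):
    """Band-pass by zeroing every dimension outside [low_dim, high_dim)."""
    return [v if low_dim <= d < high_dim else 0.0
            for d, v in enumerate(params)]


def average_envelopes(param_list):
    """Average several envelopes by averaging each dimension directly.

    No mapping onto a frequency axis is needed because dimension d
    refers to the same band in every parameter vector.
    """
    return [sum(values) / len(values) for values in zip(*param_list)]


band_limited = keep_band([1.0, 2.0, 3.0, 4.0], 1, 3)
averaged = average_envelopes([[1.0, 2.0], [3.0, 6.0]])
```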
Furthermore, different processing can be easily applied to the band above and the band below a predetermined frequency. Accordingly, in the speech synthesis based on plural unit selection and fusion, when speech units are fused, stability can be emphasized in the low band and naturalness can be emphasized in the high band. The above-mentioned spectral parameters are considered below from these three viewpoints.
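One possible reading of band-dependent fusion is sketched below: dimensions below a boundary are averaged over all selected units (aiming at stability), while dimensions at or above the boundary are copied from a single unit (aiming at naturalness). Copying the high band from the first unit, the function name `fuse_by_band`, and the toy values are assumptions for illustration; the text above does not specify how each band is processed.

```python
def fuse_by_band(unit_params, boundary_dim):
    """Fuse band-wise parameters of several units.

    Low band (dimension < boundary_dim): average over all units.
    High band (dimension >= boundary_dim): take the first unit as-is
    (assumed here to be the best-ranked unit).
    """
    fused = []
    for d, values in enumerate(zip(*unit_params)):
        if d < boundary_dim:
            fused.append(sum(values) / len(values))
        else:
            fused.append(values[0])
    return fused


fused = fuse_by_band([[1.0, 2.0, 9.0, 8.0], [3.0, 4.0, 1.0, 0.0]], 2)
```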
As to the “linear prediction coefficient”, an autoregression coefficient of the speech waveform is used as a parameter. Briefly, it is not a parameter of a frequency band, and processing corresponding to the band cannot be easily executed.
As to the “cepstrum or mel cepstrum”, a logarithm spectrum is represented by coefficients of a sine wave basis on a linear frequency scale or a non-linear mel scale. However, each basis extends over the entire frequency band, and the value of each dimension does not represent a local feature of the spectrum. Accordingly, processing corresponding to the band cannot be easily executed.
The “LSP coefficient” is a parameter converted from the linear prediction coefficient to a discrete frequency. Briefly, a speech spectrum is represented by the density of the locations of these frequencies, which is similar to a formant frequency. Accordingly, the same dimension of LSP is not always assigned to a close frequency, and averaging the dimensional values does not always yield an appropriate averaged spectral envelope. As a result, processing corresponding to the band cannot be easily executed.
“MFCC” is a parameter in the cepstrum domain, calculated by DCT (Discrete Cosine Transform) of the output of a mel filter bank. In the same way as the cepstrum, each basis extends over the entire frequency band, and the value of each dimension does not represent a local feature of the spectrum. Accordingly, processing corresponding to the band cannot be easily executed.
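The non-local nature of a cepstrum-domain parameter can be illustrated with a small sketch. It applies an unnormalized DCT-II to eight toy logarithm band values, changes a single coefficient, and transforms back; every band value changes. The toy values and the function names `dct2` and `idct2` are assumptions for illustration, and an actual MFCC implementation may use a different normalization.

```python
import math


def dct2(x):
    """Unnormalized DCT-II: c[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / n)."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n)
                for i in range(n))
            for k in range(n)]


def idct2(c):
    """Inverse of dct2 above (a scaled DCT-III)."""
    n = len(c)
    return [(c[0] + 2.0 * sum(c[k] * math.cos(math.pi * k * (i + 0.5) / n)
                              for k in range(1, n))) / n
            for i in range(n)]


log_bands = [1.0, 2.0, 3.0, 2.5, 2.0, 1.5, 1.0, 0.5]  # toy log band values
cepstrum = dct2(log_bands)

modified = list(cepstrum)
modified[1] += 1.0          # change one cepstrum-domain dimension only
changed_bands = idct2(modified)
```

Comparing `changed_bands` with `log_bands` shows a difference in all eight bands, which is why an operation restricted to one frequency band cannot be expressed by editing a single cepstrum-domain dimension.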
As to the feature parameter by the PSE model disclosed in JP-A No. H11-202883 (KOKAI), a logarithm power spectrum is sampled at each integer multiple of the fundamental frequency. The sampled data sequence is represented by the coefficients of a cosine series of M terms, and weighted with the hearing characteristic.
The feature parameter disclosed in JP-A No. H11-202883 (KOKAI) is also a parameter in the cepstrum domain. Accordingly, processing corresponding to the band cannot be easily executed. Furthermore, as to the above-mentioned sampled data sequence, and a parameter sampled from a logarithm spectrum at each integer multiple of the fundamental frequency (such as the amplitude of harmonics for sine wave synthesis), the value of each dimension of the parameter does not represent a fixed frequency band. When a plurality of parameters is averaged, the frequency band corresponding to each dimension differs from parameter to parameter. Accordingly, the envelopes cannot be averaged by averaging the plurality of parameters.
In the same way, as to the parameter of PSE analysis, the above-mentioned sampled data sequence, and the amplitude parameter of harmonics used for sine wave synthesis (such as HNM), processing corresponding to the band cannot be easily executed.
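The dependence of each dimension on the fundamental frequency can be seen in a short sketch. The function name `harmonic_frequencies` and the example values (fundamental frequencies of 100 Hz and 125 Hz, an upper limit of 4000 Hz) are assumptions for illustration.

```python
def harmonic_frequencies(f0_hz, max_hz):
    """Frequencies at which a harmonic-sampled parameter is taken:
    integer multiples of the fundamental frequency up to max_hz."""
    count = int(max_hz // f0_hz)
    return [f0_hz * k for k in range(1, count + 1)]


frame_a = harmonic_frequencies(100.0, 4000.0)
frame_b = harmonic_frequencies(125.0, 4000.0)

# The third dimension refers to a different frequency in each frame,
# and even the number of dimensions differs between the two frames.
third_a, third_b = frame_a[2], frame_b[2]
```

Because the third dimension corresponds to 300 Hz in one frame and 375 Hz in the other, averaging the two vectors dimension by dimension would mix values from different frequency bands.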
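In the above-mentioned technical report (“Noise-robust speech recognition using band-dependent weighted likelihood”), in the course of calculating MFCC, the value obtained by the mel filter bank is used as a feature parameter without applying DCT, and is applied to speech recognition.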
As to the feature parameter by the mel filter bank, a power spectrum is multiplied by a triangular filter bank whose filters are located at equal intervals on the mel scale. The logarithm of the power of each band is set as the feature parameter.
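The calculation can be sketched as follows. The mel conversion formula (2595 log10(1 + f/700)), the placement of the filter edges from 0 Hz to the Nyquist frequency, the small floor applied before taking the logarithm, and the function names are all assumptions made for this example; the cited report may define its filter bank differently.

```python
import math


def hz_to_mel(f_hz):
    # One commonly used mel formula (assumed here).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank_features(power_spectrum, sample_rate, num_bands):
    """Log power in triangular bands equally spaced on the mel scale.

    power_spectrum holds bins from 0 Hz to sample_rate / 2 inclusive.
    """
    num_bins = len(power_spectrum)
    nyquist = sample_rate / 2.0
    bin_hz = [nyquist * i / (num_bins - 1) for i in range(num_bins)]
    # num_bands + 2 edge frequencies, equally spaced in mel.
    top_mel = hz_to_mel(nyquist)
    edges = [mel_to_hz(top_mel * i / (num_bands + 1))
             for i in range(num_bands + 2)]
    features = []
    for b in range(num_bands):
        left, center, right = edges[b], edges[b + 1], edges[b + 2]
        energy = 0.0
        for f, p in zip(bin_hz, power_spectrum):
            if left < f <= center:        # rising slope of the triangle
                energy += p * (f - left) / (center - left)
            elif center < f < right:      # falling slope of the triangle
                energy += p * (right - f) / (right - center)
        features.append(math.log(max(energy, 1e-10)))
    return features


# Toy input: a flat power spectrum with 257 bins at a 16 kHz sample rate.
features = mel_filter_bank_features([1.0] * 257, 16000.0, 8)
```

With a flat input spectrum, the values grow with the band index simply because the bands become wider toward high frequencies on a linear frequency axis.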
As to the coefficient of the mel filter bank, the value of each dimension represents the logarithm of the power of a fixed frequency band, and processing corresponding to the band can be easily executed. However, regenerating the spectrum of the speech data by synthesizing the spectrum from the parameter is not taken into consideration. Briefly, this coefficient is not based on the assumption that a logarithm envelope is modeled as a linear combination of bases and coefficients, i.e., it is not a high quality parameter. Actually, the coefficients of the mel filter bank often do not fit the valley parts of the logarithm spectrum sufficiently. When a spectrum is synthesized from the coefficients of the mel filter bank, the sound quality often drops.
As to a spectrum obtained by the discrete Fourier transform or the STRAIGHT analysis, processing corresponding to the band can be easily executed. However, these spectra have a number of dimensions larger than the window length used for analyzing the speech data, i.e., they are not effective.
Furthermore, the spectrum obtained by the discrete Fourier transform often includes the fine structure of the spectrum. Briefly, this spectrum is not always a high quality parameter.
As mentioned above, various spectral envelope parameters have been proposed. However, a spectral envelope parameter having all three characteristics necessary for speech synthesis (“high quality”, “effective”, and “easy execution of processing corresponding to band”) has not yet been considered.