1. Field of the Invention
The invention relates to voice synthesis for artificially forming a voice waveform signal.
2. Description of Related Art
A voice waveform of a natural voice can be expressed by coupling, in time sequence, basic units in which phonemes, namely one or two vowels (hereinafter each referred to as V) and one or two consonants (hereinafter each referred to as C), are connected in a form such as “CV”, “CVC”, or “VCV”.
Therefore, if a character string in a document is replaced with a phoneme train in which phonemes are coupled as mentioned above and a sound corresponding to each phoneme in the phoneme train is sequentially formed, a desired document (text) can be read out by an artificial voice.
A text voice synthesizing apparatus is an apparatus that provides the function described above. A typical voice synthesizing apparatus comprises a text analysis processing unit for forming an intermediate language character string signal by inserting information such as accent and phrase into a supplied text, and a voice synthesis processing unit for synthesizing a voice waveform signal corresponding to the intermediate language character string signal.
The voice synthesis processing unit comprises a sound source module for generating, as a basic sound, a pulse signal corresponding to a voiced sound and a noise signal corresponding to a voiceless sound, and a voice route filter for generating a voice waveform signal by performing a filtering process on the basic sound. The voice synthesis processing unit is further provided with a phoneme data memory in which filter coefficients of the voice route filter, obtained by converting voice samples recorded when a voice sample target person actually reads out a text, are stored as phoneme data.
The voice synthesis processing unit is operative to divide the intermediate language character string signal supplied from the text analysis processing unit into a plurality of phonemes, to read out the phoneme data corresponding to each phoneme from the phoneme data memory, and to use it as filter coefficients of the voice route filter.
With this construction, the supplied text is converted into the voice waveform signal having a voice tone corresponding to a frequency (hereinafter, referred to as a pitch frequency) of a pulse signal indicative of the basic sound.
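The construction described above can be illustrated with a minimal sketch, under stated assumptions. The text does not specify the form of the voice route filter; the sketch assumes an all-pole filter with the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The names PHONEME_DATA, pulse_train, white_noise, all_pole_filter, and synthesize, the coefficient values, and the sampling rate are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical phoneme data memory: phoneme name -> filter coefficients a_1..a_p.
# The values are placeholders, not measured data.
PHONEME_DATA = {"a": [-0.9], "s": [0.5]}

def pulse_train(n_samples, pitch_hz, sample_rate):
    """Voiced basic sound: one unit pulse per pitch period."""
    period = max(1, round(sample_rate / pitch_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Voiceless basic sound: uniform noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a):
    """Assumed filter form: y[n] = x[n] - sum_k a[k-1] * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

def synthesize(phoneme, voiced, pitch_hz=100.0, sample_rate=8000, n_samples=160):
    """Read the phoneme data and use it as the filter coefficients."""
    if voiced:
        source = pulse_train(n_samples, pitch_hz, sample_rate)
    else:
        source = white_noise(n_samples)
    return all_pole_filter(source, PHONEME_DATA[phoneme])
```

In this sketch the pitch frequency enters only through the spacing of the pulses, while the phoneme identity enters only through the filter coefficients.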
However, the phoneme data stored in the phoneme data memory retains a considerable influence of the pitch frequency of the voice actually read out by the voice sample target person. On the other hand, the pitch frequency of the voice waveform signal to be synthesized hardly ever coincides with the pitch frequency of the voice actually read out by the voice sample target person.
Therefore, there is a problem in that the frequency component caused by the influence of the pitch frequency included in the phoneme data is not completely removed at the time of voice synthesis. This component and the pitch frequency of the voice waveform signal to be synthesized interfere with each other, and as a result an unnatural synthetic voice is produced.
It is an object of the invention to provide a phoneme data forming method for use in a voice synthesizing apparatus, and a voice synthesizing apparatus, with which a natural synthetic voice can be obtained irrespective of the pitch frequency of the voice waveform signal to be synthesized.
According to one aspect of the invention, there is provided a phoneme data forming method for use in a voice synthesizing apparatus that obtains a voice waveform signal by performing a filtering process on a frequency signal using filter characteristics according to the phoneme data, comprising the steps of: separating each of input voice samples into a plurality of phonemes; obtaining a linear predictive coding coefficient by performing a linear predictive coding analysis on each of said plurality of phonemes, setting it as temporary phoneme data, obtaining a linear predictive coding Cepstrum based on the linear predictive coding coefficient, and setting it as a first linear predictive coding Cepstrum; obtaining a linear predictive coding Cepstrum by performing the linear predictive coding analysis on each of the voice waveform signals obtained by the voice synthesizing apparatus while changing a frequency of the frequency signal step by step, with a filter characteristic of the voice synthesizing apparatus being set to a filter characteristic according to the temporary phoneme data, and setting it as a second linear predictive coding Cepstrum; obtaining an error between the first linear predictive coding Cepstrum and the second linear predictive coding Cepstrum as a linear predictive coding Cepstrum distortion; classifying the phonemes belonging to a same phoneme name into a plurality of groups according to phoneme length; and selecting, from each group, the phoneme having the smallest linear predictive coding Cepstrum distortion and using the temporary phoneme data corresponding to the selected phoneme as the phoneme data.
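The analysis steps of this aspect, up to the linear predictive coding Cepstrum distortion, can be sketched as follows. This is only an illustration under assumptions not stated in the text: the autocorrelation method with the Levinson-Durbin recursion is used for the linear predictive coding analysis, the filter is all-pole with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, the error between the two Cepstra is taken as a Euclidean distance, and the distortion is averaged over the pitch steps. All function names and parameter values are hypothetical.

```python
import math

def autocorrelation(x, max_lag):
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve for a_1..a_p of A(z) = 1 + sum a_k z^-k from autocorrelation r."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        acc = r[i + 1] + sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(i):
            a[j] = prev[j] + k * prev[i - 1 - j]
        err *= 1.0 - k * k
    return a

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of 1/A(z), via the standard recursion."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)                # c[0] is unused here
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

def cepstrum_distance(c1, c2):
    """Assumed error measure: Euclidean distance between two Cepstra."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(c1, c2)))

def synthesize(a, pitch_hz, sample_rate=8000, n_samples=400):
    """Pulse-train excitation through the all-pole filter 1/A(z)."""
    period = max(1, round(sample_rate / pitch_hz))
    y = []
    for n in range(n_samples):
        acc = 1.0 if n % period == 0 else 0.0
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

def cepstrum_distortion(samples, order, pitches_hz, n_ceps=12):
    """Return (temporary phoneme data, distortion averaged over pitch steps)."""
    temp = levinson_durbin(autocorrelation(samples, order), order)
    first = lpc_to_cepstrum(temp, n_ceps)
    total = 0.0
    for f in pitches_hz:                    # change the pitch step by step
        resynth = synthesize(temp, f)
        a2 = levinson_durbin(autocorrelation(resynth, order), order)
        total += cepstrum_distance(first, lpc_to_cepstrum(a2, n_ceps))
    return temp, total / len(pitches_hz)
```

A phoneme whose temporary phoneme data yields a small distortion under this measure is one whose resynthesized spectrum envelope changes little as the pitch frequency is varied.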
According to another aspect of the invention, there is provided a voice synthesizing apparatus comprising: a phoneme data memory in which a plurality of phoneme data corresponding to each of a plurality of phonemes have previously been stored; a sound source for generating frequency signals indicative of a voiced sound and a voiceless sound; and a voice route filter for obtaining a voice waveform signal by filtering the frequency signal based on filter characteristics according to the phoneme data, wherein a linear predictive coding coefficient is obtained by performing a linear predictive coding analysis on each phoneme and is set as temporary phoneme data; a linear predictive coding Cepstrum based on the linear predictive coding coefficient is obtained and set as a first linear predictive coding Cepstrum; with the filter characteristics of the voice synthesizing apparatus set to filter characteristics according to the temporary phoneme data and the frequency of the frequency signal changed step by step, the linear predictive coding analysis is performed on each of the voice waveform signals obtained by the voice synthesizing apparatus at each of the frequencies, and a linear predictive coding Cepstrum is obtained and set as a second linear predictive coding Cepstrum; an error between the first linear predictive coding Cepstrum and the second linear predictive coding Cepstrum is obtained as a linear predictive coding Cepstrum distortion; the phonemes belonging to a same phoneme name are classified into a plurality of groups according to phoneme length; and each of the phoneme data is the temporary phoneme data corresponding to the optimum phoneme selected from each group based on the linear predictive coding Cepstrum distortion.
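The classification and selection that determine the stored phoneme data can be sketched as below. The text does not state how the groups by phoneme length are delimited; the sketch assumes fixed-width length bins, and the bin width, the record fields (name, length_ms, distortion, temp_data), and the function name select_phoneme_data are hypothetical.

```python
def select_phoneme_data(candidates, length_bin_ms=20):
    """Pick, per phoneme name and length group, the minimum-distortion candidate.

    candidates: list of dicts with keys "name", "length_ms", "distortion",
    and "temp_data" (the temporary phoneme data). All field names are
    illustrative assumptions.
    """
    best = {}
    for cand in candidates:
        # Group key: same phoneme name, same (assumed) length bin.
        key = (cand["name"], cand["length_ms"] // length_bin_ms)
        if key not in best or cand["distortion"] < best[key]["distortion"]:
            best[key] = cand
    # The temporary phoneme data of each selected phoneme becomes the phoneme data.
    return {key: cand["temp_data"] for key, cand in best.items()}


candidates = [
    {"name": "a", "length_ms": 30, "distortion": 0.4, "temp_data": [-0.8]},
    {"name": "a", "length_ms": 35, "distortion": 0.2, "temp_data": [-0.7]},
    {"name": "a", "length_ms": 70, "distortion": 0.9, "temp_data": [-0.6]},
]
phoneme_data = select_phoneme_data(candidates)
```

With these placeholder values, the two candidates of 30 ms and 35 ms fall into the same group and the one with the smaller distortion is kept, while the 70 ms candidate is the only member of its group.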