A variety of speech synthesis apparatus have been developed which analyze a text sentence and generate synthesized speech by synthesis by rule from the speech information indicated by the sentence.
Among these, a typical conventional speech synthesis apparatus employing synthesis by rule includes a storage which holds a large amount of: unit waveforms (for instance, unit waveforms with a duration on the order of a syllable or pitch, extracted from natural speech); phonological information, such as information on the environment in which a phoneme is uttered, or on the pitch shape, amplitude or duration of the phoneme; and prosodic information.
At the time of speech synthesis, a conventional speech synthesis apparatus employing synthesis by rule reads an optimum unit waveform from the storage, based on the phonological information and prosodic information generated from the results of analysis of an input text sentence. The apparatus then concatenates a plurality of unit waveforms by placing the read-out unit waveforms at the positions of pitch synchronization (the waveform center location of each unit waveform) generated from the prosodic information. The apparatus then outputs the synthesized speech.
In the conventional speech synthesis apparatus, the position of pitch synchronization is controlled at a precision of the sampling period of the synthesized speech.
This leads to lowered precision of the position of pitch synchronization and to deteriorated sound quality of the synthesized speech. If, in particular, the pitch frequency is high and the interval between the positions of pitch synchronization is narrow, an error in the position of pitch synchronization leads to significant deterioration in the sound quality.
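The dependence on pitch frequency can be illustrated with simple arithmetic. The following is a minimal sketch only: the 16 kHz sampling rate, the worst-case placement error of half a sample, and the function name are assumptions made for illustration and are not taken from the conventional apparatus.

```python
def placement_error_percent(sample_rate_hz, pitch_hz, max_shift_samples=0.5):
    """Worst-case position error as a percentage of one pitch period.

    Assumes each position of pitch synchronization is snapped to the
    nearest whole sample, so it can move by up to max_shift_samples.
    """
    period_samples = sample_rate_hz / pitch_hz  # pitch period in samples
    return 100.0 * max_shift_samples / period_samples


# Assumed sampling rate of 16 kHz; compare a low and a high pitch.
low = placement_error_percent(16000, 100)   # period of 160 samples
high = placement_error_percent(16000, 400)  # period of 40 samples
print(low, high)
```

Under these assumptions the same half-sample error occupies four times as large a fraction of the pitch period at 400 Hz as at 100 Hz, which is consistent with the deterioration being more noticeable at high pitch frequencies.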
To overcome the above problem inherent in the speech synthesis apparatus, attempts have been made to improve the precision in the position of pitch synchronization.
For example, Patent Document 1 discloses a method and an apparatus for speech synthesis in which the sampling rate of a unit waveform is converted at the time of speech synthesis, so that the position of pitch synchronization is controlled with an accuracy finer than the minimum change in pitch period allowed by the sampling frequency. A unit waveform processing section performs n-fold sampling frequency conversion on the unit waveform sliced from a file (i.e. the above storage) by a unit waveform generation section in accordance with phonological parameters. The unit waveform processing section then re-samples the frequency-converted data at the original sampling frequency while changing the sampling start position, thereby generating n unit waveforms each having a different phase. A unit waveform placement section selects, out of these n unit waveforms, the waveform of the phase determined by a unit waveform location controller in accordance with the phonological parameter having the n-fold pitch period parameter, and places the selected waveform at a temporal position determined by the unit waveform location controller.
The processing of the conventional technique for speech synthesis, which reads unit waveforms from the storage holding the unit waveform information, based on prosody, phonology and pitch frequency, and which then carries out the conversion of the sampling rate of the so read out unit waveforms, will now be described with reference to the waveform diagrams of FIGS. 21A to 21E. It is assumed that, in the example of FIGS. 21A to 21E, the position of pitch synchronization is approximately 49.75, and that the conversion ratio is 4.
FIG. 21A shows the state before placing the unit waveform. It is assumed that, in the present example, a thick elongated line in FIG. 21A denotes the position of pitch synchronization.
It is then assumed that a unit waveform, shown in FIG. 21B, has been selected from the storage based on prosody, phonology and pitch frequency. If the sampling rate conversion is then carried out on this unit waveform, with the conversion ratio of 4, the waveform shown in FIG. 21C is generated.
One method for converting the sampling rate combines zero-sample interpolation with a low-pass filter (LPF).
With the conversion ratio equal to N, (N−1) sampling points, each with a value of zero, are inserted between neighboring sampling points, in order to make the number of data points N times that before conversion.
The resulting waveform is passed through a low-pass filter having, as the passband, the same band as that of the waveform prior to sampling rate conversion. The waveform resulting from this processing is the unit waveform of the converted sampling rate N times as high as that before conversion.
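The zero-sample interpolation and low-pass filtering described above can be sketched as follows. This is an illustrative sketch only: the Hann-windowed sinc filter, its length (`half_width`), the edge handling, and all function names are assumptions, since the filter design is not specified here.

```python
import math


def zero_stuff(x, n):
    """Insert (n - 1) zero-valued samples after each input sample."""
    out = [0.0] * (len(x) * n)
    for i, v in enumerate(x):
        out[i * n] = v
    return out


def lowpass_taps(n, half_width=8):
    """Hann-windowed sinc whose passband is the pre-conversion band.

    The cutoff is the original Nyquist frequency, and the DC gain is
    approximately n, which compensates for the inserted zeros.
    """
    mid = half_width * n
    taps = []
    for k in range(2 * mid + 1):
        t = (k - mid) / n  # time in original sampling periods
        sinc = 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)
        window = 0.5 + 0.5 * math.cos(math.pi * (k - mid) / mid)
        taps.append(sinc * window)
    return taps


def upsample(x, n, half_width=8):
    """Return x converted to n times its sampling rate (same time span)."""
    stuffed = zero_stuff(x, n)
    taps = lowpass_taps(n, half_width)
    mid = len(taps) // 2
    out = []
    for i in range(len(stuffed)):
        acc = 0.0
        for k, h in enumerate(taps):
            j = i + mid - k  # centered (zero-delay) convolution
            if 0 <= j < len(stuffed):
                acc += h * stuffed[j]
        out.append(acc)
    return out


# Hypothetical unit waveform: a short windowed cosine burst.
unit = [math.cos(0.6 * i) * math.sin(math.pi * i / 23) for i in range(24)]
converted = upsample(unit, 4)
```

With this particular filter the original samples reappear, to within rounding error, at every fourth point of `converted`, and the three points in between are interpolated values.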
From the unit waveform which has undergone sampling rate conversion, that is, the rate-converted waveform, unit waveforms are read out at the pre-conversion sampling rate, with the read start position shifted by one sample for each readout operation. This yields N unit waveforms, each with a phase (position of the waveform center of the unit waveform) differing by 1/N sample. In short, it may be said that N unit waveforms, each having a different phase, have now been generated by the sampling rate conversion.
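The readout of the N phase variants can be sketched as below. The function name and the stand-in data are hypothetical; the sketch assumes only that a rate-converted waveform is available as a list of samples.

```python
def phase_variants(rate_converted, n):
    """Read every n-th sample, once for each start offset 0 .. n-1.

    Variant p holds the samples taken at times k + p/n, measured in
    pre-conversion sampling periods, so neighboring variants differ
    in phase by 1/n sample.
    """
    return [rate_converted[p::n] for p in range(n)]


# Stand-in for a rate-converted waveform with conversion ratio 4:
# the values are just indices so the readout pattern is easy to see.
variants = phase_variants(list(range(12)), 4)
for p, v in enumerate(variants):
    print(p, v)
```

Each variant is back at the pre-conversion sampling rate and has the same length as the waveform before conversion.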
Out of the N types of unit waveforms (not shown), the waveform shown in FIG. 21D is then selected as the waveform having a phase such that the waveform center coincides with the position of pitch synchronization. The processing of extracting the waveform having a specified phase out of the unit waveforms which have undergone sampling rate conversion is processing that lowers the sampling rate, and hence is herein sometimes referred to as the ‘processing for waveform decimation’.
When the so selected unit waveform is placed at the position of pitch synchronization, there is obtained a state in which the unit waveform has been placed in position, as shown in FIG. 21E.
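For the example values used here (position of pitch synchronization approximately 49.75, conversion ratio 4), the choice of phase and of whole-sample placement can be worked through as follows. The convention is an assumption made for illustration: phase p is taken to be the waveform read out starting p samples into the rate-converted waveform, which moves the waveform center earlier by p/N sample. The function name is hypothetical.

```python
def split_pitch_mark(position, n):
    """Split a fractional position into (whole_sample, phase).

    The result satisfies position ~= whole_sample - phase / n with
    0 <= phase < n, under the assumed convention that phase p moves
    the waveform center earlier by p/n sample.
    """
    ticks = round(position * n)       # position on the 1/n-sample grid
    whole_sample = -((-ticks) // n)   # ceiling division
    phase = whole_sample * n - ticks  # remaining advance in 1/n steps
    return whole_sample, phase


whole_sample, phase = split_pitch_mark(49.75, 4)
print(whole_sample, phase)
```

Under this convention the waveform center is aligned with sample 50 of the output and the variant with phase 1 is used, so that the center falls at 50 − 1/4 = 49.75.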
[Patent Document 1]
    JP Patent Kokai Publication No. JP-A-9-31939