Artificially creating a speech signal from a given sentence is referred to as “text-to-speech synthesis”. Text-to-speech synthesis is generally carried out by three units: a text processing unit configured to carry out text normalization, morphological analysis (tokenization and POS tagging), or syntactic analysis of an entered text; a prosodic processing unit configured to predict appropriate intonation, rhythm, etc., on the basis of the text processing results and to output a phonological sequence together with prosodic information (fundamental frequency, phonological/segmental duration, power, etc.); and a speech synthesizer configured to synthesize speech signals from the phonological sequence and the prosodic information. The method of speech synthesis carried out in the speech synthesizer is therefore required to synthesize speech for a given phonological sequence with the given prosody generated in the prosodic processing unit.
As an example of the method of speech synthesis, a unit-selection type method is well known (for example, see JP-A-2001-282278 (Kokai), hereinafter referred to as Patent Document 1). In this method, the input phonological sequence is first divided into a plurality of segments (a synthetic unit sequence). Then, using the input phonological sequence and prosodic information as a target, a sequence of speech units is selected for these segments from a large quantity of speech units stored in advance, and a speech waveform is synthesized by concatenating the selected speech units.
In the method of speech synthesis disclosed in Patent Document 1, the degree of deterioration of the synthetic speech caused during the synthesis process is expressed as a cost, which is defined by a function called a “cost function”, and the speech units are selected so that the cost is minimized. For example, the distortion caused by editing the speech units and the distortion caused by concatenating them are estimated using the cost, the speech unit sequence used for the speech synthesis is selected on the basis of the cost, and the synthesized speech is generated from the selected speech unit sequence.
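The cost-minimizing selection described above can be illustrated by a minimal sketch. It is not the procedure of Patent Document 1: it assumes that the total cost is the sum of a per-segment target cost and a concatenation cost between adjacent units, and it finds the minimum-cost sequence by dynamic programming. The function name `select_units`, the two cost callables, and the representation of each speech unit as a single number (for example, a fundamental frequency in Hz) are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[float]],
    targets: Sequence[float],
    target_cost: Callable[[float, float], float],
    concat_cost: Callable[[float, float], float],
) -> Tuple[List[int], float]:
    """Pick one candidate per segment so that the summed cost is minimal.

    candidates[i] holds the candidate units of segment i (toy: one float each),
    targets[i] is the target value of segment i. Both cost functions are
    assumptions of this sketch, not taken from Patent Document 1.
    """
    # best[i][j]: minimum accumulated cost of any path ending at
    # candidate j of segment i; back[i][j]: its predecessor index.
    best = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]

    for i in range(1, len(candidates)):
        row, ptr = [], []
        for u in candidates[i]:
            # Cost of reaching u from each candidate of the previous segment.
            costs = [
                best[i - 1][k] + concat_cost(prev, u)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], u))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return path, total


# Toy example: units and targets are F0 values in Hz (illustrative only).
targets = [100.0, 110.0, 120.0]
candidates = [[95.0, 130.0], [100.0, 112.0], [119.0, 150.0]]
path, total = select_units(
    candidates,
    targets,
    target_cost=lambda t, u: abs(t - u),          # mismatch with the target
    concat_cost=lambda p, u: 0.5 * abs(p - u),    # mismatch at the joint
)
```

In this toy example the second candidate of the middle segment is chosen even though both candidates are close to the target, because it joins more smoothly with its neighbors; this is the trade-off between editing distortion and concatenation distortion that the cost is intended to capture.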
As in the method of speech synthesis disclosed in Patent Document 1, the deterioration of speech quality caused by editing and concatenating the units can be restrained by selecting an adequate speech unit sequence from a large quantity of speech units in consideration of the degree of deterioration caused by synthesizing the speech.
However, the unit-selection type method of speech synthesis disclosed in Patent Document 1 has a problem in that the speech quality of the synthesized speech is partly deteriorated.
The reasons are as follows.
The first reason is that even though a huge number of speech units are stored in advance, speech units adequate for various phonological/prosodic environments do not necessarily exist.
The second reason is that the degree of deterioration of the synthesized speech that people actually perceive cannot be represented perfectly by the cost function, and hence the optimal unit sequence cannot necessarily be selected.
The third reason is that, since the number of speech units is very large, it is difficult to exclude defective speech units in advance, and it is also difficult to design a cost function that removes them, so defective speech units may sometimes be included in the selected speech unit sequence.
Therefore, another method has been disclosed in which, instead of a single speech unit, a plurality of speech units are selected for each segment, these speech units are fused to generate a new speech unit for each segment, and the speech waveform is synthesized using the newly generated speech units (JP-A-2005-164749 (Kokai), hereinafter referred to as Patent Document 2). Hereinafter, this method is referred to as a “multiple unit selection and fusion type method of speech synthesis”.
In the multiple unit selection and fusion type method of speech synthesis disclosed in Patent Document 2, high-quality new speech units are generated by fusing the plurality of speech units selected for each segment, even when speech units adequate for the target phonological/prosodic environment do not exist, when the optimal speech units are not selected, or when defective units are selected. By carrying out the speech synthesis using the newly generated speech units, the problems of the unit-selection type method described above are alleviated, and speech synthesis with high speech quality and higher stability is realized.
However, the method of fusing the speech units disclosed in Patent Document 2 focuses specifically on the periodic components of voiced sounds and aims at averaging these components adequately.
Although the main components of a voiced sound are periodic, since it is generated mainly from the periodic pulses of vocal cord vibration as a voice source, aperiodic components are actually present as well: one is generated when the vocal tract is excited by air turbulence that occurs as aspirated air passes through a narrow point of the vocal tract or the chink of the glottis, and another is caused by fluctuations in the periodicity of the vocal cord vibrations. In particular, in the case of voiced fricatives, the aperiodic components are very important elements that determine the phonological property. As regards vowels, a husky voice or the voice of a person who speaks breathily includes relatively large aperiodic components, which do not directly affect the phonological property but are important elements that determine the speaker characteristic.
When speech units of actual voiced sounds, in which periodic components and aperiodic components are mixed, are fused in this manner, the aperiodic components, which have no correlation between units, cancel each other and are attenuated, or the phases of the aperiodic components, which should be random, are partly aligned. As a result, the naturalness of the speech may be impaired or noise may be generated.
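The attenuation of uncorrelated aperiodic components can be illustrated numerically. The following is a minimal sketch under strong simplifying assumptions that are not taken from Patent Document 2: each speech unit is modeled as an identical, phase-aligned sinusoid (the periodic component) plus independent Gaussian noise (the aperiodic component), and fusion is modeled as plain sample-wise averaging. The names `make_unit`, `fuse_by_averaging`, and `aperiodic_power`, and all parameter values, are hypothetical.

```python
import math
import random


def make_unit(rng, n_samples=2000, period=50, noise_std=0.3):
    """Toy voiced unit: a sinusoid plus independent Gaussian noise."""
    return [
        math.sin(2 * math.pi * t / period) + rng.gauss(0.0, noise_std)
        for t in range(n_samples)
    ]


def fuse_by_averaging(units):
    """Fuse units by sample-wise averaging (an assumption of this sketch)."""
    n = len(units)
    return [sum(samples) / n for samples in zip(*units)]


def aperiodic_power(unit, period=50):
    """Mean power of what remains after subtracting the known sinusoid."""
    residual = [
        x - math.sin(2 * math.pi * t / period) for t, x in enumerate(unit)
    ]
    return sum(r * r for r in residual) / len(residual)


rng = random.Random(0)
units = [make_unit(rng) for _ in range(4)]
fused = fuse_by_averaging(units)

# Ratio of aperiodic power after fusion to that of a single unit.
# For N uncorrelated noise signals the expected ratio is about 1/N.
ratio = aperiodic_power(fused) / aperiodic_power(units[0])
```

With four units the periodic component is preserved unchanged, whereas the aperiodic power is expected to fall to roughly one quarter of that of a single unit, which corresponds to the attenuation described above.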
When the fused speech units are overlapped to generate the synthesized waveform and the given target duration is longer than the duration of a speech unit, the speech unit must be elongated by repeating some of its pitch-cycle waveforms. However, the aperiodic components contained in the repeated pitch-cycle waveforms then produce an unnatural periodicity, which gives rise to a buzzy sound and degrades the naturalness of the speech.
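The unnatural periodicity introduced by repetition can likewise be illustrated with a minimal sketch. It assumes, for illustration only, that the aperiodic component of one pitch-cycle waveform is Gaussian noise of one pitch period in length and that elongation simply repeats this waveform back to back without overlap-add. The helper `normalized_autocorr` and the parameter values are hypothetical.

```python
import math
import random


def normalized_autocorr(x, lag):
    """Correlation coefficient between x and x delayed by `lag` samples."""
    head, tail = x[: len(x) - lag], x[lag:]
    num = sum(a * b for a, b in zip(head, tail))
    den = math.sqrt(sum(a * a for a in head) * sum(b * b for b in tail))
    return num / den


rng = random.Random(1)
period, repeats = 50, 8

# Aperiodic component of a single pitch-cycle waveform (toy model).
cycle_noise = [rng.gauss(0.0, 1.0) for _ in range(period)]

# Elongation by repetition: the same noise samples recur every period.
repeated = cycle_noise * repeats

# Reference: noise of the same length that is never repeated.
fresh = [rng.gauss(0.0, 1.0) for _ in range(period * repeats)]

corr_repeated = normalized_autocorr(repeated, period)
corr_fresh = normalized_autocorr(fresh, period)
```

Under these assumptions the repeated noise is almost perfectly correlated with itself at a lag of one pitch period, whereas unrepeated noise shows a correlation near zero at the same lag. A component that should be random has thus become periodic at the pitch period, which is consistent with the buzzy quality described above.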