Heretofore, there have been developed various speech synthesizing apparatuses for analyzing text and generating synthesized speech by rule-based synthesis from speech information indicated by the text.
FIG. 9 is a diagram showing a configuration of one example of a speech synthesizing apparatus of a general rule-based synthesis type. With regard to details of the configuration and operation of the speech synthesizing apparatus having this type of configuration, reference is made to descriptions of Non-Patent Documents 1 to 3 and Patent Documents 1 and 2, for example.
Referring to FIG. 9, the speech synthesizing apparatus includes a language processing unit 10, a prosody generation unit 11, a segment selection unit 16, a speech segment information storage unit 15, a prosody control unit 18, and a waveform connection unit 19.
The speech segment information storage unit 15 includes a speech segment storage unit 152 for storing an original speech waveform (referred to below as “speech segment”) divided into speech synthesis units, and an associated information storage unit 151 in which attribute information of each speech segment is stored.
Here, the original speech waveform is a natural speech waveform collected in advance for use in generating synthesized speech.
The attribute information of the speech segments includes phonological information and prosody information, such as the phoneme context in which each speech segment is uttered, pitch frequency, amplitude, and duration information.
In the speech synthesizing apparatus of FIG. 9, phonemes, CV, CVC, VCV (where V is a vowel and C is a consonant) and the like are often used as the speech synthesis unit. Details of the lengths of speech segments and synthesis units are described in Non-Patent Documents 1 and 3.
The language processing unit 10 performs morphological analysis, syntax analysis, reading analysis and the like on input text, and outputs, as language processing results, a symbol sequence representing a “reading” such as phonemic symbols, together with the morphological part of speech, conjugation, accent type and the like, to the prosody generation unit 11 and the segment selection unit 16.
The prosody generation unit 11 generates prosody information (information on pitch, length of time, power, and the like) for the synthesized speech, based on the language processing result output from the language processing unit 10, and outputs the generated prosody information to the segment selection unit 16 and the prosody control unit 18.
The segment selection unit 16 selects speech segments having a high degree of compatibility with the language processing result and the generated prosody information, from among the speech segments stored in the speech segment information storage unit 15, and outputs the selected speech segments together with their associated information to the prosody control unit 18.
The prosody control unit 18 generates a waveform having a prosody generated by the prosody generation unit 11, from the selected speech segments, and outputs the result to the waveform connection unit 19.
The waveform connection unit 19 connects the speech segments output from the prosody control unit 18 and outputs the result as synthesized speech.
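The data flow among the units of FIG. 9 described above can be sketched as follows. This is only an illustrative outline: the function `synthesize` and the toy stand-in stages are hypothetical names introduced here, and each stage is passed in as a callable because the source does not specify any concrete interface.

```python
def synthesize(text, language_processing, prosody_generation,
               segment_selection, prosody_control, waveform_connection):
    """Chain the units of FIG. 9 in the order described in the text.

    Every argument after `text` is a callable standing in for one unit;
    the signatures are assumptions made for this sketch.
    """
    # Language processing unit 10: text -> language processing result.
    lang_result = language_processing(text)
    # Prosody generation unit 11: language result -> prosody information.
    prosody = prosody_generation(lang_result)
    # Segment selection unit 16 uses both the language result and the prosody.
    segments = segment_selection(lang_result, prosody)
    # Prosody control unit 18 imposes the generated prosody on the segments.
    corrected = prosody_control(segments, prosody)
    # Waveform connection unit 19 joins the corrected segments.
    return waveform_connection(corrected)


# Toy stand-ins (hypothetical) that only demonstrate the data flow.
stages = dict(
    language_processing=lambda text: text.split(),
    prosody_generation=lambda lang: [120.0] * len(lang),
    segment_selection=lambda lang, prosody: list(zip(lang, prosody)),
    prosody_control=lambda segments, prosody: [name for name, _ in segments],
    waveform_connection=lambda pieces: "+".join(pieces),
)
result = synthesize("ko n ni chi wa", **stages)
```

Passing the units in as callables keeps the sketch independent of how any particular apparatus implements them.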
The segment selection unit 16 obtains information (referred to as target segment environment) representing characteristics of target synthesized speech, from the input language processing result and the prosody information, for each prescribed synthesis unit.
The following may be cited as information included in the target segment environment:
phoneme names of the phoneme in question, the preceding phoneme, and the subsequent phoneme,
presence or absence of stress,
distance from accent core,
pitch frequency and power at the representative point, start point, and end point of the synthesis unit, and
duration of the synthesis unit.
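The items listed above could be held in a simple record, one per synthesis unit. The following is a minimal sketch; the class name `SegmentEnvironment`, the field names, and the units (Hz, dB, ms) are assumptions for illustration and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SegmentEnvironment:
    """Target segment environment for one synthesis unit (illustrative)."""
    phoneme: str                       # phoneme in question
    prev_phoneme: Optional[str]        # preceding phoneme (None at utterance start)
    next_phoneme: Optional[str]        # subsequent phoneme (None at utterance end)
    stressed: bool                     # presence or absence of stress
    distance_from_accent_core: int     # distance from the accent core, in units
    # (start point, representative point, end point) of the unit
    pitch_hz: Tuple[float, float, float]
    power_db: Tuple[float, float, float]
    duration_ms: float                 # duration of the unit


# Example values are invented for illustration only.
target = SegmentEnvironment(
    phoneme="a", prev_phoneme="k", next_phoneme="i",
    stressed=True, distance_from_accent_core=0,
    pitch_hz=(118.0, 120.0, 122.0),
    power_db=(60.0, 62.0, 61.0),
    duration_ms=85.0,
)
```

The same record type can describe the segment environment of a stored candidate segment, which makes the two directly comparable.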
Next, when the target segment environment is given, the segment selection unit 16 selects a plurality of speech segments matching specific information (mainly the phoneme in question) designated by the target segment environment, from the speech segment information storage unit 15. The selected speech segments form candidates for speech segments used in synthesis.
For each of the selected candidate segments, the segment selection unit 16 calculates a “cost”, which is an index indicating suitability as a speech segment used in the synthesis. Because the goal is to generate synthesized speech of high sound quality, a small cost, that is, high suitability, corresponds to high sound quality of the synthesized speech. Therefore, the cost may be said to be an indicator for estimating deterioration of the sound quality of the synthesized speech.
The cost calculated by the segment selection unit 16 includes a unit cost and a concatenation cost.
The unit cost represents the estimated sound quality deterioration caused by using a candidate segment under the target segment environment, and is therefore computed from the degree of similarity between the segment environment of the candidate segment and the target segment environment.
The concatenation cost, on the other hand, represents the estimated sound quality deterioration caused by discontinuity of the segment environment between concatenated speech segments, and is therefore computed from the degree of affinity between the segment environments of adjacent candidate segments.
Various methods of calculating the unit cost and the concatenation cost have been proposed heretofore.
In general, information included in the target segment environment is used in the computation of the unit cost.
Pitch frequency, cepstrum, power, and the Δ amounts thereof (amounts of change per unit time) at the concatenation boundary of a segment are used in the computation of the concatenation cost.
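As one possible illustration of the two costs, the sketch below computes a unit cost from mismatches in pitch, duration, and phoneme context, and a concatenation cost from pitch and power jumps at the boundary. The weights, the dictionary keys, and the log-ratio pitch distance are assumptions made here; the cited documents propose various other formulations, and cepstrum and Δ features are omitted for brevity.

```python
import math


def unit_cost(target, candidate, w_pitch=1.0, w_dur=1.0, w_context=1.0):
    """Dissimilarity between a candidate's environment and the target's."""
    # Pitch mismatch measured in octaves (log2 of the frequency ratio).
    pitch = abs(math.log2(candidate["pitch_hz"] / target["pitch_hz"]))
    # Relative duration mismatch.
    dur = abs(candidate["duration_ms"] - target["duration_ms"]) / target["duration_ms"]
    # Count of neighbouring phonemes that differ from the target context.
    context = sum(candidate[k] != target[k] for k in ("prev_phoneme", "next_phoneme"))
    return w_pitch * pitch + w_dur * dur + w_context * context


def concatenation_cost(left, right, w_pitch=1.0, w_power=0.1):
    """Discontinuity at the boundary between two adjacent candidates."""
    pitch_jump = abs(math.log2(right["start_pitch_hz"] / left["end_pitch_hz"]))
    power_jump = abs(right["start_power_db"] - left["end_power_db"])
    return w_pitch * pitch_jump + w_power * power_jump


# Invented example values.
target_env = {"pitch_hz": 120.0, "duration_ms": 80.0,
              "prev_phoneme": "k", "next_phoneme": "i"}
octave_up = dict(target_env, pitch_hz=240.0)   # same context, pitch doubled
left_seg = {"end_pitch_hz": 120.0, "end_power_db": 60.0}
right_seg = {"start_pitch_hz": 120.0, "start_power_db": 60.0}
```

With these definitions, a candidate identical to the target has a unit cost of zero, and two segments whose boundary pitch and power coincide have a concatenation cost of zero.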
The segment selection unit 16 calculates the concatenation cost and the unit cost for each segment, and then uniquely obtains, for each synthesis unit, the speech segment for which both the concatenation cost and the unit cost are minimum.
Since a segment obtained by cost minimization is selected as the segment most suited to speech synthesis from among the candidate segments, it is referred to as an “optimal segment”.
The segment selection unit 16 obtains the respective optimal segments for all synthesis units, and finally outputs a sequence of optimal segments (optimal segment sequence) as a segment selection result to the prosody control unit 18.
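The source does not state how the minimization is carried out. One common way to realize it, assumed here for illustration, is a dynamic-programming (Viterbi-style) search that minimizes the sum of unit costs and concatenation costs over the whole utterance. The function name `select_optimal_sequence` and the cost-function signatures are hypothetical.

```python
def select_optimal_sequence(candidates, unit_cost, concat_cost):
    """Pick one candidate per synthesis unit minimizing the total cost.

    candidates:  list (one entry per synthesis unit) of non-empty candidate lists
    unit_cost:   function (unit_index, segment) -> float
    concat_cost: function (previous_segment, segment) -> float
    Returns (total_cost, list_of_selected_segments).
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every synthesis unit needs at least one candidate")

    # best[j]: lowest cumulative cost of any path ending in candidate j.
    best = [unit_cost(0, seg) for seg in candidates[0]]
    back = []  # back[i-1][j]: index of the best predecessor of candidate j of unit i

    for i in range(1, len(candidates)):
        new_best, pointers = [], []
        for seg in candidates[i]:
            costs = [best[k] + concat_cost(prev, seg)
                     for k, prev in enumerate(candidates[i - 1])]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + unit_cost(i, seg))
            pointers.append(k_min)
        best = new_best
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]


# Toy example: each "segment" is just its pitch in Hz (invented values).
targets = [100, 120]
cands = [[100, 112], [121, 140]]
total, chosen = select_optimal_sequence(
    cands,
    unit_cost=lambda i, seg: abs(seg - targets[i]),
    concat_cost=lambda prev, seg: 2 * abs(seg - prev),
)
```

In this toy example the search chooses 112 Hz for the first unit even though 100 Hz has the smaller unit cost, because the smoother join with 121 Hz lowers the total cost.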
In the segment selection unit 16, as described above, the speech segments having a small unit cost are selected, that is, the speech segments having a prosody close to a target prosody (prosody included in the target segment environment) are selected, but it is rare for a speech segment having a prosody equivalent to the target prosody to be selected.
Therefore, in general, after the segment selection, the prosody control unit 18 processes the speech segment waveform so that the prosody of the speech segment matches the target prosody.
A representative method of correcting the prosody of a speech segment is the PSOLA (pitch-synchronous overlap-add) method described in Non-Patent Document 4.
However, the prosody correction processing is a cause of degradation of synthesized speech. In particular, the change in pitch frequency has a large effect on sound quality degradation, and the larger the amount of the change, the larger is the sound quality deterioration.
To cope with this type of problem, methods of synthesizing with as small a prosody change amount as possible are being developed. For example, as in Non-Patent Documents 5 and 6, a method has been proposed in which a huge quantity of speech segments is prepared and the prosody of the speech segments is not corrected at all.
In this type of method, since the quantity of segments is very large, with regard to a certain input text, speech segments having a sufficiently high level of similarity with the target prosody are selected, and even if the prosody is not corrected, synthesized speech having natural prosody is generated.
However, this method has problems: it is difficult to generate synthesized speech that always has natural prosody, an extremely large storage capacity is required, and so on.
Alternatively, in Non-Patent Document 7, approaches are taken such as setting an upper limit value for the change amount of the pitch frequency and recording segments that have various pitch frequencies.
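The idea of bounding the pitch change can be illustrated with a small check that measures how far a segment's pitch must be shifted to reach the target and rejects shifts above a limit. Measuring the change in semitones and the default limit of 3 semitones are assumptions made here for illustration only; they are not taken from Non-Patent Document 7.

```python
import math


def pitch_change_semitones(segment_pitch_hz, target_pitch_hz):
    """Signed pitch shift needed to move the segment to the target pitch."""
    return 12.0 * math.log2(target_pitch_hz / segment_pitch_hz)


def within_pitch_limit(segment_pitch_hz, target_pitch_hz, max_semitones=3.0):
    """True if the required shift does not exceed the (hypothetical) limit."""
    shift = pitch_change_semitones(segment_pitch_hz, target_pitch_hz)
    return abs(shift) <= max_semitones


# Invented example: keep only candidates that need a small pitch change.
candidate_pitches_hz = [100.0, 118.0, 200.0]
usable = [p for p in candidate_pitches_hz if within_pitch_limit(p, 120.0)]
```

A check of this kind could be applied when candidate segments are gathered, so that segments requiring a large pitch modification are never passed to prosody correction.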
[Patent Document 1]
JP Patent Kokai Publication No. JP-P2005-91551A
[Patent Document 2]
JP Patent Kokai Publication No. JP-P2006-84854A
[Non-Patent Document 1]
Huang, Acero, Hon: “Spoken Language Processing”, Prentice Hall, pp. 689-836, 2001.
[Non-Patent Document 2]
Ishikawa: “Prosodic Control for Japanese Text-to-Speech Synthesis”, The Institute of Electronics, Information and Communication Engineers, Technical Report, Vol. 100, No. 392, pp. 27-34, 2000.
[Non-Patent Document 3]
Abe: “An introduction to speech synthesis units”, The Institute of Electronics, Information and Communication Engineers, Technical Report, Vol. 100, No. 392, pp. 35-42, 2000.
[Non-Patent Document 4]
Moulines, Charpentier: “Pitch-Synchronous Waveform Processing Techniques For Text-To-Speech Synthesis Using Diphones”, Speech Communication 9, pp. 453-467, 1990.
[Non-Patent Document 5]
Segi, Takagi, Ito: “A CONCATENATIVE SPEECH SYNTHESIS METHOD USING CONTEXT DEPENDENT PHONEME SEQUENCES WITH VARIABLE LENGTH AS SEARCH UNITS”, Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 115-120, 2004.
[Non-Patent Document 6]
Kawai, Toda, Ni, Tsuzaki, Tokuda: “XIMERA: A NEW TTS FROM ATR BASED ON CORPUS-BASED TECHNOLOGIES”, Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 179-184, 2004.
[Non-Patent Document 7]
Koyama, Yoshioka, Takahashi, Nakamura: “High Quality Speech Synthesis Using Reconfigurable VCV Waveform Segments with Smaller Pitch Modification”, Transactions of the Institute of Electronics, Information and Communication Engineers, D-II, Vol. J83-D-II, No. 11, pp. 2264-2275, 2000.