Heretofore, various types of speech synthesis devices have been developed for analyzing text and generating synthesized speech by rule synthesis from speech information indicated by the text.
FIG. 9 is a block diagram showing a configuration of a conventional general rule synthesis type of speech synthesis device.
Details of configuration and operation of a speech synthesis device having this type of configuration are described, for example, in Non-Patent Documents 1 to 3 and in Patent Documents 1 and 2.
The speech synthesis device shown in FIG. 9 is provided with a language processing unit 10, prosody generation unit 11, segment selection unit 16, speech segment information storage unit 15, and waveform generation unit 17 that has prosody control unit 18 and waveform concatenation unit 19.
Speech segment information storage unit 15 has speech segment storage unit 152 that stores speech segments generated for each speech synthesis unit, and associated information storage unit 151 that stores associated information of each speech segment.
Here, the speech segments are information used for generating a waveform of synthesized speech, and are often extracted from recorded natural speech waveforms. Examples of speech segments include a speech waveform itself clipped for each synthesis unit, linear prediction analysis parameters, cepstrum coefficients, and the like.
Furthermore, the associated information of a speech segment includes phonological information and prosody information, such as the phoneme environment of the natural speech from which each speech segment was extracted, pitch frequency, amplitude, duration, and the like.
In conventional speech synthesis devices, phonemes, CV, CVC, VCV (V is a vowel, C is a consonant) and the like are often used as the speech synthesis units. Details of length of these speech segments and synthesis units are described in Non-Patent Documents 1 and 3.
Language processing unit 10 performs morphological analysis, parsing, assignment of readings, and the like, with regard to inputted text, and outputs, as language processing results, a symbol string representing a “reading” (phonemic symbols or the like), together with a morphological part of speech, a conjunction, an accent type, and the like, to prosody generation unit 11 and segment selection unit 16.
Prosody generation unit 11 generates prosody information (information concerning pitch, time length, power, and the like) for the synthesized speech, based on the language processing results outputted from language processing unit 10, and outputs the generated prosody information to segment selection unit 16 and prosody control unit 18.
Segment selection unit 16 selects speech segments having a high degree of conformity with the language processing results and the generated prosody information, from among speech segments stored in speech segment information storage unit 15, and outputs the speech segments in conjunction with associated information of the selected speech segments to prosody control unit 18.
Prosody control unit 18 generates, from the selected speech segments, a waveform having a prosody close to the prosody generated by prosody generation unit 11, and outputs the waveform to waveform concatenation unit 19.
Waveform concatenation unit 19 concatenates the speech segments outputted from the prosody control unit 18 and outputs the concatenated speech segments as synthesized speech.
Segment selection unit 16 obtains information (termed “target segment context” in the following) representing characteristics of target synthesized speech, from the inputted language processing results and the prosody information, for each prescribed synthesis unit.
The following may be cited as information included in the target segment context: respective phoneme names of a phoneme in question, a preceding phoneme, and a subsequent phoneme; presence or absence of stress; distance from accent core; pitch frequency, power, and duration of the synthesis unit; cepstrum and MFCC (Mel-Frequency Cepstral Coefficients); and Δ amounts thereof (change amount per unit time).
Next, when a target segment context is given, segment selection unit 16 selects a plurality of speech segments matching specific information (mainly the phoneme in question) designated by the target segment context, from within speech segment information storage unit 15. The selected speech segments form candidates for speech segments used in synthesis.
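The candidate preselection described above can be sketched as follows. This is a minimal illustration only: the `Segment` fields, the in-memory index, and the function names are hypothetical and are not taken from the cited documents, and matching is done on the phoneme in question alone.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical associated information for one stored speech segment."""
    seg_id: int
    phoneme: str        # phoneme in question
    prev_phoneme: str   # preceding phoneme in the source recording
    next_phoneme: str   # subsequent phoneme in the source recording
    pitch_hz: float
    duration_ms: float


def build_index(segments):
    """Group stored segments by phoneme so candidates can be looked up quickly."""
    index = defaultdict(list)
    for seg in segments:
        index[seg.phoneme].append(seg)
    return index


def select_candidates(index, target_phoneme):
    """Return every stored segment whose phoneme matches the target phoneme."""
    return list(index.get(target_phoneme, []))


# Toy inventory (values are made up for illustration).
inventory = [
    Segment(0, "a", "k", "t", 120.0, 80.0),
    Segment(1, "a", "s", "n", 180.0, 95.0),
    Segment(2, "k", "#", "a", 0.0, 40.0),
]
index = build_index(inventory)
candidates = select_candidates(index, "a")
print([c.seg_id for c in candidates])
```

Only segments 0 and 1 become candidates for the target phoneme "a"; ranking them is left to the cost computation that follows.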
Then, “cost”, which is an index indicating suitability as speech segments used in the synthesis, is computed for the selected candidate segments.
Since the object is to generate synthesized speech of high sound quality, the cost is defined such that a small cost, that is, a high suitability level, corresponds to high sound quality of the synthesized speech.
Therefore, the cost may be said to be an indicator for estimating degradation of sound quality of the synthesized speech.
Here, the cost computed by segment selection unit 16 includes unit cost and concatenation cost.
The unit cost represents the estimated sound quality degradation produced by using a candidate segment under the target segment context. The unit cost is computed based on the degree of similarity between the segment context of the candidate segment and the target segment context.
On the other hand, the concatenation cost represents the estimated sound quality degradation produced by discontinuity in segment context between concatenated speech segments. The concatenation cost is computed based on the affinity level of the segment contexts of adjacent candidate segments.
Various methods of computing unit costs and concatenation costs have been proposed heretofore.
In general, information included in the target segment context is used in the computation of the unit cost; and pitch frequency, cepstrum, MFCC, short time autocorrelation, power, Δ amount thereof, and the like, with regard to a concatenation boundary of a segment, are used in the concatenation cost.
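As a rough illustration of the two costs, the sketch below computes a unit cost as a weighted sum of context mismatches and a concatenation cost from pitch and MFCC differences at the boundary. The feature names, the weights, and the specific distance measures (log pitch ratio, relative duration difference, Euclidean MFCC distance) are assumptions made for this example; the cited documents propose various other formulations. Pitch values are assumed to be positive, i.e. voiced.

```python
import math


def unit_cost(candidate, target, weights):
    """Weighted mismatch between a candidate's context and the target context."""
    cost = 0.0
    # Categorical features: a fixed penalty for each mismatching neighbour.
    for key in ("prev_phoneme", "next_phoneme"):
        if candidate[key] != target[key]:
            cost += weights[key]
    # Pitch difference measured as an absolute log ratio (assumes voiced speech).
    cost += weights["pitch"] * abs(math.log(candidate["pitch_hz"] / target["pitch_hz"]))
    # Duration difference relative to the target duration.
    cost += weights["duration"] * (
        abs(candidate["duration_ms"] - target["duration_ms"]) / target["duration_ms"]
    )
    return cost


def concatenation_cost(left, right, weights):
    """Discontinuity at the boundary between two adjacent candidate segments."""
    pitch_term = abs(math.log(left["end_pitch_hz"] / right["start_pitch_hz"]))
    spectral_term = math.dist(left["end_mfcc"], right["start_mfcc"])
    return weights["pitch"] * pitch_term + weights["mfcc"] * spectral_term


# Hypothetical weights and feature values, for illustration only.
unit_weights = {"prev_phoneme": 1.0, "next_phoneme": 1.0, "pitch": 2.0, "duration": 0.5}
concat_weights = {"pitch": 2.0, "mfcc": 0.1}

target = {"prev_phoneme": "k", "next_phoneme": "t", "pitch_hz": 120.0, "duration_ms": 80.0}
candidate = {"prev_phoneme": "k", "next_phoneme": "n", "pitch_hz": 120.0, "duration_ms": 100.0}
print(unit_cost(candidate, target, unit_weights))

left = {"end_pitch_hz": 118.0, "end_mfcc": [1.0, 2.0]}
right = {"start_pitch_hz": 118.0, "start_mfcc": [1.0, 2.0]}
print(concatenation_cost(left, right, concat_weights))
```

In this toy case the candidate is penalized for one mismatching neighbouring phoneme and a 25% duration difference, while the two boundary feature sets are identical and therefore give a concatenation cost of zero.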
In a case where two given segments are continuous in an original speech waveform, since the segment context between these segments is completely continuous, the concatenation cost value is zero.
Furthermore, in a case where segments of synthesis unit length are continuous in the original speech waveform, these continuous segments are represented as “segments of long segment length”.
Therefore, it may be said that the larger the number of continuous occurrences, the longer the segment length will be. On the other hand, shortest segment length corresponds to synthesis unit length.
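The notion of segment length can be made concrete with a small sketch. It assumes, hypothetically, that each selected segment is identified by the recording it came from and its position within that recording, so that two segments are continuous in the original waveform when they share a recording and have consecutive positions.

```python
def segment_lengths(selected):
    """Return the length of each run of segments that are continuous in the
    original recording. `selected` is a list of (utterance_id, position) pairs."""
    if not selected:
        return []
    lengths = []
    run = 1
    for prev, cur in zip(selected, selected[1:]):
        if cur[0] == prev[0] and cur[1] == prev[1] + 1:
            run += 1             # continuous in the source: the segment grows
        else:
            lengths.append(run)  # a real concatenation point: close the run
            run = 1
    lengths.append(run)
    return lengths


# Three units taken in a row from u1, one from u2, two in a row from u3.
selection = [("u1", 4), ("u1", 5), ("u1", 6), ("u2", 0), ("u3", 7), ("u3", 8)]
print(segment_lengths(selection))  # [3, 1, 2]
```

A run of length 1 corresponds to the shortest segment length, that is, the synthesis unit length.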
The concatenation cost and the unit cost are computed for each segment, and then a speech segment, for which both the concatenation cost and the unit cost are minimum, is obtained uniquely for each synthesis unit.
Since a segment obtained by cost minimization is selected as the segment best suited to speech synthesis from among the candidate segments, it is referred to as an “optimal segment”.
Segment selection unit 16 obtains respective optimal segments for all synthesis units, and finally outputs a sequence of optimal segments (optimal segment sequence) as a segment selection result to prosody control unit 18.
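Because the concatenation cost couples adjacent synthesis units, the sequence with the smallest total cost is commonly found by dynamic programming (a Viterbi-style search). The text above does not state which search procedure is used, so the following is only a sketch under that assumption; the cost functions passed in and the toy data are hypothetical.

```python
def best_sequence(candidates, unit_cost, concat_cost):
    """Find the candidate sequence minimizing the summed unit and concatenation costs.

    candidates:  one non-empty list of candidate segments per synthesis unit
    unit_cost:   function (unit_index, candidate) -> float
    concat_cost: function (previous_candidate, candidate) -> float
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every synthesis unit needs at least one candidate")

    # costs[j]: best total cost of any path ending in candidate j of the current unit.
    costs = [unit_cost(0, c) for c in candidates[0]]
    backpointers = []
    for i in range(1, len(candidates)):
        new_costs, pointers = [], []
        for cand in candidates[i]:
            transitions = [
                costs[k] + concat_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(transitions)), key=transitions.__getitem__)
            new_costs.append(transitions[k_best] + unit_cost(i, cand))
            pointers.append(k_best)
        backpointers.append(pointers)
        costs = new_costs

    # Trace back from the cheapest final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)], total


# Toy data: each candidate is (utterance_id, position, pitch_hz).
target_pitch = [105, 110]
candidates = [
    [("u1", 0, 100), ("u2", 5, 120)],
    [("u1", 1, 112), ("u3", 2, 110)],
]


def toy_unit_cost(i, cand):
    return abs(cand[2] - target_pitch[i])


def toy_concat_cost(prev, cand):
    # Zero when the two segments are continuous in the original recording.
    continuous = prev[0] == cand[0] and cand[1] == prev[1] + 1
    return 0.0 if continuous else 5.0


sequence, total = best_sequence(candidates, toy_unit_cost, toy_concat_cost)
print(sequence, total)
```

In this toy case the search keeps the two segments from `u1` that are continuous in the recording (total cost 7), even though `("u3", 2, 110)` alone has the smaller unit cost for the second unit.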
With regard to segment selection unit 16, as described above, speech segments having a small unit cost are selected.
However, although speech segments having a prosody close to the target prosody (prosody information included in the target segment context) are selected, it is rare for a speech segment having a prosody equal to the target prosody to be selected.
Therefore, in general, after the segment selection, in prosody control unit 18, a speech segment waveform is processed to make a correction so that the prosody of the speech segment matches the target prosody.
As a method of correcting the prosody of the speech segment, a method using an analysis method disclosed in Patent Document 4, for example, is cited.
According to the analysis method of Patent Document 4, plural cepstrums representing a spectrum envelope of the original speech waveform are obtained, and by driving a filter representing the plural cepstrums at a time interval corresponding to a desired pitch frequency, it is possible to reconstruct a speech waveform having the desired pitch frequency.
In addition, a PSOLA method described in Non-Patent Document 4 may be cited.
However, the prosody correction processing is a cause of sound quality degradation in the synthesized speech. In particular, changes in pitch frequency have a large effect on sound quality degradation, and the larger the amount of change, the larger the sound quality degradation becomes.
On this account, if unit selection is performed with a criterion such that the sound quality degradation accompanying the prosody correction processing becomes sufficiently small (unit cost emphasis), segment concatenation distortion becomes conspicuous.
On the other hand, if segment selection is performed with a criterion such that the concatenation distortion becomes sufficiently small (concatenation cost emphasis), sound quality degradation accompanying prosody control becomes conspicuous.
Consequently, as a method of preventing the concatenation distortion and the sound quality degradation accompanying prosody control at the same time, a method is considered in which various types of prosody information are prepared and unit selection is carried out for each, and a combination of a prosody and a unit selection result is selected so that sound quality degradation is minimized.
For example, Patent Document 3 proposes a method of repeatedly applying a parallel shift in the frequency direction to a generated pitch pattern and computing a unit selection score with the shifted pitch pattern as a target, and of obtaining the parallel shift amount and unit selection result for which the unit selection cost is smallest.
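The search over parallel shifts can be sketched as below. This is not the procedure of Patent Document 3 itself: the shift is assumed to be additive in Hz (a shift in the log-frequency domain would be equally plausible), the set of trial shifts is arbitrary, and `toy_select_units` is a hypothetical stand-in for a full unit selection that simply picks the nearest stored pitch for each target value.

```python
def search_parallel_shift(pitch_pattern, shifts, select_units):
    """Try each shift, run unit selection against the shifted pattern, and
    keep the shift whose selection result has the smallest cost."""
    best = None
    for shift in shifts:
        shifted = [f0 + shift for f0 in pitch_pattern]
        units, cost = select_units(shifted)
        if best is None or cost < best[2]:
            best = (shift, units, cost)
    return best


# Hypothetical stand-in for unit selection: nearest stored pitch per target.
stored_pitches = [100, 110, 130]


def toy_select_units(target_pattern):
    units = [min(stored_pitches, key=lambda p: abs(p - f0)) for f0 in target_pattern]
    cost = sum(abs(u - f0) for u, f0 in zip(units, target_pattern))
    return units, cost


generated_pattern = [115, 125, 145]
shift, units, cost = search_parallel_shift(
    generated_pattern, [-20, -15, -10, 0, 10], toy_select_units
)
print(shift, units, cost)  # -15 [100, 110, 130] 0
```

Shifting the whole pattern down by 15 Hz lets every target value coincide with a stored segment, so the selected segments would need no pitch correction in this toy case.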
Furthermore, Non-Patent Document 5 proposes a method of firstly obtaining a combination of segments in which concatenation distortion is small, and of selecting a unit best fitting a target prosody from among them.
Furthermore, Non-Patent Document 6 proposes a method of selecting segments using, as criteria, maximization of similarity with the target prosody and minimization of concatenation distortion, and of generating synthesized speech without performing prosody control, thereby reducing concatenation distortion while preventing the sound quality degradation accompanying prosody control.
[Patent Document 1]
JP Patent Kokai Publication No. JP-P2005-91551A
[Patent Document 2]
JP Patent Kokai Publication No. JP-P2006-84854A
[Patent Document 3]
JP Patent Kokai Publication No. JP-P2004-138728A
[Patent Document 4]
JP Patent No. 2812184
[Non-Patent Document 1]
Huang, Acero, Hon: “Spoken Language Processing,” Prentice Hall, pp. 689-836, 2001.
[Non-Patent Document 2]
Ishikawa: “Fundamentals of Prosody Control for Speech Synthesis,” The Institute of Electronics, Information and Communication Engineers, Technical Report, Vol. 100, No. 392, pp. 27-34, 2000.
[Non-Patent Document 3]
Abe: “Fundamentals of Synthesis Units for Speech Synthesis,” The Institute of Electronics, Information and Communication Engineers, Technical Report, Vol. 100, No. 392, pp. 35-42, 2000.
[Non-Patent Document 4]
Moulines, Charpentier: “Pitch-Synchronous Waveform Processing Techniques For Text-To-Speech Synthesis Using Diphones,” Speech Communication 9, pp. 453-467, 1990.
[Non-Patent Document 5]
Segi, Takagi, Ito: “A Concatenative Speech Synthesis Method Using Context Dependent Phoneme Sequences With Variable Length As Search Units,” Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 115-120, 2004.
[Non-Patent Document 6]
Kawai, Toda, Ni, Tsuzaki, Tokuda: “XIMERA: A New TTS From ATR Based On Corpus-Based Technologies,” Proceedings of 5th ISCA Speech Synthesis Workshop, pp. 179-184, 2004.