In recent years, advances in speech synthesis technology have made it possible to create synthetic speech with very high sound quality. One type of speech synthesis device that produces speech closely resembling a real voice uses a waveform concatenation method, in which speech waveforms are selected from a large segment storage unit and concatenated (for example, see Patent Literature (PTL) 1). FIG. 17 is a diagram showing a typical configuration of a waveform concatenation speech synthesis device.
The speech synthesis device shown in FIG. 17 includes a language analysis unit 501, a prosody generation unit 502, a speech segment database (DB) 503, a segment selection unit 504, and a waveform concatenation unit 505.
The language analysis unit 501 linguistically analyzes input text, and outputs pronunciation symbols and accent information. The prosody generation unit 502 generates, for each of the pronunciation symbols, prosody information such as a fundamental frequency, a duration, and power, based on the pronunciation symbols and accent information output by the language analysis unit 501. The speech segment DB 503 is a segment storage unit that stores speech waveforms as pre-recorded pieces of speech segment data (hereafter referred to simply as “speech segments”). The segment selection unit 504 selects optimum speech segments from the speech segment DB 503, based on the prosody information generated by the prosody generation unit 502. The waveform concatenation unit 505 generates synthetic speech by concatenating the speech segments selected by the segment selection unit 504.
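The data flow among units 501 to 505 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from PTL 1 or FIG. 17: the names (`Prosody`, `Segment`, `analyze_language`, `generate_prosody`, `target_cost`, `select_segments`, `concatenate`, `synthesize`), the toy analysis and prosody rules, the cost weights, and the contents of `SEGMENT_DB` are all hypothetical. In particular, "optimum" is interpreted here as the candidate whose recorded prosody is closest to the generated target prosody.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Prosody:
    f0_hz: float        # fundamental frequency
    duration_ms: float  # duration
    power_db: float     # power


@dataclass
class Segment:
    symbol: str            # pronunciation symbol this segment realizes
    prosody: Prosody       # prosody measured on the recording
    waveform: List[float]  # pre-recorded samples


def analyze_language(text: str) -> List[Tuple[str, bool]]:
    """Stand-in for unit 501. Toy rule (assumption): one symbol per
    letter, and an uppercase letter marks an accented symbol."""
    return [(ch.lower(), ch.isupper()) for ch in text if ch.isalpha()]


def generate_prosody(symbols: List[Tuple[str, bool]]) -> List[Prosody]:
    """Stand-in for unit 502. Toy rule (assumption): accented symbols
    get a higher fundamental frequency and power."""
    return [
        Prosody(140.0 if accented else 110.0, 90.0, 62.0 if accented else 58.0)
        for _, accented in symbols
    ]


def target_cost(target: Prosody, candidate: Prosody) -> float:
    """Weighted prosody distance; the weights are arbitrary assumptions."""
    return (
        abs(target.f0_hz - candidate.f0_hz) / 10.0
        + abs(target.duration_ms - candidate.duration_ms) / 10.0
        + abs(target.power_db - candidate.power_db)
    )


def select_segments(symbols, targets, db: Dict[str, List[Segment]]) -> List[Segment]:
    """Stand-in for unit 504: for each symbol, pick the stored segment
    whose prosody is closest to the target prosody."""
    return [
        min(db[sym], key=lambda seg: target_cost(target, seg.prosody))
        for (sym, _), target in zip(symbols, targets)
    ]


def concatenate(segments: List[Segment]) -> List[float]:
    """Stand-in for unit 505: join the selected waveforms end to end."""
    samples: List[float] = []
    for seg in segments:
        samples.extend(seg.waveform)
    return samples


def synthesize(text: str, db: Dict[str, List[Segment]]) -> List[float]:
    symbols = analyze_language(text)                # unit 501
    targets = generate_prosody(symbols)             # unit 502
    chosen = select_segments(symbols, targets, db)  # units 503 and 504
    return concatenate(chosen)                      # unit 505


# Hypothetical stand-in for the speech segment DB 503.
SEGMENT_DB = {
    "a": [
        Segment("a", Prosody(110.0, 90.0, 58.0), [0.1, 0.2]),
        Segment("a", Prosody(140.0, 90.0, 62.0), [0.3, 0.4]),
    ],
    "b": [Segment("b", Prosody(112.0, 85.0, 57.0), [0.5])],
}

print(synthesize("Ab", SEGMENT_DB))  # accented "a" selects the 140 Hz segment
```

In this sketch the selection considers only the prosody distance, since that is the only criterion the description above mentions; a practical device would hold a far larger set of candidates per symbol in the segment storage unit.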