With advances in text-to-speech (TTS) synthesis technology, recent years have seen the advent of numerous services and products that use human-like synthesized speech. Generally, TTS first analyzes the linguistic structure and other aspects of the input text by morphological analysis (language analysis processing). The result of the analysis is then used as the basis for generating phoneme information annotated with accents and other information. Next, based on pronunciation information, the fundamental frequency patterns and the duration time of each phoneme are estimated (prosody generation processing). Finally, waveforms are generated on the basis of the prosody information and phoneme information thus obtained (waveform generation processing). In the ensuing description, the fundamental frequency will be represented by F0 and the fundamental frequency patterns will be referred to as the F0 patterns. The prosody information generated by prosody generation processing designates the pitch and tempo of the synthesized speech and includes, for example, the F0 patterns and the duration time information about each phoneme.
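The three processing stages described above can be sketched as a small pipeline. This is only an illustrative toy under stated assumptions, not an actual TTS implementation: the names `Phoneme`, `Prosody`, `analyze_language`, `generate_prosody`, `generate_waveform`, and `synthesize` are hypothetical, each letter is treated as one phoneme, an uppercase letter stands in for an accent mark, and the waveform stage emits a plain sine tone at the estimated F0.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Phoneme:
    symbol: str     # phoneme label (here simply a lowercase letter)
    accented: bool  # accent flag attached by language analysis


@dataclass
class Prosody:
    f0_hz: List[float]        # coarse F0 pattern: one F0 value per phoneme
    durations_s: List[float]  # duration time of each phoneme, in seconds


def analyze_language(text: str) -> List[Phoneme]:
    """Language analysis processing (toy): one phoneme per letter;
    an uppercase letter is assumed to mark an accent."""
    return [Phoneme(ch.lower(), ch.isupper()) for ch in text if ch.isalpha()]


def generate_prosody(phonemes: List[Phoneme]) -> Prosody:
    """Prosody generation processing (toy): estimate F0 and duration
    for each phoneme. All numeric values are arbitrary placeholders."""
    f0 = [140.0 if p.accented else 120.0 for p in phonemes]
    dur = [0.12 if p.symbol in "aeiou" else 0.08 for p in phonemes]
    return Prosody(f0, dur)


def generate_waveform(prosody: Prosody, sample_rate: int = 16000) -> List[float]:
    """Waveform generation processing (toy): a sine tone that follows
    the F0 pattern for the duration of each phoneme."""
    samples: List[float] = []
    phase = 0.0
    for f0, dur in zip(prosody.f0_hz, prosody.durations_s):
        for _ in range(int(round(dur * sample_rate))):
            samples.append(math.sin(phase))
            phase += 2.0 * math.pi * f0 / sample_rate
    return samples


def synthesize(text: str, sample_rate: int = 16000) -> List[float]:
    """Run the three stages in order."""
    phonemes = analyze_language(text)
    prosody = generate_prosody(phonemes)
    return generate_waveform(prosody, sample_rate)
```

For instance, `synthesize("hAi")` runs the three stages in order for three phonemes, the second of which is treated as accented. The point of the sketch is the data flow: prosody information (F0 pattern and duration times) is produced between language analysis and waveform generation and is consumed by the latter.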
One known way to perform the above-mentioned prosody generation processing is to model the F0 patterns so that they can be represented by simple rules, and to use these rules to generate prosody information (e.g., see Non Patent Literature 1). Rule-based generation of prosody information, such as the method described in Non Patent Literature 1, has been used extensively because it can generate the F0 patterns with a simple model.
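To make the idea of representing the F0 patterns by simple rules concrete, the following minimal sketch applies two hand-written rules: a linearly declining baseline and a fixed boost on accented units. It is not the model of Non Patent Literature 1, whose details are not given here. The function name `rule_based_f0` and all parameter values are assumptions chosen purely for illustration.

```python
from typing import List


def rule_based_f0(
    accent_flags: List[bool],
    start_hz: float = 130.0,            # assumed F0 at the start of the utterance
    decline_hz_per_unit: float = 2.0,   # assumed baseline drop per unit
    accent_boost_hz: float = 20.0,      # assumed F0 rise on an accented unit
) -> List[float]:
    """Generate one F0 value per unit from two illustrative rules:
    rule 1: the baseline declines linearly over the utterance;
    rule 2: an accented unit is raised by a fixed amount."""
    return [
        start_hz - decline_hz_per_unit * i + (accent_boost_hz if accented else 0.0)
        for i, accented in enumerate(accent_flags)
    ]
```

With the default values, `rule_based_f0([False, True, False])` yields `[130.0, 148.0, 126.0]`. A handful of parameters determines the entire pattern, which is what makes rule-based generation simple.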
In recent years, speech synthesis methods utilizing statistical techniques have also been drawing attention. One representative method is HMM speech synthesis, which uses Hidden Markov Models (HMMs) as the statistical technique (e.g., see Non Patent Literature 2). HMM speech synthesis generates speech using a prosody model and a speech synthesis unit (parameter) model prepared from large quantities of learning data. Because HMM speech synthesis uses speech actually uttered by humans as the learning data, it can generate more human-like prosody information than the rule-based method of generating prosody information described in Non Patent Literature 1.