Text-to-speech devices that generate voice signals from input text are known. One of the leading technologies used in such devices is text-to-speech technology based on the hidden Markov model (HMM).
With the HMM-based text-to-speech technology, it is possible to generate voice signals that have the voice quality of a desired speaker (target speaker) and a desired speaking style (target speaking style). For example, it is possible to generate voice signals in a speaking style that expresses the feeling of joy.
One method for generating voice signals having the target speaker's voice quality and the target speaking style is to train an HMM in advance using recorded voice samples uttered by the target speaker in the target speaking style, and then to use the trained HMM at synthesis time. However, this method incurs a high cost for voice recording and phonetic labeling, since many utterances by the target speaker have to be recorded for every target speaking style.
Alternatively, a method is known in which voice signals having the target speaker's voice quality and a standard speaking style (i.e., a speaking style other than the target speaking style; for example, the speaking style of reading aloud in a calm manner) are modified with the characteristics of the target speaking style. Specific examples of this method include the two methods explained below.
In the first method, a standard speaking style HMM and a target speaking style HMM, both having the voice quality of the same speaker (a reference speaker), are first created in advance. Then, using voice samples uttered in the standard speaking style by the target speaker and the standard speaking style HMM having the reference speaker's voice quality, a new standard speaking style HMM having the target speaker's voice quality is created by means of a speaker adaptation technique. Next, using the relationship (the difference or the ratio) between the parameters of the standard speaking style HMM and the parameters of the target speaking style HMM, both having the reference speaker's voice quality, the standard speaking style HMM having the target speaker's voice quality is corrected into a target speaking style HMM having the target speaker's voice quality. Finally, voice signals having the target speaking style and the target speaker's voice quality are generated using the resulting target speaking style HMM.
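The correction step of the first method can be illustrated with a minimal sketch. It assumes, purely for illustration, that each HMM state is summarized by a mean vector held as a plain list of floats and that the correction is applied per dimension; the function names are hypothetical and are not taken from any particular toolkit.

```python
def correct_by_difference(target_std, ref_std, ref_style):
    """Shift the target speaker's standard-style means by the offset
    (target style minus standard style) observed for the reference speaker."""
    return [t + (s - r) for t, r, s in zip(target_std, ref_std, ref_style)]


def correct_by_ratio(target_std, ref_std, ref_style):
    """Scale the target speaker's standard-style means by the ratio
    (target style over standard style) observed for the reference speaker.
    Assumes no reference standard-style mean is zero."""
    return [t * (s / r) for t, r, s in zip(target_std, ref_std, ref_style)]


# Hypothetical two-dimensional mean vectors for one HMM state.
target_std = [1.0, 2.0]   # target speaker, standard style (after speaker adaptation)
ref_std = [2.0, 4.0]      # reference speaker, standard style
ref_style = [3.0, 5.0]    # reference speaker, target style

by_difference = correct_by_difference(target_std, ref_std, ref_style)
by_ratio = correct_by_ratio(target_std, ref_std, ref_style)
```

In this sketch the same offset or scale factor is applied to a given state whatever its context, which is one way to see why context-dependent effects of a speaking style are hard to capture with this method.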
Meanwhile, the characteristics of voice signals that are affected by changes in the speaking style include globally-appearing characteristics and locally-appearing characteristics. The locally-appearing characteristics have a context dependency that differs for each speaking style. For example, in speaking styles expressing the feeling of joy, the endings of words tend to have a rising pitch. On the other hand, in speaking styles expressing the feeling of sorrow, pauses tend to be longer. However, since the first method does not take into account the context dependency that differs for each speaking style, it is difficult to reproduce the locally-appearing characteristics of the target speaking style to a satisfactory extent.
In the second method, according to cluster adaptive training (CAT), a statistical model that represents HMM parameters as a linear combination of a plurality of cluster parameters is trained in advance using voice samples of a plurality of speakers in a plurality of speaking styles (including the standard speaking style and the target speaking style). Each cluster individually has a decision tree representing the context dependency. The combination of a particular speaker and a particular speaking style is expressed as a weight vector for forming the linear combination of the cluster parameters. The weight vector is formed by concatenating a speaker weight vector and a speaking style weight vector. In order to generate voice signals having the target speaker's voice quality and the target speaking style, CAT-based speaker adaptation is first performed using voice samples having the target speaker's voice quality and the standard speaking style, and a speaker weight vector representing the target speaker is calculated. Then, the speaker weight vector representing the target speaker is concatenated with a speaking style weight vector representing the target speaking style, which is calculated in advance, to create a weight vector that represents the target speaking style having the target speaker's voice quality. Subsequently, using the created weight vector, voice signals having the target speaking style and the target speaker's voice quality are generated.
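The weighted combination used in the second method can be sketched as follows. This is a simplified illustration, not a CAT implementation: it assumes that the mean vector of each cluster for the current context has already been selected through that cluster's decision tree, and that the speaker weights apply to the leading clusters and the speaking style weights to the remaining ones. The function name and the data layout are hypothetical.

```python
def combine_cluster_means(cluster_means, speaker_weights, style_weights):
    """Return the linear combination of per-cluster mean vectors, using the
    concatenation of the speaker weight vector and the style weight vector."""
    weights = list(speaker_weights) + list(style_weights)
    if len(weights) != len(cluster_means):
        raise ValueError("one weight is required per cluster")
    dim = len(cluster_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, cluster_means))
        for d in range(dim)
    ]


# Hypothetical example: three clusters with two-dimensional means.
cluster_means = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
speaker_weights = [1.0, 0.5]   # assumed result of speaker adaptation
style_weights = [0.25]         # assumed to be calculated in advance

combined_mean = combine_cluster_means(cluster_means, speaker_weights, style_weights)
```

Because the speaker enters this sketch only through a small number of weights, the representable voice qualities are limited to combinations of the trained clusters.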
In the second method, since each cluster individually has a decision tree, it becomes possible to reproduce the context dependency that differs for each speaking style. However, in the second method, the speaker adaptation needs to be performed within the CAT framework. Hence, as compared to speaker adaptation performed according to maximum likelihood linear regression (MLLR), the second method cannot reproduce the target speaker's voice quality as precisely.
Thus, in the first method, since the context dependency that differs for each speaking style is not taken into account, the target speaking style cannot be reproduced to a satisfactory extent. In the second method, since the CAT framework needs to be used for speaker adaptation, the target speaker's voice quality cannot be reproduced precisely.