In recent years, advances in speech synthesis technology have made it possible to generate synthetic speech with very high sound quality.
However, conventional applications of synthetic speech have been limited mainly to uses such as reading news text in a broadcaster-like voice.
Meanwhile, in mobile telephone services and the like, distinctive speech (synthetic speech that closely reproduces an individual's voice, or synthetic speech whose prosody and voice quality have distinctive features, such as the speaking style of a high school girl or a western Japanese dialect) has begun to be distributed as content. For example, a service is provided in which a message spoken by a famous person is used in place of a ring tone. To make communication between individuals more entertaining in this way, demand for generating distinctive speech and presenting it to a listener is expected to grow.
Speech synthesis methods are broadly classified into two types: a waveform connection speech synthesis method, which selects appropriate speech elements from prepared speech element databases and connects the selected speech elements to synthesize speech; and an analytic-synthetic speech synthesis method, which analyzes speech and synthesizes speech based on parameters generated by the analysis.
To vary the voice quality of synthetic speech as described above, the waveform connection speech synthesis method requires a speech element database for each required voice quality and must connect speech elements while switching among these databases. Generating synthetic speech with various voice qualities therefore incurs a significant cost.
On the other hand, the analytic-synthetic speech synthesis method can convert the voice quality of synthetic speech by converting the analyzed speech parameters. One way to convert such parameters is to use two different utterances of the same content.
Patent Reference 1 discloses an example of an analytic-synthetic speech synthesis method using learning models such as a neural network.
FIG. 1 is a diagram showing a configuration of a speech processing system using an emotion addition method of Patent Reference 1.
The speech processing system shown in FIG. 1 includes an acoustic analysis unit 2, a spectrum Dynamic Programming (DP) matching unit 4, a phoneme-based duration extending/shortening unit 6, a neural network unit 8, a rule-based synthesis parameter generation unit, a duration extending/shortening unit, and a speech synthesis system unit. The speech processing system trains the neural network unit 8 to convert acoustic feature parameters of a speech without emotion into acoustic feature parameters of a speech with emotion, and then adds emotion to the speech without emotion using the trained neural network unit 8.
The spectrum DP matching unit 4 examines, over time, the degree of similarity between the speech without emotion and the speech with emotion with respect to the spectral feature parameters among the feature parameters extracted by the acoustic analysis unit 2. It then determines the temporal correspondence between identical phonemes, and thereby calculates, for each phoneme, the temporal extending/shortening rate of the speech with emotion relative to the speech without emotion.
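The DP matching step described above can be illustrated with a short sketch. The following Python code is only an illustration under stated assumptions, not the implementation of Patent Reference 1: it assumes Euclidean distance between spectral feature vectors, a standard three-direction dynamic time warping recursion, and phoneme boundaries that are already known for the speech without emotion. The function names `dp_match` and `phoneme_rates` are hypothetical.

```python
from math import dist, inf

def dp_match(neutral, emotional):
    """Align two feature sequences by DP; return a list of (i, j) frame pairs."""
    n, m = len(neutral), len(emotional)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(neutral[i - 1], emotional[j - 1])  # assumed Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path

def phoneme_rates(path, phonemes):
    """phonemes: (label, start, end) in neutral frames, end exclusive.

    Returns (label, rate), where rate is the number of emotional frames
    aligned to the phoneme divided by its number of neutral frames.
    """
    owner = {}
    for i, j in path:
        owner[j] = i  # each emotional frame is counted once (last aligned neutral frame)
    rates = []
    for label, start, end in phonemes:
        n_emotional = sum(1 for i in owner.values() if start <= i < end)
        rates.append((label, n_emotional / (end - start)))
    return rates

# Toy example: the first phoneme is spoken twice as slowly with emotion.
neutral = [[0.0], [0.0], [1.0], [1.0]]
emotional = [[0.0], [0.0], [0.0], [0.0], [1.0], [1.0]]
path = dp_match(neutral, emotional)
print(phoneme_rates(path, [("a", 0, 2), ("i", 2, 4)]))  # [('a', 2.0), ('i', 1.0)]
```

A rate above 1 means the phoneme is longer in the speech with emotion than in the speech without emotion.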
The phoneme-based duration extending/shortening unit 6 temporally normalizes a time series of feature parameters of the speech with emotion to match the speech without emotion, according to the temporal extending/shortening rate for each phoneme generated by the spectrum DP matching unit 4.
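As a rough illustration of this temporal normalization, the sketch below resamples each phoneme segment of the speech with emotion by linear interpolation so that its length is divided by the per-phoneme rate. Linear interpolation, the rounding of the target length, and the names `resample` and `normalize_durations` are assumptions made for this example; Patent Reference 1 may normalize the time series differently.

```python
def resample(frames, target_len):
    """Linearly interpolate a list of feature vectors to target_len frames."""
    if target_len <= 0 or not frames:
        return []
    if len(frames) == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]
    scale = (len(frames) - 1) / (target_len - 1)
    out = []
    for k in range(target_len):
        pos = k * scale
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out

def normalize_durations(emotional, segments):
    """segments: (start, end, rate) per phoneme in emotional frames, end exclusive.

    rate is the extending/shortening rate of the speech with emotion relative
    to the speech without emotion, so each segment is resampled to about
    (end - start) / rate frames.
    """
    out = []
    for start, end, rate in segments:
        target_len = max(1, round((end - start) / rate))
        out.extend(resample(emotional[start:end], target_len))
    return out

# Toy example: the first phoneme is twice as long with emotion (rate 2.0).
emotional = [[0.0], [1.0], [2.0], [3.0], [10.0], [20.0]]
normalized = normalize_durations(emotional, [(0, 4, 2.0), (4, 6, 1.0)])
print(len(normalized))  # 4
```

After this step, the normalized sequence has the same number of frames per phoneme as the speech without emotion, so the two sequences can be paired frame by frame.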
In the learning, the neural network unit 8 learns differences between (i) acoustic feature parameters of the speech without emotion provided to the input layer over time and (ii) acoustic feature parameters of the speech with emotion provided to the output layer.
In the emotion addition, the neural network unit 8 estimates acoustic feature parameters of the speech with emotion from the acoustic feature parameters of the speech without emotion provided to the input layer over time, using the network weighting factors determined in the learning. In this way, the speech without emotion is converted to the speech with emotion based on the learning model.
However, the technology of Patent Reference 1 requires a recording in which the same content as a predetermined learning text is spoken with the target emotion. Therefore, when the technology of Patent Reference 1 is applied to speaker conversion, the entire predetermined learning text needs to be spoken by the target speaker. This places a heavy load on the target speaker.
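The two phases described above (learning from paired parameters, then estimating parameters for new input) can be sketched as follows. This is a deliberately simplified illustration: a single linear layer trained by batch gradient descent on squared error stands in for the neural network unit 8, whose actual structure is not described here. The function names `train_linear` and `convert`, the learning rate, and the number of epochs are all assumptions made for this example.

```python
def convert(x, w, b):
    """Estimate output parameters from one input frame using learned weights."""
    return [sum(wk * xk for wk, xk in zip(row, x)) + bo for row, bo in zip(w, b)]

def train_linear(inputs, targets, epochs=1000, lr=0.1):
    """Learn weights mapping time-aligned input frames to target frames."""
    dim_in, dim_out = len(inputs[0]), len(targets[0])
    w = [[0.0] * dim_in for _ in range(dim_out)]
    b = [0.0] * dim_out
    n = len(inputs)
    for _ in range(epochs):
        grad_w = [[0.0] * dim_in for _ in range(dim_out)]
        grad_b = [0.0] * dim_out
        for x, y in zip(inputs, targets):
            pred = convert(x, w, b)
            for o in range(dim_out):
                err = pred[o] - y[o]
                grad_b[o] += err
                for k in range(dim_in):
                    grad_w[o][k] += err * x[k]
        # Gradient step on the mean squared error.
        for o in range(dim_out):
            b[o] -= lr * grad_b[o] / n
            for k in range(dim_in):
                w[o][k] -= lr * grad_w[o][k] / n
    return w, b

# Learning: paired frames (without emotion -> with emotion), toy values.
neutral_frames = [[0.0], [1.0], [2.0], [3.0]]
emotional_frames = [[1.0], [3.0], [5.0], [7.0]]
w, b = train_linear(neutral_frames, emotional_frames)

# Emotion addition: estimate parameters for an unseen frame without emotion.
print(convert([4.0], w, b))
```

The sketch only works because the two sequences are paired frame by frame, which is why the temporal normalization precedes the learning.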
A method by which such a predetermined learning text does not need to be spoken is disclosed in Patent Reference 2. By the method disclosed in Patent Reference 2, the same content as a target speech is synthesized by a text-to-speech synthesis device, and a conversion function of a speech spectrum shape is generated using a difference between the synthesized speech and the target speech.
FIG. 2 is a block diagram of a voice quality conversion device of Patent Reference 2.
A speech signal of a target speaker is provided to a target speaker speech receiving unit 11a. The speech recognition unit 19 performs speech recognition on the speech of the target speaker (hereinafter referred to as a “target-speaker speech”) provided to the target speaker speech receiving unit 11a, and provides a pronunciation symbol sequence receiving unit 12a with the spoken content of the target-speaker speech together with pronunciation symbols. The speech synthesis unit 14 generates a synthetic speech according to the provided pronunciation symbol sequence, using a speech synthesis database in a speech synthesis data storage unit 13. The target speaker speech feature parameter extraction unit 15 analyzes the target-speaker speech and extracts feature parameters, and the synthetic speech feature parameter extraction unit 16 analyzes the generated synthetic speech and extracts feature parameters. The conversion function generation unit 17 uses both sets of feature parameters to generate functions for converting the spectrum shape of the synthetic speech to the spectrum shape of the target-speaker speech. The voice quality conversion unit 18 converts the voice quality of input signals by applying the generated conversion functions.
As described above, since a result of the speech recognition of the target-speaker speech is provided to the speech synthesis unit 14 as a pronunciation symbol sequence used for synthetic speech generation, a user does not need to provide a pronunciation symbol sequence by inputting a text or the like, which makes it possible to automate the processing.
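The idea of generating a conversion function from the difference between the synthetic speech and the target-speaker speech can be sketched as follows. The form of the conversion function in Patent Reference 2 is not described here, so this example assumes the simplest possible one: a per-band mean offset between time-aligned log-spectrum frames. The name `make_conversion` and the frame representation are hypothetical.

```python
def make_conversion(synthetic_frames, target_frames):
    """Build a conversion function from time-aligned log-spectrum frames.

    Assumption: the conversion is a per-band offset equal to the mean
    difference (target - synthetic) over all frames.
    """
    if not synthetic_frames or len(synthetic_frames) != len(target_frames):
        raise ValueError("frame sequences must be non-empty and the same length")
    n = len(synthetic_frames)
    n_bands = len(synthetic_frames[0])
    offset = [
        sum(t[k] - s[k] for s, t in zip(synthetic_frames, target_frames)) / n
        for k in range(n_bands)
    ]

    def convert(frame):
        # Shift each band of an input frame toward the target speaker's spectrum.
        return [v + o for v, o in zip(frame, offset)]

    return convert

# Toy example with two bands per frame.
synthetic = [[1.0, 2.0], [3.0, 4.0]]
target = [[2.0, 2.0], [4.0, 4.0]]
convert = make_conversion(synthetic, target)
print(convert([0.0, 0.0]))  # [1.0, 0.0]
```

Because the synthetic speech is generated from the recognized content of the target-speaker speech, the two frame sequences describe the same utterance, which is what makes a frame-wise difference meaningful.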
Moreover, a speech synthesis device that can generate a plurality of kinds of voice quality using a small amount of memory capacity is disclosed in Patent Reference 3. The speech synthesis device according to Patent Reference 3 includes an element storage unit, a plurality of vowel element storage units, and a plurality of pitch storage units. The element storage unit holds consonant elements including glide parts of vowels. Each of the vowel element storage units holds vowel elements of a single speaker. Each of the pitch storage units holds a fundamental pitch of the speaker corresponding to the vowel elements.
The speech synthesis device reads out vowel elements of a designated speaker from the plurality of vowel element storage units, and connects predetermined consonant elements stored in the element storage unit so as to synthesize a speech. Thereby, it is possible to convert voice quality of an input speech to voice quality of the designated speaker.
Patent Reference 1: Japanese Unexamined Patent Application Publication No. 7-72900 (pages 3-8, FIG. 1)
Patent Reference 2: Japanese Unexamined Patent Application Publication No. 2005-266349 (pages 9-10, FIG. 2)
Patent Reference 3: Japanese Unexamined Patent Application Publication No. 5-257494