With the recent development of voice synthesis technologies, it has become possible to generate high-quality synthetic sounds. Voice synthesis technologies using the hidden Markov model (HMM) are known to allow flexible control of a synthetic sound by means of a model obtained by parameterizing voices. Technologies for generating various types of synthetic sounds have been put to practical use, including, for example, a speaker adaptation technology for generating a high-quality synthetic sound from a small amount of recorded voice and an emotional voice technology for synthesizing an emotional voice.
Under the circumstances described above, synthetic sounds have been applied to a wider range of fields, such as reading electronic books aloud, digital signage, dialog agents, entertainment, and robots. In such applications, a user desires to generate a synthetic sound not only in the voice of a speaker prepared in advance but also in a voice of the user's own choosing. To address this, voice quality editing technologies have been developed that either change the parameters of an acoustic model of an existing speaker or generate a synthetic sound having the voice quality of a non-existent speaker by combining a plurality of acoustic models.
The conventional voice quality editing technologies mainly either change the parameters of an acoustic model themselves or reflect specified voice quality characteristics that are directly connected to the parameters of the acoustic model (e.g., a high voice or rapid speech). The voice quality desired by a user, however, is often expressed more precisely by a more abstract word, such as a "cute" voice or a "fresh" voice. As a result, there has been an increasing demand for a technology for generating a synthetic sound having a desired voice quality by specifying the voice quality with an abstract word.