Conventionally, in the field of voice conversion (a technique that converts only the speaker-individuality information of an input speaker's speech into that of an output speaker while retaining the phonological information of the input speech), the mainstream technique has been parallel voice conversion, in which parallel data (pairs of speech in which the input speaker and the output speaker utter the same content) is used for model learning.
Various statistical approaches to parallel voice conversion have been proposed, such as methods based on a GMM (Gaussian Mixture Model), NMF (Non-negative Matrix Factorization), a DNN (Deep Neural Network), and the like (see PTL 1). Although parallel voice conversion can achieve comparatively high accuracy owing to the parallel constraint, the utterance content of the input speaker must match that of the output speaker in the learning data, which impairs convenience.
In contrast, non-parallel voice conversion (a technique in which parallel data is not used for model learning) is attracting increasing attention. Although inferior to parallel voice conversion in accuracy, non-parallel voice conversion can perform learning using free utterances and is therefore superior in terms of convenience and practicality. NPL 1 discloses a technique in which a plurality of parameters are learned in advance using speech of an input speaker and speech of an output speaker, thereby converting the voice of the input speaker into the voice of the output speaker, wherein either one of the input speaker and the output speaker is contained in the learning data.