Voice conversion (VC) is a technique for altering the speaker characteristics of speech. Non-linguistic information, such as voice characteristics, is modified while the linguistic information is kept unchanged. A typical application is speaker conversion, in which the voice of one speaker (the source speaker) is converted to sound like that of another speaker (the target speaker).
The standard approaches to VC employ statistical feature mapping. The mapping function is trained in advance on a small amount of training data consisting of utterance pairs from the source and target voices. The resulting mapping function must then be able to convert any sample of the source speech into that of the target without any linguistic information such as a phoneme transcription.
A typical approach is to train a parametric model, such as a Gaussian mixture model (GMM), on the joint probability density of source and target spectra, and then to derive from it the conditional probability density of the target spectra given the source spectra to be converted.
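To make the conditional step concrete, the sketch below shows one common way to use such a model once it has been trained: the converted feature is taken as the conditional mean of the target given the source under a joint Gaussian mixture model. This is an illustration under simplifying assumptions, not a description of any particular system. Features are scalars rather than spectral vectors, the function names `gaussian_pdf` and `convert` are hypothetical, and the mixture parameters are hand-picked rather than estimated from paired utterances.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, weights, mu_x, mu_y, var_x, cov_yx):
    """Conditional mean E[y | x] under a joint GMM over (x, y).

    Every parameter is a per-component list. Features are scalars here
    (an assumption made for brevity); real systems use spectral vectors
    and full covariance blocks.
    """
    # Posterior probability of each mixture component given the source x.
    likelihoods = [
        w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, mu_x, var_x)
    ]
    total = sum(likelihoods)
    posteriors = [lik / total for lik in likelihoods]

    # Each component contributes a linear regression from x to y,
    # weighted by how responsible that component is for x.
    return sum(
        p * (my + (c / v) * (x - mx))
        for p, mx, my, v, c in zip(posteriors, mu_x, mu_y, var_x, cov_yx)
    )


# Hypothetical, hand-picked parameters for a two-component joint GMM.
WEIGHTS = [0.5, 0.5]
MU_X = [-2.0, 2.0]    # source means per component
MU_Y = [-1.0, 3.0]    # target means per component
VAR_X = [1.0, 1.0]    # source variances per component
COV_YX = [0.5, 0.8]   # target-source covariances per component

converted = convert(0.0, WEIGHTS, MU_X, MU_Y, VAR_X, COV_YX)
```

Note that the conversion needs only the source feature and the trained parameters, which is consistent with the requirement that no phoneme transcription be available at conversion time.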