One of the current challenges in speech technology is transforming the speech of one individual so that it sounds like the voice of another individual. This task is commonly referred to as voice conversion (VC). The main challenge in voice conversion is transforming the acoustic properties of the voice that form the basis of perceptual discrimination and identification of an individual. The height of the voice (pitch), for example, is believed to provide the main perceptual cue for discriminating between different speakers, while the way of speaking (e.g. prosody) and the timbre of the voice are important for identifying a particular individual's voice.
Prosody can be briefly described as the way in which the pitch of the voice progresses at the segmental (i.e. phrase) and supra-segmental levels. Most current voice conversion strategies do not process prosodic or short-term pitch information and focus instead on matching the overall pitch statistics (mean and variance) of the “source” voice to those of the “target” voice.
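The statistical pitch matching described above can be illustrated with a minimal sketch. It assumes the pitch contour is given as one fundamental-frequency value in Hz per frame, with 0 marking unvoiced frames. The function name `match_pitch_statistics` and its parameters are hypothetical and chosen only for illustration. Some systems apply the same idea to log-scaled pitch, which is not shown here.

```python
import numpy as np


def match_pitch_statistics(source_f0, target_mean, target_std):
    """Shift and scale voiced pitch values to the target mean and standard deviation.

    Assumption: frames with a value of 0 are unvoiced and are left unchanged.
    """
    f0 = np.asarray(source_f0, dtype=float)
    voiced = f0 > 0
    converted = f0.copy()
    if not voiced.any():
        return converted  # nothing to convert

    source_mean = f0[voiced].mean()
    source_std = f0[voiced].std()
    if source_std == 0:
        # A flat contour has no variance to rescale; only move it to the target mean.
        converted[voiced] = target_mean
    else:
        normalized = (f0[voiced] - source_mean) / source_std
        converted[voiced] = normalized * target_std + target_mean
    return converted


# Example: a short contour with unvoiced frames at both ends.
contour = [0.0, 100.0, 120.0, 140.0, 0.0]
converted = match_pitch_statistics(contour, target_mean=200.0, target_std=20.0)
```

Because only two global statistics are matched, the shape of the contour over time is still that of the source speaker, which is why this approach does not convert prosody.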
Voice timbre is generally determined by the human vocal system, particularly the shape and length of the vocal tract. Vocal tract length differs widely across individuals of different genders and ages. By modifying the speech waveform spectra to reflect differences in voice timbre, it is possible to transform the perceived identity, gender, or age of the voice.
The techniques for adapting the vocal tract length characteristics of one voice to those of another are commonly referred to as Vocal-Tract Length Normalization (VTLN). Typically, these VTLN techniques estimate a frequency-warping function that better matches the frequency axis of the source voice to that of the target voice. VTLN may be applied to map the timbre as a first step during voice conversion. Although the resulting sound quality may be artifact-free, VTLN does not generally produce a timbre that is perceived as close to that of the target voice.
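As an illustration of frequency warping, the following minimal sketch applies a single linear warping factor to one magnitude spectrum frame. It is a simplification made for illustration, not the method of any particular VTLN technique: the function name `warp_spectrum` and the parameter `alpha` are hypothetical, and practical systems often use piecewise-linear or other nonlinear warping functions estimated from the source and target voices.

```python
import numpy as np


def warp_spectrum(magnitude, alpha):
    """Linearly warp the frequency axis of one magnitude spectrum frame.

    A spectral feature at bin k in the input moves to roughly bin alpha * k
    in the output. alpha > 1 shifts features upward (as for a shorter vocal
    tract); alpha < 1 shifts them downward.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    mag = np.asarray(magnitude, dtype=float)
    bins = np.arange(len(mag), dtype=float)
    # Each output bin k reads the input at position k / alpha.
    # np.interp holds the edge value for positions beyond the last bin.
    return np.interp(bins / alpha, bins, mag)


# Example: a single spectral peak at bin 10 moves to bin 12 with alpha = 1.2.
bins = np.arange(64)
frame = np.exp(-0.5 * ((bins - 10) / 2.0) ** 2)
warped = warp_spectrum(frame, alpha=1.2)
```

A single global factor such as `alpha` moves every spectral feature by the same ratio, which is one way to see why this kind of warping alone may not reproduce the timbre of a specific target voice.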
Determining a frequency-warping transformation that leads to a perceptually convincing VTLN effect is challenging for multiple reasons. First, it is difficult to define a suitable correspondence between the features of the source and target spectra. Second, it is difficult to ensure a suitable progression over time if the transformation is updated on a short-term basis. There is therefore a need for a voice conversion technique that maps features between source and target spectra in a manner that accurately accounts for differences in timbre between the source and target voices.