Voice conversion technologies have historically been designed to convert a user's speaking voice to that of some target speaker. Typical systems convert only the voice color, or timbre, of the input voice to that of the target, while largely ignoring differences in pitch, speaking style, prosody, and cadence. Because speaking style carries a great deal of information about speaker identity, the usual result of this approach is an output that only partially conveys the perceivable identity of the target voice to a human listener.
Style, cadence, and prosody are arguably even more important factors in generating a natural-sounding singing voice. At a minimum, the melody of a given song is quite literally defined by the pitch progression of the singing voice; beyond that, “style” is often the defining quality of a singer's identity. Converting or generating synthetic singing voices is thus complicated by the same challenges inherent to speech prosody modeling.
Successful speech-to-singing voice conversion requires a method for utilizing, obtaining, or otherwise generating a natural and stylistic pitch progression that follows the melody of the song. It further requires a technique for automatically imposing that progression on the target voice data in a way that avoids unnatural digital artifacts, such as those caused by artificially shifting the pitch of the target voice too far from its natural range.
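By way of illustration only, the pitch-range concern described above can be made concrete with a minimal Python sketch. The sketch assumes the melody is available as a list of MIDI note numbers and that the target voice is summarized by a single median fundamental frequency; it then picks the whole-octave transposition of the melody that keeps the largest required pitch shift as small as possible. The function names, the candidate octave range, and the use of a median pitch are assumptions made for this example and are not taken from the disclosure.

```python
import math


def hz_to_midi(freq_hz: float) -> float:
    """Convert a frequency in Hz to a (fractional) MIDI note number (A4 = 440 Hz = 69)."""
    return 69.0 + 12.0 * math.log2(freq_hz / 440.0)


def required_shifts(melody_midi, speaker_median_hz, octaves):
    """Semitone shift needed at each melody note, measured from the speaker's median pitch."""
    center = hz_to_midi(speaker_median_hz)
    return [note + 12 * octaves - center for note in melody_midi]


def best_octave_shift(melody_midi, speaker_median_hz, candidates=(-2, -1, 0, 1, 2)):
    """Pick the octave transposition whose worst-case shift from the median pitch is smallest.

    The candidate range is an arbitrary choice for this sketch.
    """
    def worst_case(octaves):
        return max(abs(s) for s in required_shifts(melody_midi, speaker_median_hz, octaves))

    return min(candidates, key=worst_case)


if __name__ == "__main__":
    melody = [72, 74, 76, 77, 79]   # C5 D5 E5 F5 G5, a hypothetical melody fragment
    median_f0 = 110.0               # hypothetical low speaking voice (A2)
    octaves = best_octave_shift(melody, median_f0)
    print(octaves, required_shifts(melody, median_f0, octaves))
```

A practical system would also need to account for the spread of the voice's natural range and for the style of the pitch contour between notes; the sketch addresses only the coarse question of how far the melody sits from the voice.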