This invention relates to the transformation of a person's voice according to a target voice. More particularly, this invention relates to a transformation system where recorded information of the target voice can be used to guide the transformation process. It further relates to the transformation of a singer's voice to adopt certain characteristics of a target singer's voice, such as pitch and other prosodic factors.
There are a number of applications where it may be desirable to transform a person's voice (the source vocal signal) into a different person's voice (the target vocal signal). This invention performs such a transformation and is suited to applications where a recording of the target voice is available for use in the transformation process. Such applications include Automatic Dialogue Replacement (ADR) and Karaoke. We have chosen to describe the karaoke application because of the additional demands for accurate pitch processing in such a system, but the same principles apply for a spoken-word system.
Karaoke allows the participants to sing songs made popular by other artists. The songs produced for karaoke have the vocal track removed leaving behind only the musical accompaniment. In Japan, karaoke is the second largest leisure activity, after dining out. Some people, however, cannot participate in the karaoke experience because they are unable to sing in the correct pitch.
Often, as part of the karaoke experience, the singer tries to mimic the style and sound of the artist who originally made the recording. This desire for voice transformation is not limited to karaoke but is also important for impersonators who might mimic, for example, Elvis Presley performing one of his songs.
Most of the research in voice transformation has addressed the spoken voice as opposed to the sung voice. H. Kuwabara and Y. Sagisaka, "Acoustic characteristics of speaker individuality: Control and conversion," Speech Communication, vol. 16, 1995, separated the factors responsible for voice individuality into two categories:
physiological factors (e.g. length of the vocal tract, glottal pulse shape, and position and bandwidth of the formants), and
socio-linguistic and psychological factors, or prosodic factors (e.g. pitch contour, duration of words, timing and rhythm).
The bulk of the research into voice transformation has focused on the direct conversion of the physiological factors, particularly vocal tract length compensation and formant position/bandwidth transformation. Although it appears to be recognized that the most important factors for voice individuality are the prosodic factors, current speech technologies have not allowed useful extraction and manipulation of the prosodic features and have instead focused on direct mapping of vocal characteristics.
The inventors have found that the important characterizing parameters for successful voice conversion to a specified target depend on the target singer. For some singers, the pitch contour at the onset of notes (for example the "scooping" style of Elvis Presley) is critical. Other singers may be recognized more for the "growl" in their voice (e.g. Louis Armstrong). The style of vibrato is another important factor of voice individuality. These examples all involve prosodic factors as the key characterizing features. While physiological factors are also important, we have found that the transformation of physiological parameters need not be exact in order to achieve a convincing identity transformation. For example, it may be enough to transform the perceived vocal-tract length without having to transform the individual formant locations and bandwidths.
The present invention provides a method and apparatus for transforming the vocal characteristics of a source singer into those of a target singer. The invention relies on the decomposition of a signal from a source singer into excitation and vocal tract resonance components. It further relies on the replacement of the excitation signal of the source singer with an excitation signal derived from a target singer. This disclosure also presents methods of shifting the timbre of the source singer into that of the target singer by modifying the vocal tract resonance model. Additionally, pitch-shifting methods may be used to modify the pitch contour to better track the pitch of the source singer.
According to the invention, the excitation component and pitch contour of the vocal signal of the target singer are first obtained. This is done by extracting the excitation signal and pitch data from the target singer's voice and storing them for use in the vocal transformer.
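The separation of a voice frame into a vocal tract resonance model and an excitation signal can be illustrated with linear prediction. The following is a minimal sketch only: this section does not state which analysis method is used, so autocorrelation-based linear prediction is assumed here, and the function names `autocorrelation`, `levinson_durbin`, and `inverse_filter` are hypothetical.

```python
def autocorrelation(frame, max_lag):
    """Return autocorrelation values r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return prediction-error filter coefficients a[0..order], with a[0] = 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep what has been found so far
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a


def inverse_filter(frame, a):
    """Apply the FIR analysis filter A(z); the output is the excitation (residual)."""
    out = []
    for n in range(len(frame)):
        out.append(sum(a[k] * frame[n - k] for k in range(len(a)) if n - k >= 0))
    return out


# Example: a decaying exponential is the impulse response of a one-pole resonator,
# so a first-order analysis should recover the pole and leave a single impulse.
frame = [0.9 ** n for n in range(200)]
coeffs = levinson_durbin(autocorrelation(frame, 1), 1)
residual = inverse_filter(frame, coeffs)
```

In this sketch `coeffs` stands in for the vocal tract resonance model and `residual` for the excitation that would be stored for the target singer.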
The invention allows the transformation of voice either with or without pitch correction to match the pitch of the target singer. When used to transform voice with pitch correction, the source singer's vocal signal is converted from analog to digital data, and then separated into segments. For each segment, a voicing detector is used to determine whether the signal contains voiced or unvoiced data. If the segment contains unvoiced data, the signal is sent to the digital-to-analog converter to be played on the speaker. If the segment contains voiced data, the signal is analyzed to determine the shape of the spectral envelope, which is then used to produce a time-varying synthesis filter. If timbre and/or gender shifting or other vocal transformations are also desired, or in cases where doing so will improve the results (e.g., where the spectral shapes of the source and target voices are very different), the spectral envelope may first be transformed, then used to create the time-varying synthesis filter. The transformed vocal signal is then created by passing the target excitation signal through the synthesis filter. Finally, the amplitude envelope of the untransformed source vocal signal is used to shape the amplitude envelope of the transformed source vocal.
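The per-segment processing described above can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the voicing detector shown is a crude energy and zero-crossing test with arbitrarily chosen thresholds, the synthesis filter is assumed to be an all-pole filter whose coefficients `a_source` (with `a_source[0] = 1`) have already been derived from the source spectral envelope, and amplitude shaping is reduced to matching one RMS level per segment. All function names are hypothetical.

```python
import math


def is_voiced(frame, energy_floor=1e-4, max_zcr=0.25):
    """Crude voicing decision: enough energy and few zero crossings (assumed thresholds)."""
    n = len(frame)
    energy = sum(s * s for s in frame) / n
    crossings = sum(1 for i in range(1, n) if (frame[i - 1] < 0.0) != (frame[i] < 0.0))
    return energy > energy_floor and crossings / (n - 1) < max_zcr


def synthesis_filter(excitation, a):
    """All-pole filter 1/A(z): y[n] = x[n] - sum over k >= 1 of a[k] * y[n - k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def rms(frame):
    """Root-mean-square level of a frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def match_amplitude(frame, reference):
    """Scale frame so that its RMS level equals that of the reference frame."""
    level = rms(frame)
    if level == 0.0:
        return list(frame)
    gain = rms(reference) / level
    return [s * gain for s in frame]


def transform_segment(source_frame, target_excitation, a_source):
    """Pass unvoiced segments through; resynthesize voiced ones from the target excitation."""
    if not is_voiced(source_frame):
        return list(source_frame)
    shaped = synthesis_filter(target_excitation, a_source)
    return match_amplitude(shaped, source_frame)


# Example: a 100 Hz tone at 8 kHz stands in for a voiced source segment, and an
# impulse train stands in for the stored target excitation.
source = [0.5 * math.sin(2.0 * math.pi * 100.0 * i / 8000.0) for i in range(160)]
excitation = [1.0 if i % 80 == 0 else 0.0 for i in range(160)]
voiced_out = transform_segment(source, excitation, [1.0, -0.9])
unvoiced_in = [0.1, -0.1] * 80
unvoiced_out = transform_segment(unvoiced_in, excitation, [1.0, -0.9])
```

A spectral envelope transformation for timbre or gender shifting, where desired, would modify `a_source` before the call to `synthesis_filter`; that step is omitted from the sketch.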
When used as a voice transformer without pitch correction, two extra steps are performed. First, the pitch of the source vocal is extracted. Then, the pitch of the target excitation is shifted using a pitch shifting algorithm so that the target excitation pitch is made to track the pitch of the source vocal.
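The two extra steps can be sketched as a pitch estimate followed by a shift of the target excitation by the ratio of the two pitches. The pitch extraction and pitch shifting algorithms are not specified in this section, so the sketch assumes an autocorrelation peak picker and a naive resampling shifter with linear interpolation. Resampling also changes the duration of the signal, which a practical pitch shifter would have to compensate for. The names `estimate_pitch` and `resample_shift`, and the 60 Hz to 800 Hz search range, are hypothetical.

```python
import math


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=800.0):
    """Return the frequency in Hz of the strongest autocorrelation lag, or None."""
    n = len(frame)
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(n - 1, int(sample_rate / f_min))
    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None


def resample_shift(frame, ratio):
    """Naive pitch shift: read the frame `ratio` times faster, interpolating linearly."""
    out = []
    pos = 0.0
    while pos < len(frame) - 1:
        i = int(pos)
        frac = pos - i
        out.append((1.0 - frac) * frame[i] + frac * frame[i + 1])
        pos += ratio
    return out


# Example: the source sings at 200 Hz while the stored target excitation is at 100 Hz.
rate = 8000.0
source = [math.sin(2.0 * math.pi * 200.0 * i / rate) for i in range(400)]
target_excitation = [math.sin(2.0 * math.pi * 100.0 * i / rate) for i in range(800)]

source_pitch = estimate_pitch(source, rate)
target_pitch = estimate_pitch(target_excitation, rate)
ratio = source_pitch / target_pitch
shifted = resample_shift(target_excitation, ratio)
shifted_pitch = estimate_pitch(shifted, rate)
```

After the shift, the excitation in this example has the same estimated pitch as the source, which is the tracking behavior the paragraph above describes.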