The present invention relates to voice conversion and, more particularly, to codebook-based voice conversion systems and methodologies.
A voice conversion system receives speech from one speaker and transforms the speech to sound like the speech of another speaker. Voice conversion is useful in a variety of applications. For example, a voice recognition system may be trained to recognize a specific person""s voice or a normalized composite of voices. Voice conversion as a front-end to the voice recognition system allows a new person to effectively utilize the system by converting the new person""s voice into the voice that the voice recognition system is adapted to recognize. As a post processing step, voice conversion changes the voice of a text-to-speech synthesizer. Voice conversion also has applications in voice disguising, dialect modification, foreign-language dubbing to retain the voice of an original actor, and novelty systems such as celebrity voice impersonation, for example, in Karaoke machines.
In order to convert speech from a xe2x80x9csourcexe2x80x9d voice to a xe2x80x9ctargetxe2x80x9d voice, codebooks of the source voice and target voice are typically prepared in a training phase. A codebook is a collection of xe2x80x9cphones,xe2x80x9d which are units of speech sounds that a person utters. For example, the spoken English word xe2x80x9ccatxe2x80x9d in the General American dialect comprises three phones [K], [AE], and [T], and the word xe2x80x9ccotxe2x80x9d comprises three phones [K], [AA], and [T]. In this example, xe2x80x9ccatxe2x80x9d and xe2x80x9ccotxe2x80x9d share the initial and final consonants but employ different vowels. Codebooks are structured to provide a one-to-one mapping between the phone entries in a source codebook and the phone entries in the target codebook.
U.S. Pat. No. 5,327,521 describes a conventional voice conversion system using a codebook approach. An input signal from a source speaker is sampled and preprocessed by segmentation into xe2x80x9cframesxe2x80x9d corresponding to a speech unit. Each frame is matched to the xe2x80x9cclosestxe2x80x9d source codebook entry and then mapped to the corresponding target codebook entry to obtain a phone in the voice of the target speaker. The mapped frames are concatenated to produce speech in the target voice. A disadvantage with this and similar conventional voice conversion systems is the introduction of artifacts at frame boundaries leading to a rather rough transition across target frames. Furthermore, the variation between the sound of the input speech frame and the closest matching source codebook entry is discarded, leading to a low quality voice conversion.
A common cause for the variation between the sounds in speech and in codebook is that sounds differ depending on their position in a word. For example, the /t/ phoneme has several xe2x80x9callophones.xe2x80x9d At the beginning of a word, as in the General American pronunciation of the word xe2x80x9ctopxe2x80x9d, the /t/ phoneme is an unvoiced, fortis, aspirated, alveolar stop. In an initial cluster with an /s/, as in the word xe2x80x9cstop,xe2x80x9d it is an unvoiced, fortis, unaspirated, alveolar stop. In the middle of a word between vowels, as in xe2x80x9cpotter,xe2x80x9d it is an alveolar flap. At the end of a word, as in xe2x80x9cpot,xe2x80x9d it is an unvoiced, lenis, unaspriated, alveolar stop. Although the allophones of a consonant like /t/ are pronounced differently, a codebook with only one entry for the /t/ phoneme will produce only one kind of /t/ sound and, hence, unconvincing output. Prosody also accounts for differences in sound, since a consonant or vowel will sound somewhat different when spoken at a higher or lower pitch, more or less rapidly, and with greater or lesser emphasis.
Accordingly, one conventional attempt to improve voice conversion quality is to greatly increase the amount of training data and the number of codebook entries to account for the different allophones of the same phoneme and different prosodic conditions. Greater codebook sizes lead to increased storage and computational costs. Conventional voice conversion systems also suffer in a loss of quality because they typically perform their codebook mapping in an acoustic space defined by linear predictive coding coefficients. Linear predictive coding is an all-pole modeling of speech and, hence, does not adequately represent the zeroes in a speech signal, which are more commonly found in nasal and sounds not originating at the glottis. Linear predictive coding also has difficulties with higher pitched sounds, for example, women""s voices and children""s voices.
There exists a need for a voice conversion system and methodology having improved quality output, but preferably still computationally tractable. Differences in sound due to word position and prosody need to be addressed without increasing the size of codebooks. Furthermore, there is a need to account for voice features that are not well supported by linear predictive coding, such as the glottal excitation, nasalized sounds, and sounds not originating at the glottis.
Accordingly, one aspect of the invention is a method and a computer-readable medium bearing instructions for transforming a source signal representing a source voice into a target signal representing a target voice. The source signal is preprocessed to produce a source signal segment, which is compared with source codebook entries to produce corresponding weights. The source signal segment is transformed into a target signal segment based on the weights and corresponding target codebook entries and post processed to generate the target signal. By computing a weighted average, a composite source voice can be mapped to a corresponding composite target voice, thereby reducing artifacts at frame boundaries and leading to smoother transitions between frame boundaries without having to employ a large number of codebook entries.
In another aspect of the invention, the source signal segment is compared with the source codebook entries as line spectral frequencies to facilitate the computation of the weighted average. In still another aspect of the invention, the weights are refined by a gradient descent analysis to further improve voice quality. In a further aspect of the invention, both vocal tract characteristics and excitation characteristics are transformed according to the weights, thereby handling excitation characteristics in a computationally tractable manner.
Additional needs, objects, advantages, and novel features of the present invention will be set forth in part in the description that follows, and in part, will become apparent upon examination or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.