1. Field of the Invention
The present invention relates to a voice conversion apparatus and method which convert the voice quality of source speech into that of target speech.
2. Description of the Related Art
A technique of inputting source speech and converting its voice quality into that of target speech is called a voice conversion technique. According to the voice conversion technique, first of all, spectral information of speech is represented by a spectral parameter, and a voice conversion rule is learned from the relationship between a source spectral parameter and a target spectral parameter. Then, a spectral parameter that is obtained by analyzing arbitrary source input speech is converted into a target spectral parameter by using the voice conversion rule. The voice quality of the input speech is converted into target voice quality by synthesizing a speech waveform from the obtained spectral parameter.
As a method for voice conversion, a method of performing voice conversion based on a Gaussian mixture model (GMM) is disclosed (see, for example, reference 1 [Y. Stylianou et al., “Continuous Probabilistic Transform for Voice Conversion”, IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 2, March 1998]). According to reference 1, a GMM is obtained from source speech spectral parameters, and a regression matrix for each mixture of the GMM is obtained by performing regression analysis on pairs of source spectral parameters and target spectral parameters. These regression matrices are used as the voice conversion rule. In applying voice conversion, a target spectral parameter is obtained by applying the regression matrices, each weighted by the probability that the input source spectral parameter is output by the corresponding mixture of the GMM.
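By way of illustration only, the probability-weighted regression described for reference 1 may be sketched as follows. The sketch assumes a scalar spectral parameter, so that each regression matrix reduces to a slope and an offset; the function names and parameter values are hypothetical and are not taken from reference 1.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def posteriors(x, weights, means, variances):
    """Probability of each mixture given the source parameter x."""
    likes = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(likes)
    return [like / total for like in likes]

def convert(x, weights, means, variances, slopes, offsets):
    """Sum of per-mixture regressions a*x + b, weighted by the posteriors.

    Assumption: scalar parameters, so each regression matrix is (a, b).
    """
    post = posteriors(x, weights, means, variances)
    return sum(p * (a * x + b) for p, a, b in zip(post, slopes, offsets))

# Hypothetical two-mixture model and regression coefficients.
converted = convert(0.3, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0],
                    [2.0, 3.0], [1.0, 5.0])
print(converted)
```

In the multidimensional case, the slope and offset would become a regression matrix applied to the source parameter vector, and the scalar variances would become covariance matrices.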
In GMM regression analysis, learning is performed so as to minimize an error by using a cepstrum as a spectral parameter. It is, however, difficult to properly perform voice conversion of a component representing an aperiodic characteristic of a spectrum, e.g., the high-frequency component of the spectrum. As a result, the voice-converted speech exhibits a muffled sense and a sense of noise.
There is disclosed a voice conversion apparatus which performs conversion/grouping of frequency warping functions and spectrum slopes generated for each phoneme and performs voice conversion by using an average frequency warping function and spectrum slope of each group, thereby converting the voice quality spectrum of the first speaker into the voice quality spectrum of the second speaker (see reference 2: Japanese Patent No. 3631657). A frequency warping function is obtained by nonlinear frequency matching, and a spectrum slope is obtained by a least-squares approximated slope. Conversion is performed based on a slope difference.
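By way of illustration only, a least-squares approximated spectrum slope and a correction based on a slope difference may be sketched as follows. The sketch assumes a log-amplitude spectrum sampled at equally spaced frequency bins. The additive form of the correction is an assumption made for illustration, not the procedure of reference 2, and the function names are hypothetical.

```python
def spectrum_slope(log_spectrum):
    """Least-squares slope of a log spectrum over the frequency bin index."""
    n = len(log_spectrum)
    mean_k = (n - 1) / 2.0
    mean_s = sum(log_spectrum) / n
    num = sum((k - mean_k) * (s - mean_s) for k, s in enumerate(log_spectrum))
    den = sum((k - mean_k) ** 2 for k in range(n))
    return num / den

def apply_slope_difference(log_spectrum, source_slope, target_slope):
    """Tilt a log spectrum by the difference between two slopes.

    Assumption: the correction is added linearly along the bin index.
    """
    diff = target_slope - source_slope
    return [s + diff * k for k, s in enumerate(log_spectrum)]

# Hypothetical source log spectrum with a slope of -1 per bin.
source = [0.0, -1.0, -2.0, -3.0]
corrected = apply_slope_difference(source, spectrum_slope(source), -2.0)
print(spectrum_slope(corrected))
```

Such a correction changes only the overall tilt of the spectrum, which is consistent with the strong constraint noted for this kind of conversion rule.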
Although a frequency warping function is properly obtained for a clearly periodic component having a formant structure, it is difficult to obtain such a function for a component representing an aperiodic characteristic of a spectrum, such as the high-frequency component of the spectrum. In addition, because slope correction places strong constraints on the conversion rule, it is considered difficult to increase the similarity to the target speaker by this type of conversion. As a result, the voice-converted speech exhibits a muffled sense or a sense of noise, and the similarity with the target voice quality decreases.
A technique of inputting an arbitrary sentence and generating a speech waveform is called “text speech synthesis”. Text speech synthesis is generally performed in three steps by a language processing unit, a prosodic processing unit, and a speech synthesis unit. First, the language processing unit performs text analysis, such as morphological analysis and syntactic analysis, on the input text. The prosodic processing unit then performs accent processing and intonation processing to output phoneme sequence/prosodic information (fundamental frequency, phoneme duration, and the like). Finally, the speech synthesis unit generates a speech waveform from the phoneme sequence/prosodic information.
As one speech synthesis method, there is a segment-selection speech synthesis method which selects and synthesizes speech segment sequences from a speech segment database containing a large quantity of speech segments, considering input phoneme sequence/prosodic information as objective information. In segment-selection speech synthesis, speech segments are selected from a large quantity of speech segments stored in advance based on the input phoneme sequence/prosodic information, and the selected speech segments are concatenated to synthesize speech. In addition, there is available a plural-segment-selection speech synthesis method which selects a plurality of speech segments for each synthesis unit of an input phoneme sequence based on the degree of distortion of synthetic speech, considering the input phoneme sequence/prosodic information as objective information, generates new speech segments by fusing the plurality of selected speech segments, and synthesizes speech by concatenating them. As a fusing method, for example, a method of averaging pitch waveforms is used.
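By way of illustration only, selecting a plurality of speech segments for one synthesis unit and fusing them by averaging pitch waveforms may be sketched as follows. The cost function, the dictionary keys, and the zero-padding of shorter waveforms are assumptions made for illustration. In particular, the simple prosodic cost used here stands in for the degree of distortion of synthetic speech and is not the measure used in the methods described above.

```python
def target_cost(candidate, target):
    """Hypothetical cost: mismatch in fundamental frequency and duration."""
    return (abs(candidate["f0"] - target["f0"])
            + abs(candidate["duration"] - target["duration"]))

def select_segments(candidates, target, n_best):
    """Return the n_best candidate segments with the lowest cost."""
    return sorted(candidates, key=lambda c: target_cost(c, target))[:n_best]

def fuse_pitch_waveforms(waveforms):
    """Average pitch waveforms sample by sample.

    Assumption: shorter waveforms are zero-padded to the longest length.
    """
    length = max(len(w) for w in waveforms)
    fused = [0.0] * length
    for waveform in waveforms:
        for i, sample in enumerate(waveform):
            fused[i] += sample
    return [value / len(waveforms) for value in fused]

# Hypothetical candidate segments for one synthesis unit.
candidates = [
    {"f0": 100.0, "duration": 80.0, "pitch_waveform": [1.0, 3.0]},
    {"f0": 120.0, "duration": 80.0, "pitch_waveform": [3.0, 1.0, 2.0]},
    {"f0": 200.0, "duration": 80.0, "pitch_waveform": [9.0, 9.0]},
]
target = {"f0": 110.0, "duration": 80.0}
chosen = select_segments(candidates, target, n_best=2)
fused = fuse_pitch_waveforms([c["pitch_waveform"] for c in chosen])
print(fused)
```

A practical system would typically also account for the distortion at the joins between adjacent segments; that is omitted here for brevity.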
There is disclosed a method of performing voice conversion of a speech segment database for text speech synthesis, such as the above segment-selection speech synthesis or plural-segment-selection speech synthesis, by using a small amount of target speech data as objective data (see reference 3: JP-A 2007-193139(KOKAI)). According to reference 3, voice conversion rules are learned by using a large amount of source speech data and a small amount of target speech data, and the obtained voice conversion rules are applied to a source speech segment database for speech synthesis, thereby implementing speech synthesis of an arbitrary sentence with target voice quality. In reference 3, the voice conversion rules are based on the method disclosed in reference 1, and, as in reference 1, it is difficult to properly perform voice conversion of an aperiodic component such as the high-frequency component of a spectrum. As a result, the voice-converted speech exhibits a muffled sense or a sense of noise.
As described above, according to references 1 and 3 as conventional techniques, voice conversion is performed based on a technique such as regression analysis for spectral data. According to reference 2, voice conversion is performed by using frequency warping and slope correction. However, it is difficult to properly convert the aperiodic component of a spectrum. As a result, the speech obtained by voice conversion sometimes exhibits a muffled sense or a sense of noise, resulting in a reduction in similarity with target voice quality.
Assume that all spectral components are generated by using target speech. In this case, if only a small amount of target speech is stored in advance, it is impossible to generate proper target speech.