A technique for converting speech in a source speaker's voice into speech in a target speaker's voice is called a “voice conversion technique”. In voice conversion, the spectral information of speech is represented as a parameter, and a voice conversion rule is trained (determined) from the relationship between the spectral parameters of a source speaker and those of a target speaker. A spectral parameter is then calculated by analyzing arbitrary input speech of the source speaker, and it is converted to a spectral parameter of the target speaker by applying the voice conversion rule. By synthesizing a speech waveform from the converted spectral parameter, the voice of the input speech is converted to the target speaker's voice.
As one voice conversion method, an algorithm based on a Gaussian mixture model (GMM) is disclosed in “Continuous Probabilistic Transform for Voice Conversion, Y. Stylianou et al., IEEE Transactions on Speech and Audio Processing, Vol. 6, No. 2, March 1998” (non-patent reference 1). In this algorithm, a GMM is calculated from spectral parameters of the source speaker's speech, a regression matrix for each mixture of the GMM is calculated by regression analysis on pairs of the source speaker's spectral parameter and the target speaker's spectral parameter, and the regression matrices are set as the voice conversion rule.
When the voice conversion rule is applied, the regression matrix of each mixture is weighted by the probability that the spectral parameter of the source speaker's speech is output by that mixture of the GMM, and a spectral parameter of the target speaker's voice is obtained using the weighted regression matrices. This weighted sum by the GMM output probabilities can be regarded as interpolation of the regression analysis based on the GMM likelihood. In this case, however, the spectral parameter is not necessarily interpolated along the temporal direction of the speech, and spectral parameters that vary smoothly between temporally adjacent positions do not always remain smooth after conversion.
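The weighting described above can be illustrated with a minimal sketch. It is not the implementation of non-patent reference 1: it assumes diagonal covariances, and the names gmm_posteriors, convert_frame, A, and b are hypothetical. Each mixture m is assumed to hold a regression matrix A[m] and an offset b[m], and the converted parameter is the posterior-weighted sum of the per-mixture regressions.

```python
import numpy as np

def gmm_posteriors(x, weights, means, variances):
    """Posterior probability of each mixture given frame x.

    Assumes diagonal covariances (an assumption of this sketch).
    weights: (M,), means: (M, D), variances: (M, D), x: (D,)
    """
    diff = x - means
    log_gauss = -0.5 * np.sum(diff ** 2 / variances
                              + np.log(2.0 * np.pi * variances), axis=1)
    log_joint = np.log(weights) + log_gauss
    log_joint -= np.max(log_joint)        # stabilize before exponentiating
    post = np.exp(log_joint)
    return post / post.sum()

def convert_frame(x, weights, means, variances, A, b):
    """Convert one source frame with posterior-weighted regressions.

    A: (M, D, D) regression matrices, b: (M, D) offsets (hypothetical layout).
    """
    post = gmm_posteriors(x, weights, means, variances)
    per_mixture = np.einsum("mij,j->mi", A, x) + b   # A[m] @ x + b[m]
    return post @ per_mixture                        # weighted sum over mixtures
```

In this sketch the posteriors are computed for each frame independently, so nothing in the conversion itself constrains consecutive converted frames to vary smoothly in time.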
Furthermore, Japanese Patent No. 3703394 (patent reference 1) discloses a voice conversion apparatus that interpolates a spectral envelope conversion rule in a transition section. In the transition section between phonemes, the spectral envelope conversion rule is interpolated so that the rule of the phoneme preceding the transition section changes smoothly into the rule of the phoneme following it.
Patent reference 1 discloses linear (straight-line) interpolation of the spectral envelope conversion rule. However, when the conversion rule is trained, this method does not assume that the rule will be interpolated along the temporal direction. In short, the interpolation method assumed in training the conversion rule does not match the interpolation method used in actual conversion processing. Furthermore, the temporal change of speech is not always linear, so the quality of the converted voice often falls. Even if the conversion rule were trained under the above assumption, the constraints on the parameters of the conversion rule would increase during training. As a result, the estimation accuracy of the conversion rule falls, and the similarity between the converted voice and the target speaker's voice also falls.
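The straight-line interpolation of a conversion rule across a transition section can be sketched as follows. This is an illustration only, not the apparatus of patent reference 1: it assumes that a conversion rule can be represented as a matrix applied to a spectral parameter vector, and the names interpolate_rules and convert_transition are hypothetical.

```python
import numpy as np

def interpolate_rules(rule_prev, rule_next, num_frames):
    """Linearly interpolate from rule_prev to rule_next over num_frames >= 1 frames."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return [(1.0 - a) * rule_prev + a * rule_next for a in alphas]

def convert_transition(frames, rule_prev, rule_next):
    """Apply the interpolated rule to each frame of a (non-empty) transition section."""
    rules = interpolate_rules(rule_prev, rule_next, len(frames))
    return np.stack([rule @ x for rule, x in zip(rules, frames)])
```

Here the rules rule_prev and rule_next would have been trained without any knowledge of this interpolation, which is the mismatch between training and conversion noted above.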
Artificial generation of a speech signal from an arbitrary sentence is called “text speech synthesis”. In general, text speech synthesis includes three steps: language processing, prosody processing, and speech synthesis. First, a language processing section analyzes the input text morphologically and semantically. Next, a prosody processing section processes the accent and intonation of the text based on the analysis result, and outputs a phoneme sequence and prosodic information (fundamental frequency, phoneme segmental duration). Finally, a speech synthesis section synthesizes a speech waveform based on the phoneme sequence and prosodic information. One known speech synthesis method is the unit selection type, in which, with the input phoneme sequence and prosodic information set as a target, a speech unit sequence is selected from a speech unit database (storing a large number of speech units) and synthesized. In this method, a plurality of speech units is selected from the large number of previously stored speech units based on the input phoneme sequence and prosodic information, and speech is synthesized by concatenating the selected speech units.
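Unit selection of this kind is commonly formulated as a search for the unit sequence with the lowest total cost. The sketch below assumes such a formulation; the cost functions, the dictionary layout of units and targets, and the names target_cost, concat_cost, and select_units are hypothetical and are not taken from the methods cited in this document.

```python
def target_cost(target, unit):
    """Mismatch between a unit and the target prosody (assumed cost)."""
    return abs(target["f0"] - unit["f0"]) + abs(target["dur"] - unit["dur"])

def concat_cost(prev_unit, unit):
    """Discontinuity at the join between two units (assumed cost)."""
    return abs(prev_unit["f0"] - unit["f0"])

def select_units(targets, database):
    """Pick one unit per target by dynamic programming over the total cost.

    targets: non-empty list of {"phoneme", "f0", "dur"}
    database: dict mapping phoneme -> non-empty list of units
    """
    candidates = [database[t["phoneme"]] for t in targets]
    # best[i][j] = (cumulative cost, index of best predecessor) for candidate j
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            costs = [best[i - 1][k][0] + concat_cost(p, u)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[k_best] + target_cost(targets[i], u), k_best))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]
```

The selected units would then be concatenated to produce the synthesized speech; waveform concatenation is outside the scope of this sketch.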
A speech synthesis method of the plural unit selection type is also known. In this method, with the input phoneme sequence and prosodic information set as a target, a plurality of speech units is selected for each synthesis unit of the input phoneme sequence based on the distortion of the synthesized speech, a new speech unit is generated by fusing the selected speech units, and speech is synthesized by concatenating the fused speech units. As a fusion method, for example, pitch waveforms are averaged.
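Fusion by averaging pitch waveforms can be sketched as below. How waveforms of different lengths are aligned is not stated here, so centering each waveform and zero-padding it to the longest length is an assumption of this sketch; the name fuse_pitch_waveforms is hypothetical.

```python
import numpy as np

def fuse_pitch_waveforms(waveforms):
    """Average several pitch waveforms into one fused waveform.

    Each waveform is centered and zero-padded to the longest length
    (an assumed alignment), then the sample-wise mean is taken.
    """
    length = max(len(w) for w in waveforms)
    padded = np.zeros((len(waveforms), length))
    for i, w in enumerate(waveforms):
        start = (length - len(w)) // 2
        padded[i, start:start + len(w)] = w
    return padded.mean(axis=0)
```

For example, fusing the waveforms [1, 2, 3] and [3, 2, 1] under these assumptions gives [2, 2, 2].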
For the above-mentioned unit selection types, a method for converting the speech units stored in a text speech synthesis database, using a small amount of speech data of a target speaker, is disclosed in “Voice conversion for plural speech unit selection and fusion based speech synthesis, M. Tamura et al., Spring meeting, Acoustic Society of Japan, 1-4-13, March 2006” (non-patent reference 2). In this reference, a voice conversion rule is trained using a large amount of speech data of a source speaker and a small amount of speech data of the target speaker, and an arbitrary sentence is synthesized in the target speaker's voice by applying the voice conversion rule to the speech unit database of the source speaker. However, the voice conversion rule is based on the method of non-patent reference 1. Accordingly, as in non-patent reference 1, the converted spectral parameter is not always smooth in the temporal direction.
In non-patent references 1 and 2, a model-based voice conversion rule is created when the conversion rule is trained. However, the conversion rule is not always interpolated (not always smooth) along the temporal direction.
In patent reference 1, the voice in a transition section is converted smoothly along the temporal direction. However, when the conversion rule is trained, this method does not assume that the rule will be interpolated along the temporal direction. In short, the interpolation method assumed in training the conversion rule does not match the interpolation method used in actual conversion processing. Furthermore, the temporal change of speech is not always linear, so the quality of the converted voice often falls. Even if the conversion rule were trained under the above assumption, the constraints on the parameters of the conversion rule would increase during training. As a result, the estimation accuracy of the conversion rule falls, and the similarity between the converted voice and the target speaker's voice also falls.