1. Field of the Invention
The present invention relates to an apparatus and a method of processing speech in which rules for converting the speech of a conversion-source speaker to that of a conversion-target speaker are made.
2. Description of the Related Art
A technique of inputting the speech of a conversion-source speaker and converting the voice quality to that of a conversion-target speaker is called a voice conversion technique. In this voice conversion technique, speech spectrum information is expressed as parameters, and voice conversion rules are learned from the relationship between the spectrum parameters of the conversion-source speaker and the spectrum parameters of the conversion-target speaker. Any input speech of the conversion-source speaker is analyzed to obtain spectrum parameters, which are converted to those of the conversion-target speaker by application of the voice conversion rules, and a speech waveform is synthesized from the obtained spectrum parameters. The voice quality of the input speech is thus converted to the voice quality of the conversion-target speaker.
One voice conversion method learns conversion rules based on a Gaussian mixture model (GMM) (e.g., refer to Nonpatent Document 1: Y. Stylianou, et al., “Continuous Probabilistic Transform for Voice Conversion,” IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, Vol. 6, No. 2, March 1998). In this method, a GMM is obtained from the speech spectrum parameters of a conversion-source speaker, and a regression matrix for each mixture of the GMM is obtained by a regression analysis using pairs of the spectrum parameters of the conversion-source speaker and the spectrum parameters of the conversion-target speaker, thereby making voice conversion rules. For voice conversion, the regression matrices are weighted by the probability that the spectrum parameters of the input speech are output by each mixture of the GMM. This makes the conversion rules continuous, allowing natural voice conversion. In this way, conversion rules are learned from pairs of the speech of the conversion-source speaker and the speech of the conversion-target speaker. In Nonpatent Document 1, speech data of the two speakers are associated with each other in short phonetic units by dynamic time warping (DTW) to form conversion-rule learning data. Thus, with the known voice-conversion-rule making apparatus disclosed in Nonpatent Document 1, speech data of the same content uttered by a conversion-source speaker and a conversion-target speaker are associated with each other, and conversion rules are learned from the associated data.
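The probability-weighted use of per-mixture regression matrices described above can be illustrated with a minimal sketch. It is not the implementation of Nonpatent Document 1: it assumes diagonal covariances, an affine regression (matrix plus bias) per mixture, and already-trained parameters. All function and parameter names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at vector x (assumed form)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)


def posteriors(x, weights, means, variances):
    """Probability that x was output by each mixture of the source-speaker GMM."""
    likes = [w * gaussian_pdf(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    total = sum(likes)
    return [like / total for like in likes]


def apply_regression(matrix, bias, x):
    """Apply one mixture's regression: matrix * x + bias."""
    return [sum(a * xj for a, xj in zip(row, x)) + b
            for row, b in zip(matrix, bias)]


def convert(x, weights, means, variances, matrices, biases):
    """Sum the per-mixture regression outputs, weighted by the posteriors."""
    post = posteriors(x, weights, means, variances)
    y = [0.0] * len(biases[0])
    for p, matrix, bias in zip(post, matrices, biases):
        y_k = apply_regression(matrix, bias, x)
        y = [acc + p * v for acc, v in zip(y, y_k)]
    return y


# Toy one-dimensional example with two mixtures (all values are made up).
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
matrices = [[[1.0]], [[2.0]]]
biases = [[0.0], [1.0]]
converted = convert([2.0], weights, means, variances, matrices, biases)
```

Because the posteriors change smoothly as the input moves between mixtures, the converted output also changes smoothly, which is the continuity property noted above.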
Generating a speech waveform from any input sentence is referred to as text-to-speech synthesis. Text-to-speech synthesis is generally performed in three steps by a language processing means, a prosody processing means, and a speech synthesizing means. Input text is first subjected to a morphological analysis and a syntax analysis by the language processing means, and is then processed for accent and intonation by the prosody processing means, whereby a phoneme sequence and prosodic information (fundamental frequency, phoneme duration, etc.) are output. Finally, the speech synthesizing means generates a speech waveform according to the phoneme sequence and prosodic information. One speech synthesis method is of a speech-unit selection type, in which speech units are selected from a speech unit database containing a large number of stored speech units, using the input phoneme sequence and prosodic information as a target, and the selected speech units are concatenated to synthesize speech. Another speech synthesis method is of a plural-unit selection type, in which a plurality of speech units are selected for each synthesis unit in an input phoneme sequence according to the degree of distortion of the synthetic speech relative to the target given by the input phoneme sequence and prosodic information, the selected speech units are fused to generate new speech units, and the new speech units are concatenated to synthesize speech (e.g., refer to Japanese Application KOKAI 2005-164749). An example of the method of fusing speech units is a method of averaging pitch-cycle waveforms.
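The plural-unit selection and fusion steps described above can be sketched as follows. This is a simplified illustration, not the method of the cited application: the distortion measure here is a hypothetical target cost over fundamental frequency and duration only, and the pitch-cycle waveforms are assumed to be already extracted and of equal length. All names are hypothetical.

```python
def target_cost(unit, target):
    """Hypothetical distortion: relative F0 error plus relative duration error."""
    f0_error = abs(unit["f0"] - target["f0"]) / target["f0"]
    dur_error = abs(unit["dur"] - target["dur"]) / target["dur"]
    return f0_error + dur_error


def select_units(candidates, target, count):
    """Pick the `count` candidate units with the lowest distortion to the target."""
    return sorted(candidates, key=lambda u: target_cost(u, target))[:count]


def fuse_pitch_cycles(waveforms):
    """Fuse units by averaging their pitch-cycle waveforms sample by sample."""
    length = len(waveforms[0])
    if any(len(w) != length for w in waveforms):
        raise ValueError("pitch-cycle waveforms must have equal length")
    return [sum(samples) / len(waveforms) for samples in zip(*waveforms)]


# Toy database for one synthesis unit (all values are made up).
candidates = [
    {"f0": 100.0, "dur": 0.10, "wave": [0.0, 1.0, 0.0, -1.0]},
    {"f0": 120.0, "dur": 0.10, "wave": [0.0, 0.5, 0.0, -0.5]},
    {"f0": 200.0, "dur": 0.20, "wave": [1.0, 1.0, 1.0, 1.0]},
]
target = {"f0": 110.0, "dur": 0.10}

selected = select_units(candidates, target, count=2)
fused = fuse_pitch_cycles([u["wave"] for u in selected])
```

A practical system would also account for the distortion introduced at the joins between adjacent units; that term is omitted here to keep the sketch short.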
Consider applying voice conversion to the speech-unit database of a text-to-speech synthesizer using a small amount of speech data of a conversion-target speaker. This would enable speech synthesis of any sentence with the voice quality of a conversion-target speaker for whom only limited speech data is available. In order to apply the method disclosed in the above-mentioned Nonpatent Document 1 to this voice conversion, speech data of the same content must be prepared for both the conversion-source speaker and the conversion-target speaker, and voice conversion rules are made from these data. Accordingly, when voice conversion rules are learned by the method disclosed in Nonpatent Document 1 using mass speech data of a conversion-source speaker and low-volume speech data of a conversion-target speaker, the speech contents usable for learning are limited to those uttered by the conversion-target speaker, so that only the limited speech contents are used to learn voice conversion rules even though a mass speech unit database of the conversion-source speaker is available. This prevents learning of voice conversion rules that reflect the information contained in the mass speech unit database of the conversion-source speaker.
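The need for speech data of the same content follows from how the learning pairs are formed: DTW aligns two utterances of the same content frame by frame. The following is a minimal sketch of a textbook DTW with unit step weights, offered only as an illustration and not as the exact procedure of Nonpatent Document 1. The function name and the scalar toy frames are assumptions; real frames would be spectrum parameter vectors.

```python
def dtw_path(source, target, dist):
    """Return the total alignment cost and the list of aligned index pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning source[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(source[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # source frame advances
                                 cost[i][j - 1],      # target frame advances
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the aligned frame pairs.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy scalar "frames"; the target speaker holds the second sound longer.
source_frames = [1.0, 2.0, 3.0]
target_frames = [1.0, 2.0, 2.0, 3.0]
total, pairs = dtw_path(source_frames, target_frames, lambda a, b: abs(a - b))
```

Each pair in `pairs` would supply one source-target training pair for the regression analysis. If the two utterances differed in content, no path would yield meaningful pairs, which is why only the contents also uttered by the conversion-target speaker can contribute to learning.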
As described above, the related art has the problem that when voice conversion rules are learned using mass speech data of a conversion-source speaker and low-volume speech data of a conversion-target speaker, the speech contents of the speech data usable as learning data are limited, thus preventing learning of voice conversion rules that reflect the information contained in the mass speech unit database of the conversion-source speaker.