(1) Field of the Invention
The present invention relates to a voice quality conversion apparatus that converts voice quality of an input speech into another voice quality, and a pitch conversion apparatus that converts a pitch of the input speech into another pitch.
(2) Description of the Related Art
In recent years, advances in speech synthesis technologies have made it possible to produce synthesized speech with significantly high sound quality.
However, synthesized speech is still used mainly for uniform purposes, such as reading news text in the manner of a news announcer.
Meanwhile, services for mobile telephones and the like have begun to distribute speech having a distinctive feature as content. Examples include synthesized speech that closely reproduces a particular person's voice, and synthesized speech with a distinct prosody and voice quality, such as the speech style of a high-school girl or the distinct intonation of the Kansai region in Japan. Thus, in pursuit of further amusement in interpersonal communication, demand for creating distinctive speech to be heard by the other party is expected to grow.
One known conventional speech synthesis method is the analysis-synthesis system, which analyzes speech and synthesizes it from the resulting parameters. In the analysis-synthesis system, a speech signal is analyzed based on the speech production process and separated into a parameter indicating vocal tract information (hereinafter referred to as vocal tract information) and a parameter indicating sound source information (hereinafter referred to as sound source information). Furthermore, the voice quality of the synthesized speech can be converted into another voice quality by modifying each of the separated parameters. Here, a model known as a sound source/vocal tract model is used for the analysis.
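As a rough illustration of the separation described above, the following sketch uses linear prediction (LPC), one common all-pole realization of a sound source/vocal tract model. The text does not state which analysis method is used, so the choice of LPC, the function names, and the toy signal are all assumptions made for illustration. The predictor coefficients stand in for the vocal tract information, and the inverse-filter residual stands in for the sound source information.

```python
def autocorr(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag."""
    return [sum(x[i] * x[i + k] for i in range(len(x) - k))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Levinson-Durbin recursion: returns A(z) = [1, a1, ..., ap] and the
    final prediction error, given autocorrelation values r[0..order]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def inverse_filter(x, a):
    """Apply A(z) to x; the residual is an estimate of the sound source."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def synthesize(e, a):
    """Apply 1/A(z) to e (all-pole vocal tract filter, zero initial state)."""
    y = []
    for n in range(len(e)):
        fb = sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(e[n] - fb)
    return y


# Toy "speech" (assumption): an impulse train passed through a two-pole resonator.
source = [1.0 if n % 100 == 0 else 0.0 for n in range(400)]
true_a = [1.0, -1.3, 0.8]
speech = synthesize(source, true_a)

# Analysis: separate into vocal tract (a) and sound source (residual).
a, err = levinson_durbin(autocorr(speech, 2), 2)
residual = inverse_filter(speech, a)

# Synthesis: recombining the two parameters restores the speech.
resynth = synthesize(residual, a)
```

Modifying `a` or `residual` before the final `synthesize` call corresponds to modifying the vocal tract or sound source parameter, respectively, before resynthesis.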
In such an analysis-synthesis system, only the speaker feature of an input speech can be converted by synthesizing the input text using a small amount of speech (for example, vowel utterances) having the target voice quality. Although the input speech generally has natural temporal movement (a dynamic feature), the small amount of speech having the target voice quality (such as utterances of isolated vowels) has little temporal movement. When voice quality is converted using these two kinds of speech, the voice quality must be converted to the speaker feature (static feature) of the target voice quality while the temporal movement of the input speech is maintained. To meet this need, Japanese Patent No. 4246792 discloses morphing vocal tract information between an input speech and a speech with the target voice quality so that the static feature of the target voice quality is represented while the dynamic feature of the input speech is maintained. Applying such a conversion to sound source information as well makes it possible to generate speech closer to the target voice quality.
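The idea of keeping the dynamic feature while moving toward the static feature of the target can be sketched for a single parameter trajectory as follows. This is only an illustration of the general idea and is not the method of Japanese Patent No. 4246792, whose details are not given here. The function name `morph_trajectory`, the use of the trajectory mean as the static component, and the `ratio` parameter are assumptions.

```python
def morph_trajectory(input_traj, target_static, ratio):
    """Shift the static level of input_traj toward target_static.

    input_traj:    parameter values over time from the input speech
    target_static: one representative value from the target voice
                   (for example, from an isolated vowel)
    ratio:         0.0 keeps the input as is, 1.0 fully adopts the target level
    """
    mean_in = sum(input_traj) / len(input_traj)    # static part of the input
    shift = ratio * (target_static - mean_in)
    # The frame-to-frame movement (dynamic part) is left unchanged.
    return [v + shift for v in input_traj]


converted = morph_trajectory([0.2, 0.4, 0.6], target_static=1.0, ratio=1.0)
```

With `ratio=1.0`, the mean of `converted` equals the target value while the differences between successive frames remain those of the input.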
Furthermore, the speech synthesis technologies include a method of generating a sound source waveform representing sound source information, using a sound source model. For example, the Rosenberg-Klatt model (RK model) is known as such a sound source model (see "Analysis, synthesis, and perception of voice quality variations among female and male talkers", Journal of the Acoustical Society of America, 87(2), February 1990, pp. 820-857).
This method models a sound source waveform in the time domain and generates a sound source waveform using the parameters that represent the modeled waveform. Using the RK model, a sound source feature can be flexibly changed by modifying these parameters.
Equation 1 indicates a sound source waveform (r) modeled in the time domain using the RK model.
$$
\begin{aligned}
r(n, \eta) &= r_c(nT_s, \eta) \\
r_c(nT_s, \eta) &=
\begin{cases}
\dfrac{27\,AV}{2\,OQ^{2}\,t_0}\,(t + OQ\,t_0) - \dfrac{81\,AV}{4\,OQ^{3}\,t_0^{2}}\,(t + OQ\,t_0)^{2}, & -OQ\,t_0 < t \le 0 \\
0, & \text{elsewhere}
\end{cases} \\
\eta &= (AV,\ t_0,\ OQ)
\end{aligned}
\qquad \text{[Equation 1]}
$$
Here, t denotes continuous time, Ts denotes the sampling period, and n denotes discrete time in units of Ts. Furthermore, AV (an abbreviation of Amplitude of Voice) denotes the amplitude of the voiced sound source, t0 denotes the fundamental period, and OQ (an abbreviation of open quotient) denotes the proportion of the fundamental period during which the glottis is open. η denotes the set of AV, t0, and OQ.
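A minimal sketch of how Equation 1 could be evaluated sample by sample is given below. The function names `rk_source` and `rk_period`, the example parameter values, and the 16 kHz sampling rate are assumptions for illustration. The sketch also assumes that the shifted time argument in both terms of Equation 1 is t + OQ·t0, which matches the stated interval −OQ·t0 < t ≤ 0.

```python
def rk_source(n, av, t0, oq, ts):
    """Return r(n, eta) = r_c(n*Ts, eta) for eta = (AV, t0, OQ)."""
    t = n * ts
    if -oq * t0 < t <= 0.0:
        x = t + oq * t0                      # time elapsed since glottal opening
        lin = (27.0 * av) / (2.0 * oq ** 2 * t0)
        quad = (81.0 * av) / (4.0 * oq ** 3 * t0 ** 2)
        return lin * x - quad * x ** 2
    return 0.0                               # glottis closed: no excitation


def rk_period(av, t0, oq, ts):
    """Sample one fundamental period ending at n = 0 (hypothetical helper)."""
    num = round(t0 / ts)
    return [rk_source(n, av, t0, oq, ts) for n in range(-num + 1, 1)]


# Example values (assumptions): 100 Hz fundamental, OQ = 0.5, 16 kHz sampling.
pulse = rk_period(av=1.0, t0=0.01, oq=0.5, ts=1.0 / 16000.0)
```

Substituting t = 0 into Equation 1 gives −27·AV/(4·OQ), so in this sketch a larger `oq` yields a shallower negative peak at the end of the open phase. This is one concrete way in which modifying a model parameter changes the sound source feature.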
Since the RK model represents the sound source waveform, which has a fine structure, with a relatively simple model, it has the advantage that voice quality can be flexibly changed by modifying a model parameter. However, because of the limited representational capability of the model, the fine structure of the sound source spectrum, that is, the spectrum of an actual sound source waveform, cannot be sufficiently represented. As a result, there is a problem that the synthesized speech lacks naturalness and sounds highly synthetic.
The present invention has been conceived to solve these problems, and has an object of providing a voice quality conversion apparatus and a pitch conversion apparatus, each of which can obtain natural voice quality even when the shape of a sound source waveform is changed or the fundamental frequency of the sound source waveform is converted.