1. Field of the Invention
The present invention relates to a singing voice synthesizing apparatus that synthesizes a singing voice, a method of synthesizing a singing voice, and a program for realizing the method thereof.
2. Description of the Related Art
In the past, there have been a wide range of attempts to synthesize a singing voice.
One of these attempts, an application of speech synthesis by rule, receives inputs of pitch data, which corresponds to the pitch of a note, and of lyric data, and synthesizes speech using a synthesis-by-rule device for text-to-speech synthesis. In most cases, raw waveform data or analyzed and parameterized data are stored in a database in units of phonemes or phoneme chains comprised of two or more phonemes. At the time of synthesis, required voice fragments (phonemes or phoneme chains) are selected, concatenated, and synthesized. Examples are disclosed in Japanese Laid-Open Patent Publications (Kokai) Nos. S62-6299, H10-124082, and H11-1184490, among others.
However, since the object of these technologies is to synthesize a speaking voice, they are not always capable of synthesizing a singing voice with satisfactory quality.
For example, a singing voice synthesized by overlapping and adding waveforms, as typified by PSOLA (Pitch-Synchronous OverLap and Add), has a good degree of comprehensibility. However, elongated tones, in which the quality of a singing voice varies the most, often sound unnatural, and the synthesized voice also sounds unnatural when slight fluctuations of pitch and vibrato, which are essential for a singing voice, are applied.
Moreover, attempting to synthesize a singing voice using a waveform concatenating type speech synthesizing device with a large-scale corpus base would require an astronomically large number of fragment data if the original data are to be concatenated and output without any processing.
On the other hand, synthesizers designed from the outset to synthesize a singing voice have also been proposed. A well-known example is formant synthesis (Japanese Laid-Open Patent Publication (Kokai) No. 3-200300). However, although this method offers a large degree of freedom with respect to the quality of elongated sounds and to fluctuations of vibrato and pitch, the clarity of the synthesized sounds (especially consonants) is poor, and therefore the quality is not always satisfactory.
U.S. Pat. No. 5,029,509 discloses a technique known as Spectral Modeling Synthesis (SMS) for analyzing and synthesizing a musical sound using a model that expresses an original sound as comprised of two components, namely a deterministic component and a stochastic component.
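The two-component model described above can be illustrated with a minimal Python sketch. It is not the method of the cited patent: the function name `synthesize_frame`, its parameters, and the use of plain white noise for the stochastic component are assumptions made for illustration only (an SMS implementation would typically shape the noise with a spectral envelope).

```python
import math
import random


def synthesize_frame(partials, noise_gain, n_samples, sample_rate, seed=0):
    """Return n_samples of a signal modeled as deterministic + stochastic.

    partials   : list of (frequency_hz, amplitude) pairs; the sum of these
                 sinusoids stands in for the deterministic component.
    noise_gain : peak amplitude of white noise standing in for the
                 stochastic component (a simplifying assumption).
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        # Deterministic component: sum of sinusoidal partials.
        deterministic = sum(a * math.sin(2.0 * math.pi * f * t) for f, a in partials)
        # Stochastic component: unshaped white noise in [-noise_gain, noise_gain].
        stochastic = noise_gain * rng.uniform(-1.0, 1.0)
        samples.append(deterministic + stochastic)
    return samples


# Example: three harmonics of 220 Hz plus a small amount of noise.
frame = synthesize_frame([(220.0, 0.5), (440.0, 0.25), (660.0, 0.125)],
                         noise_gain=0.05, n_samples=256, sample_rate=16000)
```

Keeping the two components separate is what allows each to be controlled independently, for example changing the pitch of the sinusoidal part while leaving the noise part untouched.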
With SMS analysis and synthesis, good control of the musical characteristics of a musical sound is possible, and at the same time, in the case of a singing voice, through use of the stochastic component, a high degree of clarity can be expected from even the consonants. Therefore, applying this technique to the synthesis of a singing voice is expected to achieve a synthesized sound having a high degree of clarity and musicality. In fact, Japanese Patent No. 2906970 proposes specific applications for sound synthesis based on SMS analysis and synthesis techniques, and at the same time, also describes a methodology for utilizing SMS techniques in singing voice synthesis (singing synthesizer).
An application of the techniques proposed in the aforementioned Japanese Patent No. 2906970 to a singing voice synthesizing apparatus will be described with reference to FIG. 17.
In FIG. 17, an SMS-analyzer/segmentor 103 SMS-analyzes input voices and segments them into individual voice fragments (phonemes or phoneme chains), which are stored to generate a phoneme database 100. The database 100 comprises voice fragment data (phoneme data 101 and phoneme chain data 102), each consisting of a single frame or a string of frames arranged in a time series, and stores SMS data for each frame, namely changes over time of the spectral envelope of the deterministic component, the spectral envelope and phase spectrum of the stochastic component, etc.
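One possible in-memory layout for such a database is sketched below in Python. All class and field names (`SmsFrame`, `VoiceFragment`, `PhonemeDatabase`, and so on) are hypothetical and chosen only to mirror the description above; the cited patent does not specify a data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SmsFrame:
    """SMS data for one analysis frame (hypothetical layout)."""
    deterministic_envelope: List[float]  # spectral envelope of the deterministic component
    stochastic_envelope: List[float]     # spectral envelope of the stochastic component
    stochastic_phase: List[float]        # phase spectrum of the stochastic component


@dataclass
class VoiceFragment:
    """A phoneme ("a") or phoneme chain ("s-a") stored as a time series of frames."""
    name: str
    frames: List[SmsFrame] = field(default_factory=list)

    @property
    def is_chain(self) -> bool:
        # Assumed naming convention: phoneme chains contain a hyphen.
        return "-" in self.name


class PhonemeDatabase:
    """Maps fragment names to their stored SMS data."""

    def __init__(self) -> None:
        self._fragments: Dict[str, VoiceFragment] = {}

    def store(self, fragment: VoiceFragment) -> None:
        self._fragments[fragment.name] = fragment

    def lookup(self, name: str) -> VoiceFragment:
        return self._fragments[name]


# Example: store a one-frame phoneme and a two-frame phoneme chain.
db = PhonemeDatabase()
db.store(VoiceFragment("a", [SmsFrame([1.0, 0.5], [0.1, 0.1], [0.0, 0.0])]))
db.store(VoiceFragment("s-a", [SmsFrame([0.2, 0.1], [0.6, 0.4], [0.0, 1.0]),
                               SmsFrame([0.8, 0.4], [0.2, 0.1], [0.5, 0.5])]))
```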
When synthesizing a singing voice sound, a phoneme string comprising the desired lyrics is obtained, a phoneme-to-fragment converter 104 determines the required voice fragments (phonemes or phoneme chains) that comprise the phoneme string, and then SMS data (deterministic component and stochastic component) of the required voice fragments is read from the aforementioned database 100. Next, a fragment concatenator 105 concatenates the read-out SMS data of the voice fragments into a time series. For the deterministic component, based on pitch information corresponding to a melody of the song, a deterministic component generator 106 generates harmonic components having the desired pitch while preserving the shape of the spectral envelope of the deterministic component. For example, to synthesize the Japanese word “saita”, the fragments of “#s”, “s”, “s-a”, “a”, “a-i”, “i”, “i-t”, “t”, “t-a”, “a”, and “a#” are concatenated, and the deterministic component of the desired pitch is generated while preserving the shape of the spectral envelope included in the SMS data obtained from the fragment concatenation. Next, the generated deterministic component and the stochastic component are added together by a synthesizing means 107, and the result thereof is transformed into time domain data to obtain synthesized voice.
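Two steps of the flow above can be sketched in Python under stated assumptions. The function names are hypothetical, the fragment naming follows the “saita” example, and the spectral envelope is assumed to be a list of (frequency, amplitude) breakpoints with strictly increasing frequencies that is linearly interpolated; the cited patent does not prescribe any of these details.

```python
def phonemes_to_fragments(phonemes):
    """Expand a phoneme string into phonemes and phoneme chains.

    "#" marks silence at the start and end, as in the "saita" example.
    """
    if not phonemes:
        return []
    fragments = ["#" + phonemes[0]]
    for i, phoneme in enumerate(phonemes):
        fragments.append(phoneme)
        if i + 1 < len(phonemes):
            # Transition (phoneme chain) into the next phoneme.
            fragments.append(phoneme + "-" + phonemes[i + 1])
    fragments.append(phonemes[-1] + "#")
    return fragments


def envelope_at(envelope, freq_hz):
    """Linearly interpolate the envelope amplitude at freq_hz."""
    if freq_hz <= envelope[0][0]:
        return envelope[0][1]
    if freq_hz >= envelope[-1][0]:
        return envelope[-1][1]
    for (f0, a0), (f1, a1) in zip(envelope, envelope[1:]):
        if f0 <= freq_hz <= f1:
            return a0 + (a1 - a0) * (freq_hz - f0) / (f1 - f0)


def harmonics_at_pitch(envelope, pitch_hz, max_freq_hz):
    """Place harmonics at multiples of pitch_hz, sampling the stored envelope.

    The envelope shape is preserved; only the harmonic positions change.
    """
    harmonics = []
    k = 1
    while k * pitch_hz <= max_freq_hz:
        freq = k * pitch_hz
        harmonics.append((freq, envelope_at(envelope, freq)))
        k += 1
    return harmonics


fragments = phonemes_to_fragments(["s", "a", "i", "t", "a"])
# fragments == ["#s", "s", "s-a", "a", "a-i", "i", "i-t", "t", "t-a", "a", "a#"]

harmonics = harmonics_at_pitch([(0, 1.0), (1000, 0.0)], pitch_hz=250, max_freq_hz=1000)
# harmonics == [(250, 0.75), (500, 0.5), (750, 0.25), (1000, 0.0)]
```

Calling `harmonics_at_pitch` with a different `pitch_hz` resamples the same envelope at new harmonic positions, which corresponds to generating the desired pitch while preserving the envelope shape.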
By thus utilizing these SMS techniques, natural sounding synthesized singing with good comprehensibility can be obtained even for elongated sounds.
However, the method described in the aforementioned Japanese Patent No. 2906970 is overly rudimentary and simplistic, and the following types of problems will occur if a singing voice is synthesized according to that method.
(1) Because the spectral envelope shape of the deterministic component of a voiced sound changes somewhat depending on pitch, synthesis at a pitch different from the pitch used at the time of analysis cannot, by itself, achieve good tone color.
(2) When performing SMS analysis of a voiced sound, a small fraction of the deterministic component remains in the residual component even after the deterministic component is removed. Therefore, using the same residual component (stochastic component) directly to synthesize a singing sound at a pitch different from that of the original sound, as noted above, causes the residual component to stand out audibly or to sound like noise.
(3) Because the SMS analysis results of phoneme data and phoneme chain data are superposed temporally as they are, the duration of an elongated sound and the transitional time between phonemes cannot be adjusted. In other words, it is not possible to sing at a desired tempo.
(4) Noise is apt to be generated when concatenating the phonemes or phoneme chains.