A) Field of the Invention
This invention relates to a singing voice synthesizing method, a singing voice synthesizing apparatus, and a storage medium, each using a phase vocoder technique.
B) Description of the Related Art
Conventionally, a singing voice synthesizing technique using the Spectral Modeling Synthesis (SMS) technique of U.S. Pat. No. 5,029,509 is well known (for example, refer to Japanese Patent No. 2906970).
FIG. 21 shows a singing voice synthesizing apparatus adopting the technique explained in Japanese Patent No. 2906970. At Step S1, a singing voice signal is input, and at Step S2, an SMS analyzing process and a section logging process are executed on the input singing voice signal.
In the SMS analyzing process, the input voice signal is divided into a series of time frames, one set of magnitude spectrum data is generated for each frame by a Fast Fourier Transform (FFT) or the like, and line spectra corresponding to a plurality of peaks are extracted from the set of magnitude spectrum data for each frame. The data representing the amplitude values and frequencies of these line spectra is called the Deterministic Component. Next, the spectrum of the deterministic component is subtracted from the spectrum of the input voice waveform to obtain a residual spectrum. This residual spectrum is called the Stochastic Component.
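By way of a non-limiting illustration, the analysis of a single frame described above may be sketched as follows. This is a minimal sketch and not the implementation of the cited patents: a naive discrete Fourier transform stands in for the FFT, each peak is treated as a single spectrum bin, and the function names and the `threshold` parameter are hypothetical.

```python
import cmath
import math

def dft(frame):
    """Naive O(N^2) discrete Fourier transform (a stand-in for an FFT)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def analyze_frame(frame, threshold):
    """Split one frame into peak data and a residual magnitude spectrum."""
    half = len(frame) // 2
    mag = [abs(c) for c in dft(frame)[:half + 1]]
    # Assumption: a bin is a peak if it exceeds the threshold and its neighbours.
    peaks = [k for k in range(1, half)
             if mag[k] > threshold and mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]
    # Deterministic component: (bin index, amplitude) of each peak.
    deterministic = [(k, mag[k]) for k in peaks]
    # Spectrum of the deterministic component, zero everywhere except at peaks.
    det_mag = [0.0] * len(mag)
    for k, amp in deterministic:
        det_mag[k] = amp
    # Stochastic component: input spectrum minus deterministic spectrum.
    residual = [m - d for m, d in zip(mag, det_mag)]
    return deterministic, residual

# Example frame: two sinusoids located exactly on bins 4 and 9 of a 64-point frame.
N = 64
frame = [math.cos(2 * math.pi * 4 * t / N) + 0.5 * math.cos(2 * math.pi * 9 * t / N)
         for t in range(N)]
deterministic, residual = analyze_frame(frame, threshold=1.0)
```

In this example the deterministic component holds the two sinusoids, and the residual is what remains of the magnitude spectrum after they are subtracted.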
In the section logging process, the deterministic component data and the stochastic component data obtained in the SMS analyzing process are divided according to voice synthesis units. A voice synthesis unit is a structural element of lyrics. For example, a voice synthesis unit consists of a single phoneme such as [a] or [i], or a phonemic chain (a chain of a plurality of phonemes) such as [a—i] or [a—p].
In a voice synthesis unit database DB, deterministic component data and stochastic component data are stored for every voice synthesis unit.
In the singing voice synthesis, at Step S3, lyrics data and melody data are input. Then, at Step S4, a phonemic series/voice synthesis unit conversion process is executed on the phonemic series that the lyrics data represents, to divide the phonemic series into voice synthesis units. Then, the deterministic component data and the stochastic component data are read from the database DB as voice synthesis unit data for every voice synthesis unit.
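By way of a non-limiting illustration, the conversion at Step S4 may be sketched as follows. The segmentation rule is not stated in the text, so the rule used here is an assumption: every adjacent pair of phonemes (including the leading and trailing silence) yields a phonemic chain, and every phoneme listed in the hypothetical `sustained` set additionally yields a single-phoneme unit. The function names and the use of a plain hyphen in the chain labels are likewise illustrative.

```python
SILENCE = "#"

def chain_label(a, b):
    """Label a phonemic chain; silence is written next to the phoneme."""
    if a == SILENCE or b == SILENCE:
        return a + b
    return a + "-" + b

def to_synthesis_units(phonemes, sustained):
    """Convert a phoneme series into an ordered list of unit labels.

    Assumed rule: one chain per adjacent pair, plus one single-phoneme
    unit after each chain that ends on a sustained phoneme.
    """
    padded = [SILENCE] + list(phonemes) + [SILENCE]
    units = []
    for prev, cur in zip(padded, padded[1:]):
        units.append(chain_label(prev, cur))
        if cur in sustained:
            units.append(cur)
    return units

def read_units(units, db):
    """Look up the stored data for each unit label, in order of pronunciation."""
    return [db[u] for u in units]

# Example: the two-phoneme series [a][i], with both vowels treated as sustained.
units = to_synthesis_units(["a", "i"], sustained={"a", "i"})
```

The resulting list of labels can then be used as keys into the database DB, here modeled by `read_units` with a hypothetical dictionary.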
At Step S5, a voice synthesis unit connecting process is executed on the voice synthesis unit data (the deterministic component data and the stochastic component data) read from the database DB to connect the voice synthesis unit data in the order of pronunciation. At Step S6, for every voice synthesis unit, new deterministic component data adapted to the musical note pitch is generated based on the deterministic component data and the musical note pitch that the melody data indicates. At this time, if the spectrum intensity is adjusted so that the shape of the spectrum envelope of the deterministic component data processed at Step S5 is taken over, the tone color of the voice signal input at Step S1 can be reproduced with the new deterministic component data.
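By way of a non-limiting illustration, the generation of new deterministic component data at Step S6 may be sketched as follows. The text does not specify how the spectrum envelope is taken over, so linear interpolation between the original peaks is an assumption, as are the function names. The peaks are assumed to be sorted by frequency with no duplicates.

```python
def envelope_at(freq, peaks):
    """Linearly interpolate the envelope defined by (frequency, amplitude) peaks."""
    if freq <= peaks[0][0]:
        return peaks[0][1]
    if freq >= peaks[-1][0]:
        return peaks[-1][1]
    for (f0, a0), (f1, a1) in zip(peaks, peaks[1:]):
        if f0 <= freq <= f1:
            return a0 + (a1 - a0) * (freq - f0) / (f1 - f0)

def repitch(peaks, note_pitch):
    """Place harmonics at multiples of note_pitch, keeping the envelope shape."""
    top = peaks[-1][0]  # do not extend beyond the highest analyzed peak
    harmonics = []
    n = 1
    while n * note_pitch <= top:
        f = n * note_pitch
        harmonics.append((f, envelope_at(f, peaks)))
        n += 1
    return harmonics

# Example: peaks analyzed at a 100 Hz pitch, regenerated at a 150 Hz note pitch.
original = [(100.0, 1.0), (200.0, 0.5), (300.0, 0.25)]
shifted = repitch(original, 150.0)
```

Because the amplitudes of the new harmonics are read from the original envelope rather than moved together with the original peaks, the overall spectral shape is retained while the pitch changes.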
At Step S7, the deterministic component data generated at Step S6 is added to the stochastic component data processed at Step S5 for every voice synthesis unit. Then, at Step S8, the data resulting from the adding process at Step S7 is converted into a synthesized voice signal in the time domain by an inverse FFT or the like for each voice synthesis unit.
For example, to synthesize a singing voice [saita], voice synthesis unit data corresponding to voice synthesis units [#s], [s—a], [a], [a—i], [i], [i—t], [a], and [a#] (# represents a silence) are read from the database DB and connected to each other at Step S5. Then, at Step S6, deterministic component data that has a pitch corresponding to the input musical note pitch is generated for each voice synthesis unit. After the adding process at Step S7 and the converting process at Step S8, a singing voice signal of [saita] can be obtained.
According to the above-described prior art, the sense of unity between the deterministic component and the stochastic component tends to be unsatisfactory. That is, the singing voice tends to be heard as an artificial voice because the pitch of the voice signal input at Step S1 is converted at Step S6 to correspond to the input musical note pitch, and the stochastic component data is added at Step S7 to the deterministic component data with the converted pitch. For example, the stochastic component is heard as being split from the deterministic component in a section of a long sound such as [i] in singing [saita].
In order to deal with this tendency, the inventors of the present invention previously suggested adjusting the amplitude spectrum distribution in a lower frequency region that the stochastic component data represents according to the input musical note pitch (refer to Japanese Patent Application No. 2000-401041). However, even if the stochastic component data is adjusted in this way, it is not easy to completely suppress the splitting and resounding of the stochastic component.
Also, in the SMS technique, analysis of a voiced fricative or plosive sound is difficult, and there is a problem that the synthesized voice sounds very artificial. The SMS technique assumes that a voice signal consists of a deterministic component and a stochastic component, and it is a fundamental problem that a voice signal cannot be split into the deterministic component and the stochastic component as the SMS technique assumes.
On the other hand, the phase vocoder technique is explained in the specification of U.S. Pat. No. 3,360,610. In the phase vocoder technique, a signal was formerly represented by a filter bank, and more recently has been represented in the frequency domain as the result of an FFT of the input signal. Recently, the phase vocoder technique has been widely used for time-stretching (stretching or shortening the time axis without changing the original pitch), pitch-shifting (changing the pitch without changing the time length), and the like. In this kind of pitch changing technique, the result of the FFT of the input signal is not used as it is. It is well known that the pitch shift is executed by dividing the FFT spectrum into a plurality of spectrum distribution regions, each centered at a local peak, and then moving the spectrum distribution of each region along the frequency axis (for example, refer to J. Laroche and M. Dolson, “New Phase-Vocoder Techniques for Real-Time Pitch Shifting, Chorusing, Harmonizing, and Other Exotic Audio Modifications,” J. Audio Eng. Soc., Vol. 47, No. 11, 1999). However, the relevance of the pitch shifting technique to the singing voice synthesizing technique has not been clear.
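By way of a non-limiting illustration, the region-wise movement of the spectrum described above may be sketched as follows. This is a simplified sketch that operates on a magnitude spectrum only; the phase adjustment that a complete phase vocoder requires is omitted. The placement of region boundaries at the midpoint between neighboring peaks, the rounding of each shift to a whole number of bins, and the function names are all assumptions.

```python
def find_peaks(mag):
    """Return the indices of local maxima of a magnitude spectrum."""
    return [k for k in range(1, len(mag) - 1)
            if mag[k] > mag[k - 1] and mag[k] >= mag[k + 1]]

def shift_regions(mag, ratio):
    """Move each peak-centered region so that its peak lands near peak * ratio."""
    peaks = find_peaks(mag)
    out = [0.0] * len(mag)
    if not peaks:
        return out
    # Assumption: region boundaries lie midway between neighbouring peaks.
    bounds = [0] + [(a + b) // 2 + 1 for a, b in zip(peaks, peaks[1:])] + [len(mag)]
    for i, p in enumerate(peaks):
        shift = round(p * ratio) - p  # whole-bin shift for this region
        for k in range(bounds[i], bounds[i + 1]):
            j = k + shift
            if 0 <= j < len(mag):  # discard bins moved outside the spectrum
                out[j] += mag[k]
    return out

# Example: two peaks at bins 2 and 6, shifted by a pitch ratio of 1.5.
spectrum = [0.0, 0.5, 1.0, 0.5, 0.0, 0.4, 0.8, 0.4] + [0.0] * 8
shifted_spectrum = shift_regions(spectrum, 1.5)
```

Each region keeps its local shape around the peak while the peak itself moves in proportion to its frequency, which is why the regions are shifted by different amounts.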