The present invention relates to a singing synthesis parameter data estimation system, a singing synthesis parameter data estimation method, and a singing synthesis parameter data estimation program that automatically estimate singing synthesis parameter data from an audio signal of a user's input singing voice, for example, in order to support music production which uses singing synthesis.
Various studies have been conducted on generating a human-like singing voice with computer-based singing synthesis technology. Nonpatent Documents 1 through 3 listed below disclose methods of concatenating sampled elements (waveforms) of an audio signal of input singing voice. Nonpatent Document 4 listed below discloses a method of modeling an audio signal of singing voice to perform synthesis (HMM synthesis). Nonpatent Documents 5 through 7 listed below disclose studies on analyzing an audio signal of read speech and synthesizing an audio signal of singing voice from it. The studies described in Nonpatent Documents 5 through 7 aim at high-quality singing synthesis that preserves the user's voice timbre. As a result of these studies, synthesis of a human-like singing voice is becoming possible, and some of the results have been commercialized, namely the singing synthesis system “Vocaloid” (trademark) of Nonpatent Document 3 and the singing synthesis software of Nonpatent Document 8 listed below.
When the user utilizes these related arts, an interface is needed that receives lyric data, musical score information that specifies a song, and a singing expression describing “how the song is sung.” In the arts of Nonpatent Documents 2 through 4, lyric data and musical score information (on a pitch, a pronunciation onset time, and a sound duration) are needed. In the art of Nonpatent Document 9 listed below, only lyric data is supplied to a singing synthesis system. In the arts of Nonpatent Documents 5 through 7, an audio signal of read speech, lyric data, and musical score information are supplied to a singing synthesis system. In the art of Nonpatent Document 10 listed below, an audio signal of input singing voice and lyric data are supplied to a singing synthesis system. As for the singing expression, in the arts of Nonpatent Documents 2 and 3, the user adjusts the parameters on the singing expression among the parameters supplied to a singing synthesis system. In the arts of Nonpatent Documents 4 and 6, the way of singing, or singing style, is modeled in advance. In the method described in Nonpatent Document 7, a musical symbol (for crescendo or the like) is supplied to the singing synthesis system. In the method of Nonpatent Document 10, a parameter on the singing expression is extracted from an audio signal of input singing voice.
However, none of the related arts can iteratively estimate the parameters or modify the pitch or the dynamics of an audio signal of input singing voice, even when such an audio signal can be supplied as an input. In the singing synthesis system “Vocaloid” (trademark) manufactured and sold by Yamaha Corporation, the user supplies lyric information and musical score information to the system using a piano roll score editor and manipulates parameters for adding expressive effects, thereby synthesizing a singing voice.
Fine adjustment of the parameters for adding expressive effects is needed in order to obtain a more natural or a more individualistic singing voice. However, depending on the skill of the user, it can be difficult to create the singing voice that the user desires. Further, when a condition for singing synthesis (such as the singing synthesis system or the sound source data of the singing synthesis system) differs, the parameter data constituting the singing voice needs to be adjusted again.
Nonpatent Document 10 proposes a method of extracting features such as a pitch, dynamics, and vibrato information (on a vibrato extent and a vibrato frequency) upon reception of the audio signal of input singing voice and the lyric data, and supplying the extracted features as singing synthesis parameters. In the art described in Nonpatent Document 10, it is assumed that the singing synthesis parameter data thus obtained is edited by the user on the score editor of the singing synthesis system. However, neither using the features such as the pitch extracted from the audio signal of input singing voice as the singing synthesis parameters without alteration, nor editing them with the existing editor of the singing synthesis system, can accommodate a change in singing synthesis conditions.
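As a rough illustration of what extracting a pitch and dynamics from an audio signal can involve, the following minimal sketch computes both for a single analysis frame. It is not the method of Nonpatent Document 10. The function names, the use of root-mean-square amplitude as a stand-in for dynamics, the plain autocorrelation pitch estimate, and the 80 Hz to 800 Hz search range are all assumptions made for illustration only.

```python
import math


def frame_dynamics(frame):
    """Root-mean-square amplitude of one frame (assumed proxy for dynamics)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def frame_pitch(frame, sample_rate, f_min=80.0, f_max=800.0):
    """Estimate the fundamental frequency in Hz of one frame.

    Picks the lag with the largest positive autocorrelation inside the
    lag range that corresponds to [f_min, f_max]. Returns None when no
    positive autocorrelation value is found in that range.
    """
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    if best_lag is None:
        return None
    return sample_rate / best_lag
```

Applying these two functions frame by frame yields a pitch contour and a dynamics contour. A vibrato extent and a vibrato frequency could then be estimated from the periodic fluctuation of the pitch contour, which this sketch does not attempt.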
In the art described in Nonpatent Document 10, the pronunciation onset time and the sound duration of each syllable of the lyrics are determined automatically (hereinafter referred to as lyric alignment) by the Viterbi alignment used in speech recognition technology. In order to obtain high-quality synthesized sounds, the lyric alignment must have almost 100 percent accuracy. However, it is difficult to obtain such high accuracy with the Viterbi alignment alone. Further, the results of the lyric alignment do not completely match the synthesized sounds that are output. However, no conventional art has addressed this mismatch.
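The general idea of a Viterbi alignment can be sketched as follows, assuming that a frame-by-syllable table of log-likelihoods is already available from some acoustic model. The function name `viterbi_align` and the table `log_lik` are hypothetical, and the sketch is a generic left-to-right forced alignment, not the implementation of Nonpatent Document 10. Each syllable is assumed to be sung in order and to occupy at least one frame.

```python
import math


def viterbi_align(log_lik):
    """Left-to-right forced alignment of frames to syllables.

    log_lik[t][s] is the (assumed) log-likelihood that frame t belongs
    to syllable s. Returns one (onset_frame, duration_in_frames) pair
    per syllable, in lyric order.
    """
    n_frames = len(log_lik)
    n_syl = len(log_lik[0])
    if n_frames < n_syl:
        raise ValueError("need at least one frame per syllable")

    neg_inf = -math.inf
    score = [[neg_inf] * n_syl for _ in range(n_frames)]
    advanced = [[False] * n_syl for _ in range(n_frames)]
    score[0][0] = log_lik[0][0]  # the path must start in the first syllable

    for t in range(1, n_frames):
        for s in range(n_syl):
            stay = score[t - 1][s]
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best = advance
                advanced[t][s] = True  # entered syllable s at frame t
            else:
                best = stay
            if best > neg_inf:
                score[t][s] = best + log_lik[t][s]

    # Backtrack from the last frame, which must lie in the last syllable.
    path = [0] * n_frames
    s = n_syl - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = s
        if advanced[t][s]:
            s -= 1

    # Collapse the frame-level path into (onset, duration) per syllable.
    segments = []
    onset = 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            segments.append((onset, t - onset))
            onset = t
    return segments
```

Multiplying the returned frame indices by the frame shift gives the pronunciation onset time and the sound duration in seconds. Because the result depends entirely on the quality of the log-likelihoods, a single poorly modeled syllable shifts its boundaries, which is consistent with the difficulty noted above of reaching almost 100 percent accuracy with the Viterbi alignment alone.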
Incidentally, the documents of the related arts are as follows:
[Nonpatent Document 1]    J. Bonada et al.: “Synthesis of the Singing Voice by Performance Sampling and Spectral Models,” In IEEE Signal Processing Magazine, Vol. 24, Iss. 2, pp. 67-79, 2007.
[Nonpatent Document 2]    Yuki Yoshida et al.: “Singing Synthesis System: CyberSingers,” IPSJ SIG Technical Report 99-SLP-25-8, pp. 35-40, 1998.
[Nonpatent Document 3]    Hideki Kenmochi et al.: “Singing Synthesis System “VOCALOID” Current Situation and Todo lists,” IPSJ SIG Technical Report 2008-MUS-74-9, pp. 51-58, 2008.
[Nonpatent Document 4]    Shinji Sako et al.: “A Trainable Singing Voice Synthesis System Capable of Representing Personal Characteristics and Singing Styles,” IPSJ SIG Technical Report 2008-MUS-74-7, pp. 39-44, 2008.
[Nonpatent Document 5]    Hideki Kawahara et al.: “Scat Generation Research Program Based on STRAIGHT, a High-quality Speech Analysis, Modification and Synthesis System,” Transactions of Information Processing Society of Japan, Vol. 43, No. 2, pp. 208-218, 2002.
[Nonpatent Document 6]    Takeshi Saitou et al.: “SingBySpeaking: Singing Voice Conversion System from Speaking Voice By Controlling Acoustic Features Affecting Singing Voice Perception,” IPSJ SIG Technical Report 2008-MUS-74-5, pp. 25-32, 2008.
[Nonpatent Document 7]    Tsuyoshi Moriyama et al.: “Transformation of Reading to Singing with Favorite Style,” IPSJ SIG Technical Report 2008-MUS-74-6, pp. 33-38, 2008.
[Nonpatent Document 8]    NTT-AT Wonderhorn (http://www.ntt-at.co.jp/product/wonderhorn/)
[Nonpatent Document 9]    Yuichiro Yonebayashi et al.: “A Web-based System for Automatic Song Composition Using the Lyric Prosody,” Interaction 2008, pp. 27-28, 2008.
[Nonpatent Document 10]    J. Janer et al.: “Performance-Driven Control for Sample-Based Singing Voice Synthesis,” In DAFx-06, pp. 42-44, 2006.