The present invention relates to a singing synthesis technique for synthesizing singing voices (human voices) in accordance with score data representative of a musical score of a singing music piece.
Voice synthesis techniques, such as techniques for synthesizing singing voices and text-reading voices, are getting more and more prevalent these days, and the voice synthesis techniques are broadly classified into one based on a voice segment connection scheme and one using voice models based on a statistical scheme. In the voice synthesis technique based on the voice segment connection scheme, segment data indicative of respective waveforms of a multiplicity of phonemes are prestored in a database, and voice synthesis is performed in the following manner. Namely, segment data corresponding to phonemes, constituting voices to be synthesized, are read out from the database in order in which the phonemes are arranged, and the read-out segment data are interconnected after pitch conversion etc. are performed on the segment data. Many of the voice synthesis techniques in ordinary practical use today are based on the voice segment connection scheme. Among examples of the voice synthesis technique using voice models is one using a Hidden Markov Model (hereinafter referred to as “HMM”). The Hidden Markov Model (HMM) is indented to model a voice on the basis of probabilistic transition between a plurality of states (sound sources). More specifically, each of the states, constituting the HMM, outputs a character amount indicative of its specific acoustic characteristics (e.g., fundamental frequency, spectrum, or characteristic vector comprising these elements), and voice modeling is implemented by determining, by use of the Baum-Welch algorithm or the like, an output probability distribution of character amounts in the individual states and state transition probability in such a manner that variation over time in acoustic character of the voice to be modeled can be reproduced with the highest probability. The voice synthesis using the HMM can be outlined as follows.
The voice synthesis technique using the HMM is based on the premise that variation over time in acoustic character is modeled for each of a plurality of kinds of phonemes through machine learning and then stored into a database. The following describe the above-mentioned modeling using the HMM and subsequent databasing, in relation to a case where a fundamental frequency is used as the character amount indicative of the acoustic character. First, each of a plurality kinds of voices to be learned is segmented on a phoneme-by-phoneme basis, and a pitch curve indicative of variation over time in fundamental frequency of the individual phonemes is generated. Then, for each of the phonemes, an HMM representing the pitch curve with the highest probability is identified through machine learning using the Baum-Welch algorithm or the like. Then, model parameters defining the HMM (HMM parameters) are stored into a database in association with an identifier indicative of one or more phonemes whose variation over time in fundamental frequency is represented by the HMM. This is because, even for different phonemes, characteristics of variation over time fundamental frequency may sometimes be represented by a same HMM. Doing so can achieve a reduced size of the database. Note that the HMM parameters include data indicative of characteristics of a probability distribution defining appearance probabilities of output frequencies of states constituting the HMM (e.g., average value and distribution of the output frequencies, and average value and distribution of change rates (first- or second-order differentiation)) and data indicative of state transition probabilities.
In a voice synthesis process, on the other hand, HMM parameters corresponding to individual phonemes constituting human voices to be synthesized are read out from the database, and a state transition that may appear with the highest probability in accordance with an HMM represented by the read-out HMM parameters and output frequencies of the individual states are identified in accordance with a maximum likelihood estimation algorithm (such as the Viterbi algorithm). A time series of fundamental frequencies (i.e., pitch curve) of the to-be-synthesized voices is represented by a time series of the frequencies identified in the aforementioned manner. Then, control is performed on a sound source (e.g., sine wave generator) so that the sound source outputs a sound signal whose fundamental frequency varies in accordance with the pitch curve, after which a filter process dependent on the phonemes (e.g., a filter process for reproducing spectra or cepstrum of the phonemes) is performed on the sound signal. In this way, the voice synthesis is completed. In many cases, such a voice synthesis technique using HMMs have been used for synthesis of read voices (as disclosed for example in Japanese Patent Application Laid-open Publication No. 2002-268,660). However, in recent years, it has been proposed that the voice synthesis technique for singing synthesis (see, for example, “Trainable Singing Voice Synthesis System Capable of Representing Personal Characteristics and Singing Style”, by Sako Shinji, Saino keijiro, Nankaku Yoshihiko and Tokuda Keiichi, in a study report “Musical Information Science” of Information Processing Society of Japan, 2008(12), pp. 39-44 20080208, which will hereinafter be referred to as “Non-patent Literature 1”). In order to synthesize natural singing voices through singing synthesis based on the segment connection scheme, there is a need to database a multiplicity of segment data for each of voice characters (e.g., high clean voice, husky voice, etc.) of singing persons. However, with the voice synthesis technique using HMMs, data indicative of a probability density distribution for generating data of character amounts are retained or stored instead of all of character amounts being stored as data, and thus, such a synthesis technique is suited to be incorporated into small-size electronic equipment, such as portable game machines and portable phones.
In the case where text-reading voices are to be synthesized using HMMs, it is conventional to model a voice using a phoneme as a minimum component unit of a model and taking into account a context, such as an accent type, part of speech and arrangement of preceding and succeeding phonemes; such modeling will hereinafter referred to as “context-dependent modeling”. This is because, even for a same phoneme, a manner of variation over time in acoustic character of the phoneme can differ if the context differs. Thus, in performing singing synthesis by use of HMMs too, it is considered preferable to perform context-dependent modeling. However, in singing voices, variation over time in fundamental frequency representative of a melody of a music piece is considered to occur independently of a context of phonemes constituting lyrics, and it is considered that a singing expression unique to a singing person appears in such variation over time in fundamental frequency (namely, melody singing style). In order to synthesize singing voices that accurately reflect therein a singing expression unique to a singing person in question and that sound more natural, it is considered necessary to accurately model the variation over time in fundamental frequency that is independent of the context of phonemes constituting lyrics. However, it is hard to say that the framework of the conventionally-known technique, where the modeling is performed using phonemes as minimum component units of a model, can appropriately model variation over time in fundamental frequency based on a singing expression that straddles across a plurality of phonemes.