This disclosure relates to voice synthesis technology and, more particularly, to real-time voice synthesis technology.
Voice synthesis technology is in widespread use in which a voice signal representing a guidance voice for voice guidance, a voice reading a literary work, a singing voice or the like is synthesized by electric signal processing using a plurality of kinds of synthesis information. For example, in singing voice synthesis, musical expression information is used as the synthesis information: information representing the pitches and durations of the musical notes that constitute the melody of the song to be synthesized, and information representing the phoneme sequences of the lyrics uttered in time with those musical notes. In the synthesis of a voice signal for a guidance voice or a reading voice, the synthesis information consists of information representing the phonemes of the guidance sentence or of the sentence of the literary work and information representing changes in prosody such as intonation and accent. Conventionally, voice synthesis of this kind has commonly used a so-called batch processing method, in which all of the synthesis information for the entire voice to be synthesized is input to a voice synthesizing apparatus in advance and a voice signal representing the waveform of the entire voice is generated in one batch from that information. In recent years, however, real-time voice synthesis technology has also been proposed (see, for example, JP-B-3879402).
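As an illustration of the batch processing method for singing voice synthesis, the following minimal sketch shows one way the synthesis information might be represented and consumed. The names `NoteInfo` and `batch_plan`, the use of integer note numbers for pitch, and the millisecond time unit are all assumptions made for this example; they do not appear in the cited art. The function does not generate audio; it only lays out, in one pass over the complete score, when each portion of the lyrics would be uttered.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class NoteInfo:
    """Synthesis information for one musical note (hypothetical layout)."""
    pitch: int            # pitch as an integer note number (assumption)
    duration_ms: int      # duration of the note in milliseconds (assumption)
    phonemes: List[str]   # phoneme sequence uttered in time with the note


def batch_plan(score: List[NoteInfo]) -> List[Tuple[int, int, List[str]]]:
    """Lay out the whole voice in one batch.

    The entire score must be available before this function is called,
    which is the defining property of the batch processing method.
    Returns (start_ms, pitch, phonemes) for each note in order.
    """
    plan = []
    start_ms = 0
    for note in score:
        plan.append((start_ms, note.pitch, note.phonemes))
        start_ms += note.duration_ms
    return plan


# Two notes, each carrying a consonant-vowel portion of the lyrics.
score = [
    NoteInfo(pitch=60, duration_ms=500, phonemes=["s", "a"]),
    NoteInfo(pitch=62, duration_ms=500, phonemes=["k", "a"]),
]
plan = batch_plan(score)
```

Because `batch_plan` needs the complete `score` up front, it cannot respond to pitches that a user specifies while the voice is being output, which is the limitation that real-time voice synthesis addresses.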
One example of real-time voice synthesis is a technology that synthesizes a singing voice by inputting information representing the phoneme sequence of the lyrics of the entire song to a singing voice synthesizing apparatus in advance and then sequentially specifying the pitch and the like at which the lyrics are uttered by operating a keyboard resembling a piano keyboard. In recent years, it has also been proposed to perform singing voice synthesis in units of musical notes by using a singing voice synthesis keyboard on which a phoneme information input portion and a musical note information input portion are arranged side by side. The phoneme information input portion carries manipulating members for inputting the phonemes (consonants and vowels) that constitute the phoneme sequence of the lyrics, and the musical note information input portion resembles a piano keyboard. With this keyboard, the user sequentially inputs, for each musical note, musical note information representing the pitch and phoneme sequence information representing the phoneme sequence of the portion of the lyrics uttered in time with that musical note.
When information representing the phoneme sequence of the lyrics of the entire song is stored in a singing voice synthesizing apparatus in advance and real-time singing voice synthesis is then performed, the synthesized singing voice sometimes sounds faltering and unnatural, as if the lyrics were uttered later than the timing indicated on the musical score. Such a falter occurs for the following reason.
FIG. 5A is a view showing an example of the utterance timing of each phoneme when a person sings a portion of lyrics constituted by a consonant and a vowel in time with a musical note. In FIG. 5A, the musical note is represented by a rectangle N shown on the staff, and the portion of the lyrics sung in time with the musical note is shown in the rectangle. The symbol # in FIGS. 5A and 5B represents a silence; the same applies in FIG. 3. As shown in FIG. 5A, when a person sings a portion of lyrics constituted by a consonant and a vowel in time with a musical note, the person typically starts uttering the portion at time T0, which precedes time T1 corresponding to the utterance timing on the musical score, and utters the boundary between the consonant and the vowel at time T1.
Likewise, in real-time singing voice synthesis using a keyboard resembling a piano keyboard, as shown in FIG. 5B, the user commonly starts to depress a key K for specifying the pitch with a finger F at time T0, which precedes the position of the musical note on the musical score, and fully depresses the key K at time T1. However, a keyboard of this kind is generally structured to output the information representing the pitch (or the information representing the pitch together with information representing the velocity corresponding to the key depression speed) at the point in time when the key is fully depressed, so the information representing the pitch is actually output only at time T1. The singing voice synthesizing apparatus, on the other hand, does not start singing voice synthesis until it has acquired both the phoneme sequence information and the information representing the pitch. Consequently, even if the time required for the synthesis processing is short enough to be ignored, output of the singing voice starts at time T1, and the time lag (T1-T0) between the start of depression of the key K and its full depression appears as the above-mentioned falter. The same falter occurs when singing voice synthesis is performed by having the user sequentially input a portion of the lyrics and the pitch for each musical note, and when a guidance voice or a reading voice is synthesized.
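The origin of the time lag (T1-T0) can be illustrated with the following minimal timing sketch. It is a model of the conventional behavior described above, not an implementation of any actual apparatus. The names `KeyPress`, `synthesis_onset` and `falter`, the integer millisecond time unit, and the example values are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class KeyPress:
    """Timing of one key depression, in milliseconds (hypothetical model)."""
    t_start: int  # T0: the key K starts to be depressed
    t_full: int   # T1: the key K is fully depressed; pitch is output here


def synthesis_onset(press: KeyPress, phoneme_ready_at: int,
                    processing_time: int = 0) -> int:
    """Time at which output of the voice starts in the conventional scheme.

    Synthesis cannot start until both the phoneme sequence information
    and the pitch information have been acquired, and the keyboard
    outputs the pitch only when the key is fully depressed (T1).
    """
    pitch_ready_at = press.t_full
    return max(phoneme_ready_at, pitch_ready_at) + processing_time


def falter(press: KeyPress, phoneme_ready_at: int,
           processing_time: int = 0) -> int:
    """Delay of the voice onset relative to T0, the time at which a
    person singing would have started uttering the consonant."""
    onset = synthesis_onset(press, phoneme_ready_at, processing_time)
    return onset - press.t_start


# Lyrics stored in advance (ready at time 0); key travel takes 50 ms.
press = KeyPress(t_start=1000, t_full=1050)
lag = falter(press, phoneme_ready_at=0)
```

In this model, with the phoneme sequence stored in advance and the processing time ignored, `lag` equals `press.t_full - press.t_start`, that is, T1-T0. Shortening the processing time cannot reduce the lag further, because the pitch information itself does not exist before time T1.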
The present disclosure has been made in view of the above-mentioned problem, and an object thereof is to provide a technology that enables real-time synthesis of a natural, unfaltering voice.