1. Field of the Invention
The present invention relates to a singing voice synthesizing device for synthesizing a singing voice according to music information and word information.
2. Description of the Related Art
Chorus synthesizing devices have already been developed to synthesize a singing voice and then generate a chorus from synthesized singing voices by inputting words of a song and note information put down onto a musical score corresponding to the words. Described below are the related conventional technologies.
FIG. 1 shows an example of a musical score for a mixed 4-voice-part chorus. FIG. 2 shows the music information and the word information generated from the musical score shown in FIG. 1. The music information and the word information contain the information for four voice parts, that is, soprano, alto, tenor, and bass. The music information is entered in the description language called "MML (Music Macro Language)" for use in a music performance through a personal computer. For example, the pitch of C is represented by C, D by D, E by E, F by F, G by G, A by A, and B by B. The middle octave is specified by O, and higher and lower octaves are represented by > and < respectively. The timing is indicated by "8" for an eighth note, "2" for a half note, and "4" for a quarter note. Furthermore, it is indicated by "8." for a dotted eighth note, "4." for a dotted quarter note, and "2." for a dotted half note. The basic note is specified by "L" and the description of the timing can be omitted unless otherwise specified. For example, line 2 in FIG. 2 indicates "L8" to specify an eighth note as a basic note, and the description of "8" for the eighth note can be omitted afterwards. A sharp symbol is represented by "#" or "+", a flat symbol by "-", and a tie symbol by "&".
Thus, note data are generated according to a musical score by appropriately combining the above listed rules. For example, with "L8" specifying an eighth note as the basic note, an eighth note at Do is represented by "C", a quarter note at flatted Re is represented by "D-4", and a dotted half note at sharped Mi is represented by "E#2.". As for word information, words of a song are provided for corresponding notes.
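The notation rules above can be illustrated by a minimal Python sketch that decodes a single note token. The function name `parse_note`, the returned tuple layout, and the use of fractions of a whole note are assumptions made for illustration only; they are not taken from the described device. Octave and tie symbols are not handled.

```python
import re
from fractions import Fraction

# One note token: pitch letter, optional accidental, optional length, optional dot.
# Hypothetical helper for illustration; not part of the described device.
_NOTE = re.compile(r"^([A-G])([#+-]?)(\d*)(\.?)$")
_ACCIDENTAL = {"#": "sharp", "+": "sharp", "-": "flat", "": "natural"}

def parse_note(token, basic_note=8):
    """Return (pitch_name, accidental, length as a fraction of a whole note)."""
    m = _NOTE.match(token)
    if m is None:
        raise ValueError(f"not a note token: {token!r}")
    letter, accidental, digits, dot = m.groups()
    # When the length is omitted, the basic note set by "L" applies.
    denominator = int(digits) if digits else basic_note
    length = Fraction(1, denominator)
    if dot:
        length *= Fraction(3, 2)  # a dotted note lasts 1.5 times as long
    return letter, _ACCIDENTAL[accidental], length
```

With the basic note set to an eighth note, `parse_note("D-4")` yields a flatted D lasting a quarter of a whole note.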
FIG. 3 shows the voice part for soprano extracted from the music and word information shown in FIG. 2. Likewise, the other voice parts alto, tenor, and bass can be extracted from the entire music and word information.
FIG. 4 shows phonetic symbols generated from the word information for soprano shown in FIG. 3. A phonetic symbol represents vowels or consonants of a voice sound separately.
FIG. 5 shows timing information generated from the music information of each voice part as shown in FIG. 3 and the phonetic symbols shown in FIG. 4. In the case of the song shown in FIG. 1, the tempo 110 gives a quarter note a length of 60/110 second, that is, approximately 545 ms, based on which the timing is determined for the song. According to the timing information shown in FIG. 5, the first data "Q 272" indicates 272 ms for an eighth note, equal to a half of the 545 ms for a quarter note. The next "l 16" indicates 16 ms for the consonant "l" of the word "Let's". Then, "e 156" indicates 156 ms for the vowel "e" of the word "Let's", and the next "ts 100" indicates 100 ms for the consonant "ts" of that word. The word "Let's" is assigned an eighth note according to the music information, and is therefore assigned 272 ms as the total of the vowel and the consonants. Thus, the timing information is obtained from the music information and provided for each phonetic symbol.
FIG. 6 shows the general configuration of the conventional singing voice signal generating device.
In FIG. 6, the music and word information as shown in FIG. 2 is input to a music/word input unit 1. A voice part extracting unit 2 extracts each voice part from the music and word information (FIG. 3 shows the information for soprano, and information can be extracted also for alto, tenor, and bass). The music and word information for each voice part is input to a corresponding singing voice signal synthesizing unit 3a, 3b, or 3c (although three singing voice signal synthesizing units are shown in FIG. 6, any number of required voice parts is actually accepted). A singing voice signal of each voice part is generated by singing voice signal synthesizing units 3a, 3b, and 3c. Each of the generated singing voice signals is applied to a chorus signal generating unit 4 for generating a chorus signal. The chorus signal generated by the chorus signal generating unit 4 is converted to an analog signal by a D/A converter not shown in FIG. 6, and then output as a chorus from a singing voice output unit 5 (for example, a speaker through an amplifier).
FIG. 7 shows in detail the configuration of the singing voice signal synthesizing unit 3. The singing voice signal synthesizing unit 3 comprises a rhythm information generating unit 31 and a singing voice signal generating unit 32.
FIG. 8 shows in detail the configuration of the rhythm information generating unit 31. The rhythm information generating unit 31 comprises a phonetic symbol generating unit 311, a note length generating unit 312, a pitch information generating unit 313, and a loudness information generating unit 314. The phonetic symbol generating unit 311 represents each word of a song by phonetic symbols according to the word information, and divides the voice sound into vowels and consonants as shown in FIG. 4. The note length generating unit 312 generates a phoneme length based on music information and phonetic symbols as shown in FIG. 5.
Described below is the operation of generating note length information and a phoneme length by referring to the operational flowchart shown in FIG. 13.
1) First, a tempo symbol is extracted from music information. A tempo symbol represents the tempo of a performance. "T110" in line 1 of the music information shown in FIG. 2 indicates that the performance is given at the tempo of 110 quarter notes per minute. That is, the length of the quarter note is 60/110 second, or approximately 545 ms (step S 101).
2) Next, a note is checked in the music information. A note indicates the length in music information. For example, a quarter note, a dotted half note, etc. are commonly used (step S 102).
3) Then, the relative length of a note in the music information is generated. For example, if the basic note given by the tempo symbol is a quarter note, an eighth note has half the length of the basic note, and a half note has double the length of the basic note (step S 103).
4) The note length is obtained from the relative length of the note. Since the basic note is a quarter note of 545 ms, an eighth note indicates 272 ms, and a half note indicates 1090 ms (step S 104).
5) The timing of phonemes is generated from a generated note length. The length of a consonant and a vowel is generated according to predetermined rules. A note length is obtained by adding the length of a vowel and that of a consonant. For example, an eighth note for the word "Let's" is set to 16 ms for the consonant "l", 156 ms for the vowel "e", and 100 ms for "ts", that is, a total of 272 ms (step S 105).
The length of a phoneme of vowels, consonants, etc. can be obtained from the music information and the word information by repeating the above described processes. Then, the information is stored.
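Steps S 101 through S 105 can be sketched in Python as follows. This is an illustration under stated assumptions, not the device's actual rules: the predetermined phoneme rules are not given in the description, so the sketch assumes fixed consonant lengths (only the values for "l" and "ts" come from the example) and lets a single vowel take the remaining time. Integer truncation is assumed because it reproduces the 545 ms and 272 ms figures used above.

```python
# Hypothetical consonant lengths in ms; only "l" and "ts" come from the example.
CONSONANT_MS = {"l": 16, "ts": 100}

def quarter_note_ms(tempo):
    """Step S 101: tempo in quarter notes per minute -> quarter note length."""
    return 60_000 // tempo

def note_ms(tempo, denominator, dotted=False):
    """Steps S 102 to S 104: note length from its length relative to a quarter note."""
    ms = quarter_note_ms(tempo) * 4 // denominator
    return ms * 3 // 2 if dotted else ms

def phoneme_lengths(phonemes, total_ms):
    """Step S 105 (assumed rule): consonants are fixed, the vowel takes the rest."""
    fixed = sum(CONSONANT_MS[p] for p in phonemes if p in CONSONANT_MS)
    vowels = [p for p in phonemes if p not in CONSONANT_MS]
    if len(vowels) != 1:
        raise ValueError("this sketch expects exactly one vowel per note")
    return [(p, CONSONANT_MS.get(p, total_ms - fixed)) for p in phonemes]
```

For the word "Let's" on an eighth note at tempo 110, `phoneme_lengths(["l", "e", "ts"], note_ms(110, 8))` divides the 272 ms into 16 ms, 156 ms, and 100 ms.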
Next, FIG. 9 shows the configuration of the pitch information generating unit 313. In FIG. 9, the pitch information generating unit 313 comprises a basic pitch generating unit 3131, a portamento generating unit 3132, and a vibrato generating unit 3133.
Described below is the operation of the basic pitch generating unit 3131 by referring to the operational flowchart shown in FIG. 14.
1) First, the name of a musical pitch is extracted from the music information shown in FIG. 2, and a fundamental frequency is uniquely obtained using the name of the musical pitch (step S 201).
2) A fundamental frequency is obtained using a pitch name. A fundamental frequency corresponding to each pitch name in music information is preliminarily set in a conversion table, and a fundamental frequency corresponding to a pitch name is selected (step S 202).
3) According to a note length generated by the note length generating unit 312, a fundamental frequency pattern is generated for the length (step S 203).
The frequency pattern generated by repeating the above described processes according to music information is shown in FIG. 12A as a fundamental frequency pattern. Since each fundamental frequency discontinuously changes at this stage, the synthesized chorus sounds mechanical and unnatural as is.
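The generation of the stepwise fundamental frequency pattern can be sketched as follows. The conversion table contents are not given in the description; the sketch assumes equal temperament with A at 440 Hz, ignores octave and accidental symbols, and represents the pattern as one frequency value per 10 ms frame. All names and values are hypothetical.

```python
# Semitone offsets from A within one octave (equal temperament is an assumption).
SEMITONES_FROM_A = {"C": -9, "D": -7, "E": -5, "F": -4, "G": -2, "A": 0, "B": 2}

def fundamental_hz(pitch_name, a_hz=440.0):
    """Steps S 201 and S 202: look up the frequency for a pitch name."""
    return a_hz * 2.0 ** (SEMITONES_FROM_A[pitch_name] / 12.0)

def step_pattern(notes, frame_ms=10):
    """Step S 203: hold each frequency for the note length.

    notes is a list of (pitch_name, length_ms); the result has one value per frame.
    """
    pattern = []
    for pitch_name, length_ms in notes:
        pattern.extend([fundamental_hz(pitch_name)] * (length_ms // frame_ms))
    return pattern
```

The resulting list changes value abruptly at each note boundary, which corresponds to the discontinuous pattern of FIG. 12A.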
Therefore, the portamento generating unit 3132 shown in FIG. 9 adjusts the fundamental frequency pattern shown in FIG. 12A into the one shown in FIG. 12B by adding a kind of portamento (a smooth movement from one sound to another sound having a different pitch), so that the discontinuous portions in the fundamental frequency pattern generated by the basic pitch generating unit 3131 are adjusted into a continuous pattern and the fundamental frequency forms a smooth line.
FIG. 10 shows the configuration of the portamento generating unit 3132. The portamento generating unit 3132 comprises a portamento parameter 31321, portamento generation rules 31322, and a portamento processing unit 31323.
Described below is the operation of adding a portamento by the portamento processing unit 31323 by referring to the operational flowchart shown in FIG. 15.
1) First, it is determined whether or not a change has been made to a fundamental frequency. A change in a fundamental frequency refers to a discontinued portion of a fundamental frequency pattern in FIG. 12A. A process terminates if no change has been made to a fundamental frequency, and proceeds to its next step if any change has been made to the fundamental frequency (step S 301).
2) The portamento parameter 31321 is retrieved. If a fundamental frequency is changed to another fundamental frequency, then parameters such as the degree of portamento and the time taken for adding portamento should be changed depending on the difference between the frequencies. The parameter is retrieved in this step (step S 302).
3) A section of portamento is obtained according to the portamento generation rules 31322. The portamento generation rules 31322 refer to predetermined rules such as functions. Using the portamento parameter retrieved in the previous step, the time taken for portamento before and after a change in a fundamental frequency is determined (step S 303).
4) A fundamental frequency for a portamento section is generated using the portamento generation rules 31322. A fundamental frequency can be obtained such that a smooth change can be made in the portamento section obtained in the previous step. Then, control is returned to step S 301 (step S 304).
FIG. 12B shows the fundamental frequency pattern obtained after adding portamento generated by repeating the above listed processes.
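The portamento steps S 301 through S 304 can be sketched as follows. The actual portamento generation rules 31322 are described only as predetermined rules such as functions, so this sketch assumes a linear ramp and a fixed section width on each side of the change; `add_portamento` and `half_width` are hypothetical names. Each note is assumed to be longer than the portamento section.

```python
def add_portamento(pattern, half_width=3):
    """Replace each frequency step with a linear ramp (assumed rule).

    pattern holds one fundamental frequency per frame; half_width is the
    number of frames of portamento before and after the change.
    """
    out = list(pattern)
    for i in range(1, len(pattern)):
        if pattern[i] == pattern[i - 1]:
            continue  # step S 301: no change in fundamental frequency here
        # Step S 303: portamento section around the change.
        start = max(0, i - half_width)
        end = min(len(pattern) - 1, i + half_width - 1)
        f_start, f_end = pattern[start], pattern[end]
        # Step S 304: frequencies that change smoothly across the section.
        for k in range(start, end + 1):
            out[k] = f_start + (f_end - f_start) * (k - start) / (end - start)
    return out
```

A smoother function, such as a raised cosine, could be substituted for the linear ramp without changing the structure of the loop.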
Next, vibrato is added as follows to the fundamental frequency pattern including the portamento as described above.
FIG. 11 shows the configuration of the vibrato generating unit 3133. The vibrato generating unit 3133 comprises a vibrato parameter 31331, vibrato generation rules 31332, and a vibrato processing unit 31333.
The operation of the vibrato processing unit 31333 is described below by referring to the operational flowchart shown in FIG. 16.
1) It is determined whether or not there is a section in which a fundamental frequency indicates a constant value. If no, the process terminates. If yes, control is passed to the next step S 402 (step S 401).
2) It is determined whether or not the constant section length is larger than a predetermined threshold length. If yes, control is passed to the next step. If no, control is returned to step S 401 (step S 402).
3) The vibrato parameter 31331 is retrieved. Vibrato is a frequency modulation that periodically modulates a constant fundamental frequency at a rate of a few hertz, and the vibrato parameter specifies the modulation frequency, the amplitude of the modulation signal, etc. (step S 403).
4) A vibrato signal is generated according to the vibrato generation rules 31332. The vibrato generation rules 31332 regulate the vibrato signal used in adding vibrato, that is, its modulation frequency, the amplitude of the modulation signal, etc. (step S 404).
5) Thus, vibrato is added to a constant fundamental frequency according to the vibrato signal, that is, the modulation signal. Then, control is returned to step S 401 after the adding process (step S 405).
By repeating the above listed processes, the fundamental frequency pattern provided with portamento as shown in FIG. 12B is further provided with vibrato to form a fundamental frequency pattern shown in FIG. 12C.
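The vibrato steps S 401 through S 405 can be sketched as follows. The sinusoidal modulation, the 10 ms frame, and the default threshold, rate, and depth are assumptions chosen for illustration; the description states only that the modulation frequency is a few hertz.

```python
import math

def add_vibrato(pattern, frame_ms=10, min_frames=20, rate_hz=5.0, depth_hz=3.0):
    """Add sinusoidal vibrato to sufficiently long constant sections."""
    out = list(pattern)
    i = 0
    while i < len(pattern):
        # Step S 401: find the end of the constant section starting at frame i.
        j = i
        while j + 1 < len(pattern) and pattern[j + 1] == pattern[i]:
            j += 1
        # Step S 402: only sections longer than the threshold receive vibrato.
        if j - i + 1 > min_frames:
            for k in range(i, j + 1):
                t = (k - i) * frame_ms / 1000.0
                # Steps S 403 to S 405: modulate with the assumed rate and depth.
                out[k] = pattern[k] + depth_hz * math.sin(2 * math.pi * rate_hz * t)
        i = j + 1
    return out
```

Because the same rate and depth are applied to every section, the result is a regular frequency modulation of the kind described above.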
The loudness information generating operation of the loudness information generating unit 314 shown in FIG. 8 is explained below by referring to the operational flowchart shown in FIG. 17.
1) A loudness symbol indicates the intensity of sound such as piano, forte, etc., and is retrieved from music information (step S 501).
2) The loudness adjustment amount corresponding to the retrieved loudness symbol is retrieved from a conversion table (step S 502).
3) The loudness adjustment start timing and the time taken for the adjustment are retrieved from music information. Then, the loudness adjustment amount obtained in the previous step is added to or subtracted from a reference loudness for that time (step S 503).
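The loudness steps S 501 through S 503 can be sketched as follows. The contents of the conversion table are not given in the description, so the decibel values, the reference loudness, and the frame-based representation below are all hypothetical.

```python
# Hypothetical conversion table: loudness symbol -> adjustment amount in dB.
LOUDNESS_ADJUST_DB = {"pp": -12.0, "p": -6.0, "mf": 0.0, "f": 6.0, "ff": 12.0}

def loudness_pattern(total_frames, marks, reference_db=60.0):
    """Build a per-frame loudness pattern.

    marks is a list of (symbol, start_frame, duration_frames) taken from the
    music information (step S 501 and step S 503).
    """
    pattern = [reference_db] * total_frames
    for symbol, start, duration in marks:
        adjust = LOUDNESS_ADJUST_DB[symbol]  # step S 502: table lookup
        # Step S 503: apply the adjustment for the given time only.
        for k in range(start, min(start + duration, total_frames)):
            pattern[k] = reference_db + adjust
    return pattern
```

Frames outside any marked section keep the reference loudness.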
A singing voice signal generating unit 32 shown in FIG. 7 generates a singing voice from the fundamental frequency, loudness information, note length information, and phonetic symbols. For example, the unit can be a voice synthesizing device operated by a PARCOR method. The singing voice signals generated by the singing voice signal generating units 32 of singing voice signal synthesizing units 3a, 3b, and 3c of respective voice parts are added up in the chorus signal generating unit 4, output to the singing voice output unit 5, and then output as singing voices from the singing voice output unit 5 (for example, a speaker through an amplifier).
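The adding-up performed in the chorus signal generating unit 4 can be sketched as a sample-by-sample sum of the voice part signals. The function name `mix_parts` and the representation of each signal as a list of samples are assumptions; scaling and clipping are not addressed.

```python
def mix_parts(parts):
    """Sum the singing voice signals of all voice parts, sample by sample.

    parts is a list of sample lists; shorter parts are treated as silent
    after their last sample.
    """
    length = max(len(p) for p in parts)
    return [sum(p[n] for p in parts if n < len(p)) for n in range(length)]
```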
With the conventional singing voice synthesizing device, the change in the fundamental frequency of each voice part forming a chorus is made smooth, rather than discontinuous as shown in FIG. 12A, to obtain a natural sound of a chorus. That is, a musical sound signal of a singing voice in a chorus is provided with a kind of portamento and vibrato as described above.
However, when the above mentioned portamento and vibrato are provided, the generation parameters and rules of the portamento and vibrato are common to all voice parts and therefore respective voice parts are provided with the same portamento and vibrato.
Furthermore, since the note length is common to all voice parts when control is passed from a note of a pitch to the next note of another pitch, the singing voice of each voice part proceeds to the next note at completely the same timing.
Each voice part is provided with vibrato having the same parameter. The vibrato does not provide an irregular frequency fluctuation normally detected in a singing voice, but is a simple frequency modulation in which a musical sound signal of a singing voice having a constant pitch is modulated with a modulation frequency of a few hertz.
Furthermore, if a single voice part gives a performance in a chorus, the loudness of the voice part is made the same as that in a normal chorus. As a result, the single voice part performance sounds insufficient in loudness compared with a normal chorus.
As a result, a synthesized singing voice sounds unnatural and different from a live chorus.