Speech and singing differ significantly in terms of their production and perception by humans. In singing, for example, the intelligibility of the phonemic message is often secondary to the intonation and musical qualities of the voice. Vowels are often sustained much longer in singing than in speech, and precise, independent control of pitch and loudness over a large range is required. These requirements significantly differentiate synthesis of singing from speech synthesis.
Most previous approaches to synthesis of singing have relied on models that attempt to accurately characterize the human speech production mechanism. For example, the SPASM system developed by Cook (P. R. Cook, "SPASM, A Real Time Vocal Tract Physical Model Controller And Singer, The Companion Software Synthesis System," Computer Music Journal, Vol. 17, pp. 30-43, Spring 1993.) employs an articulator-based tube representation of the vocal tract and a time-domain glottal pulse input. Formant synthesizers such as the CHANT system (Bennett, et al., "Synthesis of the Singing Voice," in Current Directions in Computer Music Research, pp. 19-49, MIT Press 1989.) rely on direct representation and control of the resonances produced by the shape of the vocal tract. Each of these techniques relies, to a degree, on accurate modeling of the dynamic characteristics of the speech production process by an approximation to the articulatory system. Sinusoidal signal models are somewhat more general representations that are capable of high-quality modeling, modification, and synthesis of both speech and music signals. The success of previous work in speech and music synthesis motivates the application of sinusoidal modeling to the synthesis of singing voice.
In the article entitled "Frequency Modulation Synthesis of the Singing Voice," in Current Directions in Computer Music Research (pp. 57-64, MIT Press, 1989), John Chowning has experimented with frequency modulation (FM) synthesis of the singing voice. This technique, which has been a popular method of music synthesis for over 20 years, relies on creating complex spectra with a small number of simple FM oscillators. Although this approach offers a low-complexity means of producing rich spectra and musically interesting sounds, it has little or no correspondence to the acoustics of the voice, and seems difficult to control. The methods Chowning has devised resemble the "formant waveform" synthesis method of CHANT, where each formant waveform is created by an FM oscillator.
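The idea of building a complex spectrum from a few simple FM oscillators can be sketched as follows. This is only an illustrative sketch, not Chowning's implementation: the function names, the 16 kHz sample rate, and the oscillator settings are assumptions chosen for the example. Each oscillator places its carrier at an integer multiple of the fundamental and is modulated at the fundamental, so the summed output stays periodic at the pitch.

```python
import math


def fm_oscillator(carrier_hz, modulator_hz, index, amplitude,
                  num_samples, sample_rate=16000):
    """Return samples of amplitude * sin(2*pi*fc*t + index * sin(2*pi*fm*t))."""
    out = []
    for n in range(num_samples):
        t = n / sample_rate
        phase = (2.0 * math.pi * carrier_hz * t
                 + index * math.sin(2.0 * math.pi * modulator_hz * t))
        out.append(amplitude * math.sin(phase))
    return out


def fm_voice_like_tone(f0_hz, oscillators, num_samples, sample_rate=16000):
    """Sum one FM oscillator per (harmonic_number, index, amplitude) entry.

    Every carrier is harmonic_number * f0_hz and every modulator is f0_hz,
    so all sidebands fall on harmonics of f0_hz.
    """
    mix = [0.0] * num_samples
    for harmonic_number, index, amplitude in oscillators:
        osc = fm_oscillator(harmonic_number * f0_hz, f0_hz, index,
                            amplitude, num_samples, sample_rate)
        for n, sample in enumerate(osc):
            mix[n] += sample
    return mix


# Hypothetical settings (not taken from the cited article): three oscillators
# standing in for three spectral peaks of a sung vowel at a 200 Hz pitch.
settings = [(1, 1.0, 1.0), (3, 0.5, 0.4), (12, 0.3, 0.1)]
tone = fm_voice_like_tone(200.0, settings, num_samples=16000)
```

The modulation index of each oscillator controls how much energy spreads into the sidebands around its carrier, which is why a handful of oscillators can produce a rich spectrum, and also why the mapping from parameters to vocal-tract acoustics is indirect.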
Maher and Beauchamp, in an article entitled "An Investigation of Vocal Vibrato for Synthesis," in Applied Acoustics (Vol. 30, pp. 219-245, 1990), have experimented with wavetable synthesis of singing voice. Wavetable synthesis is a low-complexity method that involves filling a buffer with one period of a periodic waveform, and then cycling through this buffer to choose output samples. Pitch modification is made possible by cycling through the buffer at various rates. The waveform evolution is handled by updating samples of the buffer with new values as time evolves. Experiments were conducted to determine the perceptual necessity of the amplitude modulation which arises from frequency modulating a source that excites a fixed-formant filter, an effect that is more difficult to achieve in wavetable synthesis than in source/filter schemes. They found that this timbral/amplitude modulation was a critical component of naturalness, and should be included in the model.
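The buffer-cycling mechanism described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited authors' implementation: the function names, the 1024-sample table, the 16 kHz sample rate, and the use of linear interpolation between table entries are all choices made for the example. The sketch omits waveform evolution, which would amount to rewriting the table contents over time.

```python
import math


def make_wavetable(harmonic_amplitudes, table_size=1024):
    """Fill a buffer with exactly one period of a sum of harmonics."""
    table = []
    for i in range(table_size):
        phase = 2.0 * math.pi * i / table_size
        table.append(sum(a * math.sin(k * phase)
                         for k, a in enumerate(harmonic_amplitudes, start=1)))
    return table


def wavetable_oscillator(table, pitch_hz, num_samples, sample_rate=16000):
    """Cycle through the table; the step size per output sample sets the pitch."""
    size = len(table)
    increment = pitch_hz * size / sample_rate  # table positions per sample
    position = 0.0
    out = []
    for _ in range(num_samples):
        i0 = int(position)
        i1 = (i0 + 1) % size          # wrap around at the end of the period
        frac = position - i0
        # Linear interpolation between neighboring table entries (assumed).
        out.append((1.0 - frac) * table[i0] + frac * table[i1])
        position = (position + increment) % size
    return out


# The same table read at two rates yields the same waveform at two pitches.
table = make_wavetable([1.0, 0.5, 0.25])
low_note = wavetable_oscillator(table, 250.0, num_samples=16000)
high_note = wavetable_oscillator(table, 500.0, num_samples=16000)
```

Because the table holds a fixed waveform shape, changing the read rate shifts every harmonic together. This is why the formant-dependent amplitude modulation discussed above does not arise automatically in this scheme, as it would when a frequency-modulated source excites a fixed filter.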
In much previous singing synthesis work, the transitions from one phonetic segment to another have been represented by stylization of control parameter contours (e.g., formant tracks) through rules or interpolation schemes. Although many characteristics of the voice can be approximated with such techniques after painstaking hand-tuning of rules, very natural-sounding synthesis has remained an elusive goal.
In the speech synthesis field, many current systems back away from specification of such formant transition rules, and instead model phonetic transitions by concatenating segments from an inventory of collected speech data. For example, this is described by Macon, et al. in an article entitled "Speech Concatenation and Synthesis Using Overlap-Add Sinusoidal Model," in Proc. of International Conference on Acoustics, Speech and Signal Processing (Vol. 1, pp. 361-364, May 1996).
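The general idea of concatenating stored segments can be sketched as follows. This is a deliberately simplified time-domain illustration and is not the overlap-add sinusoidal method of the cited article: the function name, the linear cross-fade, and the toy inventory with its unit names are all hypothetical and stand in for recorded speech data.

```python
def crossfade_concatenate(segments, overlap):
    """Join sample lists, linearly cross-fading `overlap` samples at each joint."""
    if not segments:
        return []
    out = list(segments[0])
    for seg in segments[1:]:
        if overlap > len(out) or overlap > len(seg):
            raise ValueError("overlap is longer than a segment")
        start = len(out) - overlap
        for n in range(overlap):
            # Fade-in weight for the incoming segment; the outgoing segment
            # gets the complementary weight, so the two always sum to one.
            w = (n + 1) / (overlap + 1)
            out[start + n] = (1.0 - w) * out[start + n] + w * seg[n]
        out.extend(seg[overlap:])
    return out


# Hypothetical inventory: unit names mapped to stored sample sequences.
inventory = {
    "l-a": [0.0] * 10,
    "a-m": [1.0] * 10,
}
joined = crossfade_concatenate([inventory["l-a"], inventory["a-m"]], overlap=4)
```

With this approach the transition between phonetic segments comes from the recorded data itself, and only the short joint region is synthesized by blending, rather than the whole transition being generated by rule.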
For related patents, see E. Bryan George, et al., U.S. Pat. No. 5,327,518 entitled "Audio Analysis/Synthesis System," and E. Bryan George, et al., U.S. Pat. No. 5,504,833 entitled "Speech Approximation Using Successive Sinusoidal Overlap-Add Models and Pitch-Scale Modifications." These patents are incorporated herein by reference.