1. Field of the Invention
The invention relates to a method and apparatus for converting spoken context-independent triphones of one speaker into a singing voice sung in the manner of a target singer, and more particularly to performing singing voice synthesis from spoken speech.
2. Prior Art
Many of the components of speech systems are known in the art. Pitch detection (also called pitch estimation) can be performed by a number of methods. General methods include modified autocorrelation (M. M. Sondhi, “New Methods of Pitch Extraction”, IEEE Trans. Audio and Electroacoustics, Vol. AU-16, No. 2, pp. 262-266, June 1968), spectral methods (Yong Duk Cho, Hong Kook Kim, Moo Young Kim, Sang Ryong Kim, “Pitch Estimation Using Spectral Covariance Method for Low-Delay MBE Vocoder”, Speech Coding for Telecommunications Proceedings, 1997 IEEE Workshop, 7-10 Sep. 1997, pp. 21-22), and wavelet methods (Hideki Kawahara, Ikuyo Masuda-Katsuse, Alain de Cheveigne, “Restructuring speech representations using STRAIGHT-TEMPO: Possible role of a repetitive structure in sounds”, ATR-Human Information Processing Research Laboratories, Technical Report, 1997).
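By way of illustration only, autocorrelation-based pitch estimation can be sketched as follows. This is a plain normalized autocorrelation, not the modified method of the Sondhi reference, and the function name `estimate_pitch_autocorr`, the 50-500 Hz search range, and the tie-breaking margin are assumptions introduced for the example.

```python
import math

def estimate_pitch_autocorr(samples, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) of one frame, or 0.0 if none is found."""
    n = len(samples)
    mean = sum(samples) / n
    x = [s - mean for s in samples]          # remove any DC offset

    # Only search lags that correspond to plausible pitch periods.
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(n - 1, int(sample_rate / f_min))

    best_lag, best_score = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        num = sum(x[i] * x[i + lag] for i in range(n - lag))
        e1 = sum(x[i] ** 2 for i in range(n - lag))
        e2 = sum(x[i + lag] ** 2 for i in range(n - lag))
        den = math.sqrt(e1 * e2)
        score = num / den if den > 0.0 else 0.0
        # A longer lag must beat the current best by a small margin; this
        # favors the shortest period and discourages octave errors.
        if score > best_score + 1e-6:
            best_lag, best_score = lag, score

    return sample_rate / best_lag if best_lag else 0.0
```

For a clean periodic frame, the lag with the highest normalized correlation is the pitch period in samples, and dividing the sample rate by that lag gives the pitch in Hz. Real speech would typically need voicing detection and smoothing across frames, which are omitted here.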
Time-scaling of voice is also a technique that has been well described in the art. There are two general approaches to performing time-scaling. One is time-domain scaling. In this procedure, autocorrelation is performed on the signal to determine local peaks. The signal is split into frames according to the peaks found by the autocorrelation, and these frames are duplicated to lengthen the signal or removed to shorten it. One such implementation of this idea is the SOLAFS algorithm (Don Hejna, Bruce Musicus, “The SOLAFS time-scale modification algorithm”, BBN, July 1991).
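The frame duplication and removal described above can be sketched in a few lines. This is a deliberately simplified illustration and not the SOLAFS algorithm: it assumes a constant, already known frame length (for example one pitch period) and performs no overlap search or cross-fading. The name `time_scale_frames` is hypothetical.

```python
def time_scale_frames(samples, frame_len, factor):
    """Lengthen (factor > 1) or shorten (factor < 1) a signal by repeating
    or dropping whole frames of frame_len samples."""
    n_in = len(samples) // frame_len
    frames = [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_in)]

    n_out = int(round(n_in * factor))
    out = []
    for k in range(n_out):
        # Map each output frame back to an input frame; several output
        # frames may map to the same input frame (duplication), or some
        # input frames may never be selected (removal).
        src = min(n_in - 1, int(k / factor))
        out.extend(frames[src])
    return out
```

With a factor of 2.0 every input frame appears twice in the output, and with a factor of 0.5 every second input frame is dropped. When the frame length matches the pitch period, frame boundaries fall at the same point of each cycle, which is why such methods align frames to autocorrelation peaks.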
Another method of time-scaling uses a phase vocoder. A vocoder takes a signal and performs a windowed Fourier transform, creating a spectrogram and phase information. In a time-scaling algorithm, windowed sections of the Fourier transform are either duplicated or removed depending on the type of scaling. The implementations and algorithms are described in (Mark Dolson, “The phase vocoder: A tutorial”, Computer Music Journal, vol. 10, no. 4, pp. 14-27, 1986) and (Jean Laroche, Mark Dolson, “New Phase Vocoder Technique for Pitch-Shifting, Harmonizing and Other Exotic Effects”, IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, New Paltz, N.Y., 1999).
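A minimal sketch of phase-vocoder time-scaling follows, under several stated assumptions: a naive DFT is used so that the example is self-contained (a practical system would use an FFT), the window is a periodic Hann window, magnitudes are taken from the nearest earlier frame without interpolation, and the name `phase_vocoder_stretch` and its default parameters are hypothetical. It is not the specific technique of either cited reference.

```python
import cmath
import math

def _dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def _idft_real(spec):
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def phase_vocoder_stretch(samples, factor, n_fft=64, hop=16):
    """Time-stretch by `factor` (> 1 lengthens) while keeping frequencies.
    Assumes len(samples) >= n_fft + hop."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * t / n_fft) for t in range(n_fft)]

    # Analysis: windowed Fourier transform of overlapping frames.
    n_frames = 1 + (len(samples) - n_fft) // hop
    spectra = [_dft([samples[i * hop + t] * window[t] for t in range(n_fft)])
               for i in range(n_frames)]

    # Re-read the frame sequence at 1/factor speed. Magnitudes are reused
    # (frames are duplicated or skipped); phases are accumulated from the
    # measured frame-to-frame phase advance so each bin stays continuous.
    phase_acc = [cmath.phase(c) for c in spectra[0]]
    out_spectra = []
    m = 0
    while int(m / factor) + 1 < n_frames:
        i = int(m / factor)
        cur, nxt = spectra[i], spectra[i + 1]
        out_spectra.append([abs(cur[k]) * cmath.exp(1j * phase_acc[k])
                            for k in range(n_fft)])
        for k in range(n_fft):
            phase_acc[k] += cmath.phase(nxt[k]) - cmath.phase(cur[k])
        m += 1

    # Synthesis: inverse transform and windowed overlap-add.
    length = n_fft + hop * (len(out_spectra) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for m, spec in enumerate(out_spectra):
        frame = _idft_real(spec)
        for t in range(n_fft):
            out[m * hop + t] += frame[t] * window[t]
            norm[m * hop + t] += window[t] ** 2
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

Because the analysis and synthesis hops are equal in this sketch, the raw phase difference between neighboring frames can be accumulated directly; variants that use different hops must first unwrap that difference against the expected phase advance of each bin.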
Voice analysis and synthesis is a method of decomposing speech into representative components (the analysis stage) and manipulating those components to create new sounds (the synthesis stage). In particular, this process uses a special type of voice analysis/synthesis tool based on the source-filter model, which breaks speech down into an excitation noise (produced by the vocal folds) and a filter (produced by the vocal tract). Examples and descriptions of voice analysis-synthesis tools can be found in (Thomas E. Tremain, “The Government Standard Linear Predictive Coding Algorithm: LPC-10”, Speech Technology Magazine, April 1982, pp. 40-49), (Xavier Serra, “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System based on a Deterministic plus Stochastic Decomposition”, Computer Music Journal, 14(4):12-24, 1990), and (Mark Dolson, “The phase vocoder: A tutorial”, Computer Music Journal, vol. 10, no. 4, pp. 14-27, 1986).
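The source-filter decomposition can be illustrated with linear prediction, one common way of estimating the filter. The sketch below is a generic autocorrelation-method LPC using the Levinson-Durbin recursion, not the LPC-10 standard itself, and all function names are assumptions made for the example. The analysis stage estimates an all-pole filter and inverse-filters the signal to obtain the excitation; the synthesis stage passes an excitation back through the filter.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations. Returns (a, err) where a[0] == 1 and
    the prediction error is e[n] = sum_j a[j] * x[n - j]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err

def inverse_filter(x, a):
    """Analysis: remove the filter, leaving the excitation (residual)."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """Synthesis: drive the all-pole filter 1/A(z) with excitation e."""
    x = []
    for n in range(len(e)):
        acc = e[n]
        for j in range(1, len(a)):
            if n - j >= 0:
                acc -= a[j] * x[n - j]
        x.append(acc)
    return x
```

In this model, replacing or modifying the excitation (for example its pitch) before calling `synthesis_filter` changes how the sound is produced while the filter coefficients continue to carry the vocal-tract characteristics.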
The closest known prior art to the present invention is the singing voice synthesis method and apparatus described in U.S. Pat. No. 7,135,636, which produces a singing voice from a generalized phoneme database. The purpose of the method and apparatus of that patent was to create an idealized singer that could sing given a note, lyrics, and a phoneme database. However, maintaining the identity of the original speaker was not an aim of the method and apparatus of that patent. A principal drawback of the method and apparatus of that patent is its inability to produce a singing voice that retains the identity of the original speaker while being sung in the manner of a target singer.