Spectral envelope estimation has been studied extensively, but estimating an appropriate envelope remains difficult. Some studies have applied group delays to sound synthesis, but such applications require time information called pitch marks.
For example, source-filter analysis (Non-Patent Document 1) is an important way to deal with human sounds (singing and speech) and instrumental sounds. An appropriate spectral envelope obtained from an audio signal (an observed signal) is useful in a wide range of applications, such as high-accuracy sound analysis and high-quality sound synthesis and transformation. If phase information (group delays) can be estimated appropriately in addition to the spectral envelope, the naturalness of synthesized sounds can be improved.
In the field of sound analysis, great importance has been placed on amplitude spectrum information, while little attention has been paid to phase information (group delays). In sound synthesis, however, the phase plays an important role in perceived naturalness. In sinusoidal synthesis, for example, if the initial phase is shifted from that of the natural utterance by more than π/8, perceived naturalness is known to decrease monotonically with the magnitude of the shift (Non-Patent Document 2). Also, in sound analysis and synthesis, when an impulse response is obtained from a spectral envelope to define a unit waveform (a waveform for one period), the minimum-phase response is known to give better naturalness than the zero-phase response (Non-Patent Document 3). Further, there have been studies on phase control of the unit waveform for improved naturalness (Non-Patent Document 4).
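The difference between the minimum-phase and zero-phase responses mentioned above can be illustrated with a minimal sketch. It assumes the commonly used cepstral folding method for deriving a minimum-phase response from an amplitude envelope; the cited documents may use a different procedure, and the function names `minimum_phase_response` and `zero_phase_response` are hypothetical.

```python
import numpy as np

def minimum_phase_response(amplitude, n_fft):
    """Impulse response with the given amplitude envelope and minimum phase.

    `amplitude` holds n_fft // 2 + 1 positive values (bins 0 .. Nyquist).
    Assumption: cepstral folding is used; this is one common method only.
    """
    log_amp = np.log(np.maximum(amplitude, 1e-12))
    # Real cepstrum of the log-amplitude envelope (symmetric in quefrency).
    cepstrum = np.fft.irfft(log_amp, n_fft)
    # Fold the negative quefrencies onto the positive ones.
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0
    fold[n_fft // 2] = 1.0
    min_phase_spectrum = np.exp(np.fft.rfft(cepstrum * fold))
    return np.fft.irfft(min_phase_spectrum, n_fft)

def zero_phase_response(amplitude, n_fft):
    """Impulse response with the same envelope and zero phase (symmetric about n = 0)."""
    return np.fft.irfft(amplitude, n_fft)
```

Both responses have the same amplitude spectrum. The zero-phase response is symmetric in time, whereas the minimum-phase response concentrates its energy at the onset of the unit waveform.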
Further, many studies have been made on signal modeling for high-quality synthesis and transformation of audio signals. Some of these studies use no supplemental information, some use F0 estimation as supplemental information, and others require phoneme labels. As a typical technique, the Phase Vocoder (Non-Patent Documents 5 and 6) handles input signals in the form of a power spectrogram in the time-frequency domain. This technique enables temporal expansion and contraction of periodic signals, but its quality is degraded by aperiodicity and F0 fluctuation.
In addition, LPC (Linear Predictive Coding) analysis (Non-Patent Documents 7 and 8) and the cepstrum are widely known as conventional techniques for spectral envelope estimation. Various modifications and combinations of these techniques have been proposed (Non-Patent Documents 9 to 13). Since the contour of the envelope is determined by the order of analysis in LPC or the cepstrum, the envelope cannot be represented appropriately for some orders of analysis.
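The dependence of the envelope contour on the order of analysis can be seen in a minimal sketch of cepstrum-based envelope estimation. The function name `cepstral_envelope`, the Hann analysis window, and the FFT length are assumptions made for illustration and are not taken from the cited documents.

```python
import numpy as np

def cepstral_envelope(frame, order, n_fft=1024):
    """Log-amplitude envelope from the first `order` cepstral coefficients.

    Assumes len(frame) <= n_fft and 0 <= order <= n_fft // 2.
    """
    # Hann-windowed amplitude spectrum (the window choice is an assumption).
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n_fft)
    log_amp = np.log(np.abs(spectrum) + 1e-12)
    # Real cepstrum of the log-amplitude spectrum.
    cepstrum = np.fft.irfft(log_amp, n_fft)
    # Keep quefrencies 0..order and their mirror images, discard the rest.
    lifter = np.zeros(n_fft)
    lifter[:order + 1] = 1.0
    lifter[n_fft - order:] = 1.0
    return np.fft.rfft(cepstrum * lifter).real
```

With a low order the result is smooth but may miss envelope detail; with a high order it starts to follow the individual harmonics, so no single order suits every F0.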
In PSOLA (Pitch Synchronous Overlap-Add) (Non-Patent Documents 1 and 14), known as a conventional F0-adaptive analysis technique, the estimated F0 is used as supplemental information. Time-domain waveforms are cut out as unit waveforms based on pitch marks, and the unit waveforms thus cut out are overlap-added at intervals of the fundamental period. This technique can deal with changing F0, and the stored phase information helps provide high-quality sound synthesis. However, the technique still has problems: pitch mark allocation is difficult, and quality is reduced when F0 is changed and for non-stationary sounds.
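The cut-out and overlap-add operations can be sketched as follows. This is a simplified illustration and not the procedure of the cited documents: it assumes that the pitch marks are already given as sample indices (the step the text identifies as difficult), that the fundamental period is constant, and that a Hann window two periods long is used. The function name `psola_overlap_add` is hypothetical.

```python
import numpy as np

def psola_overlap_add(signal, pitch_marks, period, new_period):
    """Cut unit waveforms at pitch marks and overlap-add them at a new period.

    Assumptions: constant analysis `period` (in samples), known `pitch_marks`
    (sample indices), and a Hann window spanning two periods.
    """
    window = np.hanning(2 * period + 1)
    out = np.zeros(len(signal))
    pos = pitch_marks[0]  # first synthesis position
    while pos + period < len(signal):
        # Use the unit waveform at the pitch mark nearest to this position.
        mark = min(pitch_marks, key=lambda m: abs(m - pos))
        if mark - period >= 0 and mark + period < len(signal) and pos - period >= 0:
            unit = signal[mark - period:mark + period + 1] * window
            out[pos - period:pos + period + 1] += unit
        pos += new_period
    return out
```

When `new_period` equals `period`, the overlapping Hann windows sum to one and the signal between the first and last pitch marks is reproduced; a smaller `new_period` raises F0 while each unit waveform keeps its original shape and phase.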
Also in sinusoidal models of voice and music signals (Non-Patent Documents 15 and 16), F0 estimation is used for modeling the harmonic structure. Many extensions of these models have been proposed, such as modeling of harmonic components and broadband components (noise, etc.) (Non-Patent Documents 17 and 18), estimation from the spectrogram (Non-Patent Document 19), iterative estimation of parameters (Non-Patent Documents 20 and 21), estimation based on quadratic interpolation (Non-Patent Document 22), improved temporal resolution (Non-Patent Document 23), estimation of non-stationary sounds (Non-Patent Documents 24 and 25), and estimation of overlapped sounds (Non-Patent Document 26). Most of these sinusoidal models can provide high-quality sound synthesis since they use phase estimation, and some of them have high temporal resolution (Non-Patent Documents 23 and 24).
STRAIGHT, a system (VOCODER) based on source-filter analysis, incorporates F0-adaptive analysis and is widely used in the speech research community throughout the world for its high-quality sound analysis and synthesis. In STRAIGHT, the spectral envelope is obtained by removing periodicity from the input audio signal through F0-adaptive smoothing and other processing. The system provides high quality and has high temporal resolution. Extensions of this system include TANDEM STRAIGHT (Non-Patent Document 28), which eliminates temporal fluctuations by use of tandem windows, emphasis placed on spectral peaks (Non-Patent Document 29), and fast calculation (Non-Patent Document 30). In the STRAIGHT system and these extensions, the following techniques, for example, are introduced in an attempt to improve the naturalness of synthesized sounds: mixed mode excitation with Gaussian noise convolved with non-periodic components (defined as components which cannot be represented by the sum of harmonics or by the response driven by periodic pulse trains) without estimating the original phase, and group delay randomization in the high frequency range. However, standards for phase manipulation have not been established. Further, excitation extraction (Non-Patent Document 31) extracts excitation signals by deconvolution of the original audio signal and the impulse response waveforms of the estimated envelope. This technique cannot be said to represent the phase efficiently, and it is difficult to apply to interpolation and conversion. Some studies on sound analysis and synthesis that estimate and smooth group delays (Non-Patent Documents 32 and 33) require pitch marks.
In addition to the foregoing studies, there are studies on Gaussian mixture modeling (GMM) of the spectral envelope, STRAIGHT spectral envelope modeling (Non-Patent Document 34), and formulated joint estimation of F0 and the spectral envelope (Non-Patent Document 35).
Problems common to the studies described so far are that the analysis is limited to local observation, that only the harmonic structure (frequency components at integer multiples of F0) is modeled, and that transfer functions between adjacent harmonics can be obtained only by interpolation.
Further, some studies utilize phoneme labels as supplemental information. For example, attempts have been made to estimate a true envelope by integrating spectra at different F0 (different frames) that share the same phoneme as at the time of analysis, for the purpose of estimating unobservable envelope components between harmonics (Non-Patent Documents 36 through 38). One such study is directed not to a single sound but to the vocal in a music audio signal (Non-Patent Document 39). This study assumes that the same phoneme has a similar vocal tract shape. In this case, accurate phoneme labels are required. Furthermore, if the target sound, such as a singing voice, fluctuates largely depending upon the context, excessive smoothing may result.
JP10-97287A (Patent Document 1) discloses an invention for obtaining a phase adjusting component, comprising the steps of: convolving a random number with a band-limiting function in the frequency domain to obtain a band-limited random number; multiplying a target value of delay time fluctuation by the band-limited random number to obtain group delay characteristics; calculating an integral of the group delays with respect to frequency to obtain phase characteristics; and multiplying the phase characteristics by the imaginary unit to obtain the exponent of an exponential function, thereby obtaining the phase adjusting component.
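One possible reading of the steps attributed to Patent Document 1 is sketched below. The details are assumptions made for illustration and are not taken from Patent Document 1: Gaussian random numbers, a Hann-shaped band-limiting function, normalization to unit standard deviation, rectangular-rule integration, and the sign convention in which the group delay is the negative derivative of the phase with respect to angular frequency. The function names are hypothetical.

```python
import numpy as np

def band_limited_group_delay(n_bins, delay_std, smooth_bins, seed=0):
    """Group delay (seconds) per frequency bin from band-limited random numbers.

    Assumptions: Gaussian random numbers, a Hann-shaped band-limiting function
    (smooth_bins >= 3), and `delay_std` as the target delay fluctuation.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n_bins)
    kernel = np.hanning(smooth_bins)
    kernel /= kernel.sum()
    # Convolution along the frequency axis yields a band-limited random number.
    band_limited = np.convolve(noise, kernel, mode="same")
    band_limited /= band_limited.std()
    return delay_std * band_limited

def phase_adjusting_component(group_delay, fs):
    """Integrate group delay over frequency and return exp(j * phase).

    Assumes bins cover 0 .. fs / 2 uniformly and group delay = -d(phase)/d(omega).
    """
    n_bins = len(group_delay)
    d_omega = 2.0 * np.pi * (fs / 2.0) / (n_bins - 1)  # rad/s per bin
    # Rectangular-rule integral, with the phase fixed to zero at 0 Hz.
    phase = -(np.cumsum(group_delay) - group_delay[0]) * d_omega
    return np.exp(1j * phase)
```

The resulting component has unit magnitude at every frequency, so multiplying a spectrum by it changes only the phase. As a check of the sign convention, a constant group delay of 1 ms at a 16 kHz sampling rate corresponds to a pure delay of 16 samples.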