Instantaneous frequency is a natural extension of the concept of frequency to signals that change with time. It has many characteristics that make it suitable for representing a nonstationary signal such as a voice signal. These characteristics have been applied to signal processing of various types: (1) voice coding on the basis of a sinusoidal-wave model, (2) formant extraction and bandwidth estimation, (3) extraction of the harmonic structure of voiced sound, (4) extraction of a fundamental frequency, and (5) computational models of auditory information processing. Hereinafter, the frequencies, phases, and fundamental frequencies of the component sinusoidal waves of a sinusoidal-wave model, their strengths in terms of periodicity (or the ratio between periodic components and aperiodic components), and the like are collectively referred to as “sound-source information.” However, important potential uses of this concept, in particular extraction of sound-source information from speech sound, have not yet been studied sufficiently. Recent studies in this area have revealed that use of instantaneous frequency leads to an excellent method for extracting sound-source information.
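As a minimal illustration of the concept (a sketch only, not part of the method described in this specification), the instantaneous frequency of a complex-valued signal can be estimated from the phase advance between consecutive samples. The function name `instantaneous_frequency`, the 8 kHz sampling rate, and the chirp parameters are assumptions chosen for the example.

```python
import cmath
import math

def instantaneous_frequency(z, fs):
    """Estimate instantaneous frequency in Hz of a complex signal z
    from the phase advance between consecutive samples."""
    return [cmath.phase(z[n + 1] * z[n].conjugate()) * fs / (2.0 * math.pi)
            for n in range(len(z) - 1)]

fs = 8000.0                    # sampling rate in Hz (assumed for the example)
f_start, rate = 100.0, 200.0   # chirp starts at 100 Hz and rises 200 Hz per second

# Complex linear chirp lasting 0.5 s: its frequency changes at every sample.
z = [cmath.exp(2j * math.pi * (f_start * (n / fs) + 0.5 * rate * (n / fs) ** 2))
     for n in range(4000)]

inst_f = instantaneous_frequency(z, fs)
print(inst_f[0], inst_f[-1])
```

For this chirp the estimate rises linearly from about 100 Hz toward 200 Hz, following the frequency sample by sample rather than averaging it over an analysis window.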
In the case in which a conspicuous sinusoidal-wave component is present in a passband shared by a plurality of bandpass filters having different center frequencies, the outputs of those filters are known to assume a substantially constant instantaneous frequency. In other words, the mapping from filter center frequency to output instantaneous frequency has a fixed point in the vicinity of the frequency of the conspicuous component. This property is used for extraction of conspicuous resonances such as harmonic components of complex sound and formants of speech sound. Further, it has been pointed out that this property is related to the phenomenon of synchronous firing among different auditory nerve fibers, and the “synchrony strand” has been developed as a model for representing a corresponding auditory entity. However, no clear scheme has been available for integrating these ideas into a consistent F0 (fundamental frequency) extraction method.
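The fixed-point property described above can be sketched numerically. The example below is an illustration under stated assumptions, not the filter design of this specification: it assumes complex Gabor filters (Gaussian envelope, one period of the center frequency as standard deviation), a synthetic three-harmonic signal with a 120 Hz fundamental, and a 4 kHz sampling rate. The names `gabor_output`, `output_if`, and `fixed_points` are hypothetical.

```python
import cmath
import math

def gabor_output(x, n, fc, fs, cycles=1.0):
    """Output at sample n of a complex Gabor filter centred on fc (Hz).
    The Gaussian envelope has a standard deviation of `cycles` periods of fc."""
    sigma = cycles * fs / fc          # envelope standard deviation in samples
    half = int(4 * sigma)             # truncate the envelope at +/- 4 sigma
    acc = 0j
    for k in range(-half, half + 1):
        tap = math.exp(-0.5 * (k / sigma) ** 2) * cmath.exp(2j * math.pi * fc * k / fs)
        acc += x[n - k] * tap
    return acc

def output_if(x, n, fc, fs):
    """Instantaneous frequency (Hz) of the filter output between samples n and n+1."""
    y0 = gabor_output(x, n, fc, fs)
    y1 = gabor_output(x, n + 1, fc, fs)
    return cmath.phase(y1 * y0.conjugate()) * fs / (2.0 * math.pi)

def fixed_points(centres, inst_freqs):
    """Centre frequencies where (output IF - centre) crosses zero going downward,
    located by linear interpolation between neighbouring filters."""
    pts = []
    for i in range(len(centres) - 1):
        d0 = inst_freqs[i] - centres[i]
        d1 = inst_freqs[i + 1] - centres[i + 1]
        if d0 > 0.0 >= d1:
            pts.append(centres[i] + (centres[i + 1] - centres[i]) * d0 / (d0 - d1))
    return pts

fs, f0 = 4000.0, 120.0
harmonics = ((1, 1.0), (2, 0.5), (3, 0.3))   # (harmonic number, amplitude), assumed
x = [sum(a * math.cos(2.0 * math.pi * f0 * h * m / fs) for h, a in harmonics)
     for m in range(1000)]

centres = [80.0 + 5.0 * i for i in range(25)]          # 80 Hz to 200 Hz in 5 Hz steps
inst = [output_if(x, 500, fc, fs) for fc in centres]   # evaluate mid-signal
fps = fixed_points(centres, inst)
print(fps)
```

Filters whose passbands contain the 120 Hz component all report an output instantaneous frequency close to 120 Hz, so the mapping crosses the identity line near that frequency. Only downward crossings are kept, since these correspond to the stable fixed points associated with a dominant component.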
The present inventor has recently proposed a high-quality system for analysis, conversion, and synthesis of voice, called “STRAIGHT.” STRAIGHT is obtained through refining the concept of a classical channel vocoder on the basis of generalized pitch synchronization analysis. In the present specification, the conventionally-used term “pitch synchronization analysis” is used. In the field of voice information processing, the term “pitch” is used to express the same meaning as that of fundamental frequency (F0). However, this is inaccurate use of the term. F0, which represents a physical attribute, is essentially different from pitch, which represents a psychological attribute. In the present specification, the term “pitch” is not used, except for the case in which psychological attributes are mentioned. In the STRAIGHT method, since analysis adapted for F0 is performed, accurate and reliable F0 information is needed for each fundamental period of voiced sound, which is defined to be a single open/close cycle of the glottis. The inventor carried out studies while applying various conventionally-proposed F0-extraction methods and as a result found that conventional methods cannot satisfy the requirement on temporal resolution and the requirement on frequency accuracy. Further, the inventor found that in the case in which an extracted F0 contains a discontinuous component or a component that varies at high speed, the perceptual quality of voice synthesized on the basis of the F0 information deteriorates, even if the absolute values of the components are small. Moreover, the inventor found that judgment of unvoiced sound/voiced sound greatly affects synthesis of perceptually high-quality voice, and in some cases, temporal accuracy of a few milliseconds or less is demanded. Also, it was found that when a bias in a particular direction is not present, a trend component which gradually changes the F0 has no adverse perceptual influence on synthesized voice.
Heretofore, many F0-extraction methods and apparatus have been proposed: time-domain methods based on interval measurement, frequency-domain methods based on the spectrum, methods in which autocorrelation and a harmonic sieve (a sieve for extracting harmonic components) are used singly or in combination, and biologically motivated methods. These methods and apparatus presuppose that the signal to be analyzed is periodic in the mathematical sense. In each of them, a value estimated on the basis of mathematical periodicity provides a correct estimate of F0 for a signal whose F0 is constant over time. However, it is not clear whether conventional methods and apparatus can provide correctly estimated F0 values in analysis of a real voice, where F0 changes with time, or in analysis of complex sound in which the frequencies of the sinusoidal-wave components deviate slightly from a harmonic relation.
In the proposed high-quality voice conversion system, conversion and re-synthesis of voice must be performed on the basis of accurate sound-source information of an original voice. Therefore, in order to improve this system, an F0-extraction method is needed that can rationally be applied to a signal whose F0 changes with time and to a signal which includes non-harmonic components. This observation motivated the inventor to develop a new F0-extraction method and apparatus which produce an accurate F0 locus with high temporal resolution by use of the instantaneous frequency of the fundamental component.
In the STRAIGHT method, an F0-extraction method based on instantaneous frequency has been developed and used on the assumption that a filtered signal containing the fundamental-wave component involves minimal amplitude modulation (AM) and frequency modulation (FM). This F0-extraction method performed well in an evaluation test in which an EGG (electroglottograph) signal recorded simultaneously with voice was used as a reference signal. For example, in analysis of 100 sentences spoken by an adult female speaker, the error between the F0 obtained from voice and the F0 obtained from the EGG signal was 20% or higher in only 1.4% of all analyzed frames. Further, in 53% of all analyzed frames, the F0 obtained from voice fell within 0.3% of the F0 obtained from the EGG signal. However, the above-described assumption of minimal AM and FM is formulated ambiguously, and the formulation is not mathematically rigorous. Further, this method involves a problem in that the standard deviation of F0 errors for an adult male voice is about double that for an adult female voice.
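The frame-level statistics quoted above (the share of frames with an error of 20% or higher, and the share within 0.3% of the reference) can be computed along the following lines. This is a sketch under assumptions: the function name `f0_error_rates`, the convention that a reference value of 0 marks an unvoiced frame, and the sample values are all hypothetical and are not data from the evaluation described here.

```python
def f0_error_rates(f0_est, f0_ref, gross=0.20, fine=0.003):
    """Return (gross_rate, fine_rate) over frames with a positive reference F0.

    gross_rate: fraction of frames whose relative error is `gross` or higher.
    fine_rate:  fraction of frames whose relative error is `fine` or lower.
    """
    rel = [abs(e - r) / r for e, r in zip(f0_est, f0_ref) if r > 0]
    if not rel:
        raise ValueError("no frames with a positive reference F0")
    gross_rate = sum(1 for v in rel if v >= gross) / len(rel)
    fine_rate = sum(1 for v in rel if v <= fine) / len(rel)
    return gross_rate, fine_rate

# Hypothetical per-frame values in Hz; the last frame is unvoiced in the reference.
f0_from_egg = [200.0, 200.0, 200.0, 200.0, 0.0]
f0_from_voice = [200.2, 210.0, 100.0, 199.9, 150.0]

gross_rate, fine_rate = f0_error_rates(f0_from_voice, f0_from_egg)
print(gross_rate, fine_rate)
```

Relative rather than absolute error is used so that the same thresholds apply to voices with different F0 ranges.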
The present invention provides the mathematical basis necessary for a new F0-extraction method and apparatus, which extend the above-described method. Detailed study of the partial derivatives, at a fixed point, of the function relating filter center frequency to output instantaneous frequency was the key to providing this basis. Thus, the present invention leads to a new, consistent method and apparatus for extracting F0 and sound-source information, which utilize the non-stationary aspect of the concept of instantaneous frequency.
An object of the present invention is to provide a method and apparatus for extracting sound-source information, which method enables the characteristics of fixed points of mapping from filter center frequency to output instantaneous frequency to be detected from instantaneous data, as a value which can be interpreted quantitatively.
[1] In a method and apparatus for extracting sound-source information by use of fixed points of mapping from frequency to instantaneous frequency, the instantaneous frequency of each filter is partially differentiated with respect to frequency to thereby obtain a first value; the output of each filter is partially differentiated with respect to frequency and then with respect to time to thereby obtain a second value; and proper weights are imparted to the first and second values and short-time weighted integration with respect to time is performed to estimate a carrier-to-noise ratio of each filter, whereby the carrier-to-noise ratio is obtained and an estimate of an evaluation value is obtained.
[2] In the method and apparatus for extracting sound-source information described in [1] above, on the basis of the evaluation value estimated by use of the carrier-to-noise ratio, a logarithm-frequency-axis analogous filter is used for selection of a fixed point corresponding to a fundamental frequency, and the fundamental frequency is extracted without advance information regarding the fundamental frequency.
[3] In the method and apparatus for extracting sound-source information described in [2] above, the logarithm-frequency axis analogous filter and a linear-frequency-axis analogous adapted chirp filter are used in combination in order to extract the fundamental frequency without advance information regarding the fundamental frequency and to improve the accuracy of the extracted fundamental frequency.
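To illustrate what a filter arrangement on a logarithmic frequency axis might look like, the sketch below places center frequencies at equal steps in log-frequency, so that every filter spans the same fraction of an octave and no prior knowledge of the fundamental frequency is required to cover the search range. The function name `log_spaced_centres`, the 40 Hz to 800 Hz search range, and the density of 24 filters per octave are assumptions made for the example and are not values taken from this specification.

```python
import math

def log_spaced_centres(f_low, f_high, per_octave):
    """Centre frequencies (Hz) equally spaced on a log-frequency axis,
    starting at f_low and not exceeding f_high."""
    count = int(math.floor(per_octave * math.log2(f_high / f_low))) + 1
    return [f_low * 2.0 ** (i / per_octave) for i in range(count)]

# Assumed search range and density, chosen only for illustration.
centres = log_spaced_centres(40.0, 800.0, 24)
print(len(centres), centres[0], centres[-1])
```

With this spacing, neighbouring center frequencies always differ by the same ratio, so a filter whose bandwidth is proportional to its center frequency has the same shape everywhere on the log-frequency axis.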