In the analysis/synthesis of speech sound, when the intonation of speech sound is controlled, or when speech sound is synthesized for editorial purposes so as to have the intonation of natural speech sound, the fundamental frequency of the speech sound should be converted while the tone of the original speech sound is maintained. Likewise, when a sound from the natural world is sampled for use as a sound source of an electronic musical instrument, the fundamental frequency should be converted while the tone is kept constant. In such conversion, it should be possible to set the fundamental frequency more finely than the resolution determined by the sampling period. When speech sound is altered to conceal the individual features of an information provider in order to protect his/her privacy, either the tone should be changed with the pitch left unchanged, or both the tone and the pitch should be changed.
There is an increasing demand for reuse of existing speech sound resources, for example by synthesizing the voices of different actors into a new voice without employing a voice actor. As society ages, more people will have difficulty hearing speech sound or music because of various kinds of hearing impairment or cognitive impairment. There is therefore a strong demand for a method of converting the speed, frequency band, or pitch of a voice so that it is adapted to the deteriorated hearing or cognitive ability without loss of the original information.
To achieve such objects, one related art technique assumes a model representing a spectral envelope and, taking the spectral peaks into consideration, optimizes the parameters of the model by approximation under an appropriate evaluation function to obtain the spectral envelope (for example, see “Speech Analysis Synthesis System Using the Log Magnitude Approximation Filter” by Satoshi IMAI and Tadashi KITAMURA, Journal of the Institute of Electronic and Communication Engineers, 78/6, Vol. J61-A, No. 6, pp 527-534).
In another related art technique, the periodicity of the signal is incorporated into a method of estimating the parameters of an autoregressive model (for example, see “A Formant Extraction not influenced by Pitch Frequency Variations” by Kazuo Nakata, Journal of Japanese Acoustic Sound Association, Vol. 50, No. 2 (1994), pp 110-116).
Each of these related art techniques is based on the assumption of a specific model, so it cannot estimate a spectral envelope correctly unless the number of parameters describing the model is appropriately determined. In addition, if the nature of the signal source differs from the assumed model, a component resulting from the periodicity is mixed into the estimated spectral envelope, and an even larger error may occur. Furthermore, these techniques require iterative operations until the optimization converges, and are therefore not suitable for applications with strict time limits, such as real-time processing.
As for control of the periodicity, because the sound source and the spectral envelope are separated into a pulse train and a filter, respectively, the periodicity of a signal cannot be specified with an accuracy finer than the temporal resolution determined by the sampling frequency.
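The resolution limit of a pulse-train source can be illustrated with a small calculation. The following is a minimal sketch, assuming a hypothetical sampling frequency of 8000 Hz and a source whose pulses must fall on integer sample positions; the function name and the target values are illustrative and do not come from the cited techniques.

```python
def representable_f0(fs_hz, f0_target_hz):
    """Return the integer pulse period (in samples) closest to the target
    fundamental frequency, and the frequency that period actually produces.

    Assumption: pulses can only be placed on whole sample positions.
    """
    period_samples = max(1, round(fs_hz / f0_target_hz))
    return period_samples, fs_hz / period_samples


fs = 8000.0  # hypothetical sampling frequency in Hz
for target in (200.0, 202.0, 204.0):
    period, actual = representable_f0(fs, target)
    print(f"target {target:.1f} Hz -> period {period} samples -> {actual:.2f} Hz")
```

Under these assumptions, targets of 200 Hz and 202 Hz both map to a period of 40 samples (200.00 Hz), while 204 Hz maps to 39 samples (205.13 Hz); no fundamental frequency between those two values can be specified.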
In another related art technique, speech sound processing referred to as PSOLA (Pitch Synchronous OverLap Add) is performed by reduction/expansion of waveforms and time-shifted overlapping in the temporal domain.
In this related art technique, if the periodicity of the sound source is changed by about 20% or more, the speech sound loses its natural quality, so speech sound cannot be converted in a flexible manner.
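The overlap-add operation underlying PSOLA can be sketched as follows. This is a simplified illustration rather than the technique as practiced: it assumes a constant, known input period so that pitch marks are evenly spaced, uses two-period Hann-windowed grains, and keeps the output the same length as the input. All function names and parameters are hypothetical.

```python
import math


def hann_window(length):
    """Symmetric Hann window: zero at both ends, one at the centre."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1))
            for i in range(length)]


def psola_constant_pitch(x, period_in, period_out):
    """Change the pitch period of x from period_in to period_out samples.

    Assumption: x has a constant period, so analysis pitch marks are
    placed every period_in samples instead of being detected.
    """
    grain_len = 2 * period_in + 1
    window = hann_window(grain_len)
    marks_in = list(range(period_in, len(x) - period_in, period_in))
    y = [0.0] * len(x)
    if not marks_in:
        return y
    center_out = period_in
    while center_out + period_in < len(x):
        # Reuse the analysis grain whose mark is nearest in time.
        center_in = min(marks_in, key=lambda m: abs(m - center_out))
        for j in range(grain_len):
            y[center_out - period_in + j] += window[j] * x[center_in - period_in + j]
        center_out += period_out  # time-shifted overlap-add
    return y


# Impulse train with a 40-sample period, re-spaced to a 32-sample period.
x = [1.0 if n % 40 == 0 else 0.0 for n in range(400)]
y = psola_constant_pitch(x, 40, 32)
```

In this sketch the waveform of each grain is copied unchanged and only the spacing between grains is altered, which gives one intuition for why large changes in period degrade the result: the more the grain spacing departs from the original period, the less the overlapping grains resemble the original waveform.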
In the related art techniques for extracting the fundamental frequency, the conditions that the extraction must satisfy are not derived logically from the requirements of speech synthesis, so the design lacks a sound basis. In addition, there is no principle governing the temporal resolution, and the size of the time window is determined by trial and error or the like. For this reason, when a signal synthesized using the extracted fundamental frequency is re-analyzed, a fundamental frequency different from the one used for synthesis may be obtained.
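The mismatch between synthesis and re-analysis can be reproduced with a toy experiment. The sketch below is an assumption-laden illustration, not one of the related art extractors: it synthesizes a sine wave at a chosen fundamental frequency and re-analyzes it with a plain integer-lag autocorrelation peak picker. The sampling frequency, search range, and signal length are hypothetical values chosen for the example.

```python
import math


def synthesize_sine(f0_hz, fs_hz, n_samples):
    """Synthesize a sine wave at exactly f0_hz (stand-in for a synthesizer)."""
    return [math.sin(2.0 * math.pi * f0_hz * n / fs_hz) for n in range(n_samples)]


def estimate_f0_autocorr(x, fs_hz, f_min_hz=80.0, f_max_hz=400.0):
    """Estimate F0 as fs / lag, where lag maximizes the autocorrelation.

    Assumption: only integer lags are searched, with no interpolation,
    so the estimate is quantized to fs / integer.
    """
    lag_min = int(fs_hz / f_max_hz)
    lag_max = int(fs_hz / f_min_hz)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        value = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs_hz / best_lag


fs = 8000.0       # hypothetical sampling frequency in Hz
f0_synth = 204.0  # fundamental frequency used for synthesis
signal = synthesize_sine(f0_synth, fs, 2000)
f0_reanalyzed = estimate_f0_autocorr(signal, fs)
print(f"synthesized at {f0_synth:.2f} Hz, re-analyzed as {f0_reanalyzed:.2f} Hz")
```

With these values the true period is about 39.2 samples, so an integer-lag estimator can only return values such as 8000/39 Hz or 8000/40 Hz, neither of which equals the 204 Hz used for synthesis.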
In the related art techniques, since the physical attributes are not systematically associated with aperiodicity, the influence of temporal changes in the fundamental frequency and in the spectrum may be extracted as an aperiodic component, and as a result, an accurate value for synthesis may not be obtained.