1. Field of the Invention
The present invention relates to a method for analyzing the pitch and power of voices in detail, and to a method and a medium that use the analyzing method to synthesize high-quality voices and to compress and encode voices efficiently.
2. Related Art of the Invention
The object of a voice synthesizing system is to render given contents as voice waveforms. Various methods for synthesizing voices have been devised so far. A representative one is the waveform editing and synthesizing method, which stores voice waveforms in advance in fine units (synthesis units), then selects the units appropriate to the target contents and connects them.
In such a voice synthesizing method, the sense of discontinuity and unnaturalness that arises where units are connected can be reduced by changing the pitch and the time length of each unit, so that voices are synthesized smoothly. One well-known method for changing pitches and time lengths in this way is the PSOLA (Pitch Synchronous Overlap Add) method (F. Charpentier, M. Stella, “Diphone synthesis using an over-lapped technique for voice waveforms concatenation”, Proc. ICASSP, pp. 2015-2018, Tokyo, 1986). In this method, pitch marks are assigned in advance to local peak positions or glottal closures of the unit waveforms, and a pitch waveform is cut out around each pitch-marked position using a window function. The pitch waveforms are then overlap-added at the target pitch intervals to synthesize the voice.
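The windowing and overlap-add steps can be illustrated with a minimal sketch. It is not the implementation from the cited paper: the function names, the Hann window, and the fixed window half-width are assumptions chosen for illustration, and the pitch marks are taken as already given.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def extract_pitch_waveforms(signal, marks, half_width):
    """Cut one windowed segment centred on each pitch mark.

    Samples outside the signal are treated as zero.
    """
    win = hann(2 * half_width + 1)
    segments = []
    for m in marks:
        seg = []
        for k in range(-half_width, half_width + 1):
            idx = m + k
            sample = signal[idx] if 0 <= idx < len(signal) else 0.0
            seg.append(sample * win[k + half_width])
        segments.append(seg)
    return segments

def overlap_add(segments, new_marks, half_width, out_len):
    """Place each segment so that its centre lands on a new mark, and sum."""
    out = [0.0] * out_len
    for seg, m in zip(segments, new_marks):
        for k, value in enumerate(seg):
            idx = m - half_width + k
            if 0 <= idx < out_len:
                out[idx] += value
    return out
```

Placing the segments back on their original marks leaves the waveform between the first and last mark unchanged when the mark spacing equals the window half-width. Placing them on more closely spaced marks raises the pitch, and on more widely spaced marks lowers it.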
Pitch marking methods used for voice synthesis as described above assign pitch marks either to local peaks of the time waveform or to glottal closures. An example of assigning pitch marks to local peaks of the time waveform is given in “Constructing a Waveform Inventory for Text-to-Speech Synthesis Based on Waveform Splicing” (Proc. Autumn Meeting Acoust. Soc. Japan, 3-5-5, 1994-11). The advantage of this method is its simplicity. For complicated voice waveforms that include many high-frequency components, however, it is difficult to assign exactly one pitch mark to each pitch cycle. In addition, the position of the peak itself fluctuates in time because of those high-frequency components. Consequently, the synthesized waveform has a phase fluctuation in each pitch cycle. This gives rise to thick voices, which listeners find uncomfortable.
On the other hand, methods for assigning pitch marks to glottal closures of voice waveforms are introduced by M. Sakamoto et al., “A New Waveform Overlap-Add Technique for Text-to-Speech Synthesis”, Technical Report of IEICE SP95-6 (1995-05), and by Y. Arai et al., “A Study on the Optimal Window Position to Extract Pitch Waveforms Based on a Speech Signal Model”, Proc. Spring Meeting Acoust. Soc. Japan, 1-4-22, 1995-3. In these methods, voice waveforms are analyzed using a wavelet transform method and a linear prediction analysis method to estimate the glottal closure timing, and a pitch mark is assigned at that timing. The glottal closure extracting method has the advantage that exactly one pitch mark can be assigned accurately to each pitch cycle. Since it is equivalent to cutting out the response waveform corresponding to each glottal closure pulse, pitch waveforms can be cut out with less spectral distortion, which makes the method favorable from the viewpoint of waveform extraction. Its drawback is that the analysis needed to estimate glottal closures is complicated.
In addition to those methods, there is also a technology that extracts the fundamental component of a voice using an FIR linear-phase band-pass filter whose pass band is set adaptively around the voice pitch frequency, and partitions the voice waveform into pitch cycles at the zero-cross positions. The technology is introduced in “Fine Pitch Contour Extraction by Voice Fundamental Wave Filtering Method”, Journal of Acoust. Soc. Japan, Vol. 51, No. 7, pp. 509-518, 1995. This method is intended for fine pitch analysis, but it can also be used to find pitch cycles synchronized with the fundamental waveform.
A partitioning point extracted by the above method is not directly related to either the local peaks or the glottal closures of voice waveforms. It is therefore sometimes inappropriate to use such a partitioning point as a pitch mark without modification.
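The fundamental wave filtering and zero-cross partitioning described above can be sketched as follows. This is an illustration under stated assumptions, not the cited paper's implementation: the function names are hypothetical, the pitch frequency is passed in as a known value instead of being tracked adaptively, and the pass band of 0.5 to 1.5 times the pitch, the Hamming window, and the tap count are illustrative choices.

```python
import math

def bandpass_fir(low_hz, high_hz, fs, num_taps):
    """Linear-phase band-pass: difference of two Hamming-windowed sinc low-passes.

    num_taps is assumed to be odd so that the filter has a centre tap.
    """
    mid = num_taps // 2

    def sinc_lp(fc, k):
        return 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)

    taps = []
    for n in range(num_taps):
        k = n - mid
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append((sinc_lp(high_hz / fs, k) - sinc_lp(low_hz / fs, k)) * window)
    return taps

def filter_zero_phase(x, taps):
    """Convolve and compensate the filter delay so the output stays time-aligned."""
    mid = len(taps) // 2
    y = []
    for i in range(len(x)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(x):
                acc += t * x[idx]
        y.append(acc)
    return y

def upward_zero_crossings(y):
    """Indices where the waveform goes from negative to non-negative."""
    return [i for i in range(1, len(y)) if y[i - 1] < 0.0 <= y[i]]

def partition_by_zero_cross(x, fs, pitch_hz, num_taps=321):
    """Return the partitioning points, one per cycle of the fundamental."""
    taps = bandpass_fir(0.5 * pitch_hz, 1.5 * pitch_hz, fs, num_taps)
    fundamental = filter_zero_phase(x, taps)
    return upward_zero_crossings(fundamental)
```

The returned points lie where the fundamental crosses zero, roughly a quarter cycle before each of its peaks, which is why they do not coincide with the local peaks of the original waveform.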
As described above, the method that uses a local peak of the time waveform as a pitch mark has the problem that synthesized voices sound thick, since the pitch mark includes the fluctuation that arises around each peak of the time waveform. The method that uses a glottal closure as a pitch mark has the problem that the processing for estimating glottal closures is complicated. In addition, the method that filters out the fundamental component has the problem that it cannot extract a timing suitable for use as a pitch mark.
Under these circumstances, it is an object of the present invention to provide a method for analyzing voices that can assign pitch marks more simply and more properly than the related art, and a method and a medium for synthesizing voices of higher quality than the related art.
One aspect of the method according to the invention is a method for analyzing voices which generates pitch mark information, assumed to be time reference positions corresponding to pitch cycles of voice waveforms, by using means for storing voice waveforms; means for analyzing pitches; an adaptive filter; and means for detecting peaks, wherein
some of said voice waveforms are stored temporarily using said voice waveform storing means;
rough pitch information is generated from said voice waveforms stored temporarily, by using said pitch analyzing means;
said temporarily stored voice waveforms are input to said adaptive filter, and a cut-off frequency or a center frequency of said adaptive filter is changed according to said rough pitch information so that only the fundamental component of the input voice waveforms is passed; and
plural maximum points are detected on one side of the waveform of said fundamental component by using said peak detecting means, thereby generating a series of accurate pitch mark information for the whole of the voice waveforms.
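The chain of steps in this aspect (rough pitch analysis, a filter whose cut-off follows the rough pitch, and peak detection on one side of the fundamental) can be sketched as below. This is a minimal illustration, not the claimed implementation. The following are all assumptions: the function names, the autocorrelation pitch estimator, the Hamming-windowed sinc low-pass, the cut-off at 1.5 times the rough pitch, and the use of a single rough pitch value for the whole buffer instead of a value that is updated over time.

```python
import math

def rough_pitch_hz(x, fs, fmin=60.0, fmax=400.0):
    """Rough pitch from the largest autocorrelation value in the lag search range."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        val = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return fs / best_lag

def lowpass_fir(cutoff_hz, fs, num_taps):
    """Hamming-windowed sinc low-pass with odd length (linear phase), unity DC gain."""
    mid = num_taps // 2
    fc = cutoff_hz / fs
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]

def filter_zero_phase(x, taps):
    """Convolve and compensate the filter delay so peaks stay time-aligned."""
    mid = len(taps) // 2
    y = []
    for i in range(len(x)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(x):
                acc += t * x[idx]
        y.append(acc)
    return y

def positive_peaks(y):
    """Local maxima on the positive side of the waveform only."""
    return [i for i in range(1, len(y) - 1)
            if y[i] > 0.0 and y[i - 1] < y[i] >= y[i + 1]]

def pitch_marks(x, fs):
    """Return (rough pitch in Hz, list of pitch mark sample indices)."""
    pitch = rough_pitch_hz(x, fs)
    num_taps = 2 * int(2 * fs / pitch) + 1         # about four pitch periods, odd
    taps = lowpass_fir(1.5 * pitch, fs, num_taps)  # cut-off follows the rough pitch
    fundamental = filter_zero_phase(x, taps)
    return pitch, positive_peaks(fundamental)
```

Because the filtered waveform is close to a sinusoid, each cycle has exactly one positive maximum, so the peak detector needs no thresholds or heuristics. Marks within about half a filter length of either end of the buffer are affected by truncation and are best discarded.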
A method of claim 2 is a method for analyzing voices which generates pitch mark information, assumed to be time reference positions corresponding to pitch cycles of voice waveforms, by using plural peak detecting channels, each of which is a set of a fixed low-pass filter and peak detecting means, and means for selecting a channel, wherein
cut-off frequencies of said plural fixed low-pass filters are set so that at least one of said plural fixed low-pass filters passes only the fundamental component of the input voice waveforms;
each of said fixed low-pass filters is used to output waveforms of low frequency components of specified frequencies of the input voice waveforms;
said peak detecting means is used to detect plural maximum points on one side of the waveforms of said low frequency components output from said fixed low-pass filter and to output said detected plural maximum points as peak information;
said channel selecting means is used to select a peak detecting channel at every predetermined period on the basis of a specified selection criterion, using all or some of the peak information output from said plural peak detecting channels; and
a series of pitch mark information is generated for the whole voice waveforms by using the peak information output from said selected peak detecting channel.
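The multi-channel arrangement can be sketched in the same way. The text above does not state the selection criterion, so the rule used here is purely a hypothetical stand-in: in each frame, pick the lowest cut-off channel whose mean peak height is at least a given fraction of the largest mean peak height among the channels. The function names, the filter design, the tap count, and the frame length are likewise assumptions.

```python
import math

def lowpass_fir(cutoff_hz, fs, num_taps):
    """Hamming-windowed sinc low-pass with odd length (linear phase), unity DC gain."""
    mid = num_taps // 2
    fc = cutoff_hz / fs
    taps = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]

def filter_zero_phase(x, taps):
    """Convolve and compensate the filter delay so peaks stay time-aligned."""
    mid = len(taps) // 2
    y = []
    for i in range(len(x)):
        acc = 0.0
        for j, t in enumerate(taps):
            idx = i + mid - j
            if 0 <= idx < len(x):
                acc += t * x[idx]
        y.append(acc)
    return y

def peak_info(y):
    """(position, height) of every local maximum on the positive side."""
    return [(i, y[i]) for i in range(1, len(y) - 1)
            if y[i] > 0.0 and y[i - 1] < y[i] >= y[i + 1]]

def select_channel(frame_peaks, ratio=0.3):
    """Hypothetical criterion: lowest cut-off channel with sufficiently high peaks.

    frame_peaks[c] holds the peaks of channel c in one frame, and the channels
    are ordered by ascending cut-off frequency.
    """
    means = [sum(h for _, h in p) / len(p) if p else 0.0 for p in frame_peaks]
    best = max(means)
    if best <= 0.0:
        return 0
    for c, m in enumerate(means):
        if m >= ratio * best:
            return c

def multichannel_pitch_marks(x, fs, cutoffs, frame_len, num_taps=201):
    """Return (pitch marks, channel index chosen for each frame)."""
    channels = [peak_info(filter_zero_phase(x, lowpass_fir(fc, fs, num_taps)))
                for fc in cutoffs]
    marks, chosen = [], []
    for start in range(0, len(x), frame_len):
        in_frame = [[(p, h) for p, h in ch if start <= p < start + frame_len]
                    for ch in channels]
        c = select_channel(in_frame)
        chosen.append(c)
        marks.extend(p for p, _ in in_frame[c])
    return marks, chosen
```

With fixed filters, no pitch analysis is needed before filtering. A channel whose cut-off lies below the fundamental yields only very small peaks, and a channel whose cut-off lies above the second harmonic may yield several peaks per cycle, so the selector's job is to find the channel between those two cases in each frame.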
Still another aspect of the method according to the invention is a method for synthesizing voices in which phoneme series information, phoneme timing information, pitch information, and amplitude information are generated by analyzing target voice waveforms recorded in advance, and
voices are synthesized according to said phoneme series information, said phoneme timing information, said pitch information, and said amplitude information, wherein said phoneme series information holds types of phonemes and their appearance order in said target voice waveforms;
said pitch information holds information related to a pitch for each specified timing of said target voice waveforms; and
said amplitude information holds information related to an amplitude of each specified timing of said target voice waveforms.
Yet another aspect of the method according to the invention is for synthesizing voices, which synthesizes a specified message by combining regular messages of natural voices and synthesized messages of synthesized voices, wherein
pitch mark information corresponding to said natural voices is assigned in advance;
at least at a connected portion between said regular message and said synthesized message,
pitch waveforms of voice waveforms used for synthesizing voices of said synthesized message are disposed according to said pitch mark information, thereby synthesizing, as a synthesized message, voices of the same contents as those of said regular message; and
both voices having the same contents are superimposed at said connected portion while the mixing rate between them is changed.
Still another aspect of the method according to the invention is for synthesizing voices to generate a specified message by combining a first message and a second message, wherein
pitch waveforms of voice waveforms used for synthesizing said first message are disposed according to pitch mark information corresponding to natural voices recorded in advance for each type of said first message, thereby generating said first message;
at least at a connected portion between said first message and said second message,
voices of the same contents as those of said first message are synthesized as said second message, then
said first and second messages are superimposed at said connected portion while the mixing rate of said first and second messages having the same contents is changed over time.
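The superposition with a time-varying mixing rate, used in this aspect and in the preceding one, amounts to a cross-fade between two waveforms that carry the same contents. A minimal sketch follows. The function name and the linear ramp are assumptions, since the text only requires that the mixing rate change over the connected portion, and the two inputs are assumed to be time-aligned already, for example because both were built on the same pitch marks.

```python
def crossfade(first, second, start, length):
    """Mix two time-aligned waveforms that carry the same contents.

    The output follows `first` before `start`, follows `second` from
    `start + length` onward, and moves linearly from one to the other
    in between.
    """
    n = min(len(first), len(second))
    out = []
    for i in range(n):
        if i < start:
            w = 0.0                      # only the first message
        elif i >= start + length:
            w = 1.0                      # only the second message
        else:
            w = (i - start) / length     # mixing rate rising from 0 toward 1
        out.append((1.0 - w) * first[i] + w * second[i])
    return out
```

For example, `crossfade([1.0] * 10, [3.0] * 10, 2, 4)` returns `[1.0, 1.0, 1.0, 1.5, 2.0, 2.5, 3.0, 3.0, 3.0, 3.0]`. Because both waveforms share the same pitch mark positions, their pitch cycles stay in phase while they are mixed, which avoids cancellation during the transition.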
A medium of claim 44 is for storing a program used to have a computer execute all or some of the steps described in any one of the above inventions.
A medium of claim 45 is for storing a program used to have a computer execute all or some of the steps described in any one of the above inventions.
According to the configurations described above, it is easy, for example, to extract partitioning points corresponding to pitch cycles, since local peaks are detected from sinusoidal waveforms. Furthermore, since peak positions rather than zero-cross points are extracted as partitioning points, pitch marks can be assigned to positions that almost match the local peaks and glottal closure points of the voice waveforms.