1. Field of the Invention
The invention relates to a speech synthesis system and a method of synthesizing speech, and more particularly, to a speech segment coding method and a pitch control method which significantly improve the quality of the synthesized speech.
The principle of the present invention can be applied directly not only to speech synthesis but also to the synthesis of other sounds, such as the sounds of musical instruments or singing, each of which has properties similar to those of speech, and to very low rate speech coding or speech rate conversion. The present invention is described below with a focus on speech synthesis.
Several speech synthesis methods exist for implementing a text-to-speech synthesis system, which can synthesize an unlimited vocabulary by converting text, that is, character strings, into speech. The method that is easiest to implement and most generally utilized is the segmental speech synthesis method, also called the synthesis-by-concatenation method. In this method, human speech is sampled and divided into phonetic units, such as demisyllables or diphones, to obtain short speech segments, which are then coded and stored in memory. When text is input, it is converted into a phonetic transcription. The speech segments corresponding to the phonetic transcription are then sequentially retrieved from the memory and decoded to synthesize the speech corresponding to the input text.
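The retrieval-and-concatenation flow described above can be illustrated with a minimal sketch. The segment inventory, the diphone naming scheme, and the function names are hypothetical, and raw sample lists stand in for coded segments, so the decoding step is a no-op here.

```python
from typing import Dict, List

# Hypothetical inventory: diphone name -> stored samples.
# A real system would store coded segments and decode them on retrieval.
SEGMENTS: Dict[str, List[float]] = {
    "#-h": [0.0, 0.1],
    "h-a": [0.2, 0.3, 0.2],
    "a-i": [0.1, 0.0],
    "i-#": [-0.1, 0.0],
}


def to_diphones(phonemes: List[str]) -> List[str]:
    """Convert a phoneme sequence into diphone names, padded with silence '#'."""
    padded = ["#"] + phonemes + ["#"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def synthesize(phonemes: List[str]) -> List[float]:
    """Retrieve the segment for each diphone in order and concatenate the samples."""
    output: List[float] = []
    for name in to_diphones(phonemes):
        output.extend(SEGMENTS[name])  # decoding is a no-op in this sketch
    return output


samples = synthesize(["h", "a", "i"])
```

Plain concatenation of this kind leaves the pitch and duration of each segment fixed, which is why the choice of segment coding method matters for the rest of the discussion.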
In this type of segmental speech synthesis method, one of the most important factors governing the quality of the synthesized speech is the method used to code the speech segments. In prior art segmental speech synthesis systems, a vocoding method of low speech quality is mainly used to code the stored speech segments, and this is one of the most important causes of the low quality of the synthesized speech. A brief description of the prior art speech segment coding methods follows.
Speech coding methods can be broadly classified into waveform coding methods, which offer good speech quality, and vocoding methods, which offer low speech quality. Because a waveform coding method attempts to preserve the speech waveform as it is, changing the pitch frequency and duration is very difficult, so that the intonation and rate of speech cannot be adjusted when performing speech synthesis. Nor can the speech segments be joined to one another smoothly. The waveform coding method is therefore fundamentally unsuitable for coding speech segments.
In contrast, when a vocoding method (also called an analysis-synthesis method) is used, the pitch pattern and the duration of a speech segment can be changed arbitrarily. Furthermore, the speech segments can be joined smoothly by interpolating the spectral envelope estimation parameters. Because these properties make the vocoding method suitable as the coding means for text-to-speech synthesis, vocoding methods such as linear predictive coding (LPC) or formant vocoding are adopted in most present speech synthesis systems. However, because the quality of the decoded speech is low when speech is coded with a vocoding method, the synthesized speech obtained by decoding and concatenating the stored speech segments cannot have better quality than the vocoding method itself offers.
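As an illustration of how interpolating spectral envelope parameters smooths a segment boundary, the following sketch linearly interpolates between the last parameter vector of one segment and the first of the next. The function name and the choice of linear interpolation are assumptions made for illustration. In practice, representations such as reflection coefficients or line spectral pairs are often interpolated instead of direct predictor coefficients, because interpolated predictor coefficients are not guaranteed to yield a stable synthesis filter.

```python
from typing import List


def interpolate_params(
    start: List[float], end: List[float], n_frames: int
) -> List[List[float]]:
    """Linearly interpolate from `start` to `end` over `n_frames` frames.

    The first returned frame equals `start` and the last equals `end`.
    """
    if n_frames < 2:
        raise ValueError("n_frames must be at least 2")
    if len(start) != len(end):
        raise ValueError("parameter vectors must have the same length")
    frames = []
    for k in range(n_frames):
        t = k / (n_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return frames


# Hypothetical two-parameter envelopes on either side of a segment boundary.
transition = interpolate_params([0.0, 1.0], [1.0, 0.0], 3)
```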
Attempts made so far to improve the speech quality offered by the vocoding method have replaced the impulse train excitation with an excitation signal that has a less artificial waveform. One such attempt used a waveform less peaky than the impulse, for example a triangular waveform, a half-circle waveform, or a waveform similar to a glottal pulse. Another attempt selected a sample pitch pulse from one or several pitch periods of the residual signal obtained by inverse filtering and used that one sample pulse, instead of the impulse, for the entire time period or for a substantially long time period. However, such attempts to replace the impulse with an excitation pulse of another waveform have improved the speech quality only slightly, if at all, and have never produced synthesized speech with a quality approximating that of natural speech.
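The difference between the two kinds of excitation can be sketched as follows. This is an illustrative example, not a reconstruction of any particular prior art system: the function names, the pulse placement at the start of each pitch period, and the filter sign convention A(z) = 1 + a1 z^-1 + ... are all assumptions.

```python
from typing import List


def impulse_train(n: int, period: int) -> List[float]:
    """One unit impulse at the start of every pitch period."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def triangular_train(n: int, period: int, width: int) -> List[float]:
    """A triangular pulse spanning `width` samples at the start of every period."""
    half = width / 2
    out = [0.0] * n
    for i in range(n):
        p = i % period  # position within the current pitch period
        if p < width:
            out[i] = 1.0 - abs(p - half) / half
    return out


def all_pole_filter(excitation: List[float], a: List[float]) -> List[float]:
    """Synthesis filter 1/A(z): y[n] = x[n] - sum_k a[k-1] * y[n-k]."""
    y: List[float] = []
    for n, x in enumerate(excitation):
        acc = x
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y


# Same hypothetical one-pole filter driven by the two excitation types.
voiced_impulse = all_pole_filter(impulse_train(16, 8), [-0.5])
voiced_triangle = all_pole_filter(triangular_train(16, 8, 4), [-0.5])
```

In both cases a single fixed pulse shape is repeated every period, so the period-to-period variation present in natural speech is absent from the excitation.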
It is the object of the present invention to synthesize high quality speech having naturalness and intelligibility of the same degree as those of human speech by utilizing a novel speech segment coding method that enables both good speech quality and pitch control. The method of the present invention combines the merits of the waveform coding method, which provides good speech quality but no ability to control the pitch, and the vocoding method, which provides pitch control but low speech quality.
The present invention utilizes two methods. The first is a periodic waveform decomposition method, a coding method which decomposes the signal in a voiced sound sector of the original speech into wavelets, each equivalent to a one-period speech waveform made by a glottal pulse, and which codes and stores the decomposed signal. The second is a time warping-based wavelet relocation method, a waveform synthesis method capable of arbitrary adjustment of the duration and pitch frequency of the speech segment while maintaining the quality of the original speech. It does so by selecting, from among the stored wavelets, the wavelets nearest to the positions where wavelets are to be placed, and then decoding the selected wavelets and superposing them. For purposes of this invention, musical sounds are treated as voiced sounds.
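The selection and superposition steps of the wavelet relocation idea can be sketched as below. This is a simplified illustration under stated assumptions, not the detailed method of the invention: the wavelets are given as already decoded sample lists, the time warping is a single linear scale factor, and the function and parameter names are hypothetical.

```python
from typing import List


def relocate_wavelets(
    wavelets: List[List[float]],
    orig_positions: List[int],
    new_positions: List[int],
    time_scale: float,
    out_len: int,
) -> List[float]:
    """Place the nearest stored wavelet at each new pitch-pulse position.

    wavelets[i] is the one-period wavelet that started at sample
    orig_positions[i] in the original speech. time_scale is the ratio of
    output duration to original duration (linear warping is assumed).
    """
    out = [0.0] * out_len
    for pos in new_positions:
        src_time = pos / time_scale  # warp back onto the original time axis
        nearest = min(
            range(len(orig_positions)),
            key=lambda i: abs(orig_positions[i] - src_time),
        )
        for k, sample in enumerate(wavelets[nearest]):
            if pos + k < out_len:
                out[pos + k] += sample  # superpose overlapping wavelets
    return out


# Two stored wavelets, originally 4 samples apart, relocated to a
# 3-sample pitch period (higher pitch) with the duration unchanged.
result = relocate_wavelets(
    wavelets=[[1.0, 0.5], [2.0, 1.0]],
    orig_positions=[0, 4],
    new_positions=[0, 3, 6],
    time_scale=1.0,
    out_len=8,
)
```

Changing the spacing of `new_positions` alters the pitch, while changing `time_scale` alters the duration. Because the wavelets themselves are reused unmodified, the waveform detail of the original speech is carried into the output.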
The preceding objects should be construed as merely presenting a few of the more pertinent features and applications of the invention. Many other beneficial results can be obtained by applying the disclosed invention in a different manner or modifying the invention within the scope of the disclosure. Accordingly, other objects and a fuller understanding of the invention may be had by referring to both the summary of the invention and the detailed description, below, which describe the preferred embodiment in addition to the scope of the invention defined by the claims considered in conjunction with the accompanying drawings.