This application claims the priority of Korean Patent Application No. 2001-67623, filed Oct. 31, 2001, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Disclosure
The present invention relates to a speech synthesis system, and more particularly, to a system and method for synthesizing speech in which a smoothing technique is applied to the transition portion between concatenated speech units of the synthesized speech, thereby preventing a discontinuous distortion at the transition portion.
2. Description of the Related Art
In general, a Text-to-Speech (hereinafter referred to as “TTS”) system is a type of speech synthesis system in which a user enters text, optionally in a computer document, and a computer or the like automatically creates a spoken version of the text so that its contents can be read aloud to other users. Such a TTS system is widely used in application fields such as an automatic information system (AIS), which is one of the key technologies for implementing conversation between a human being and a machine. Since the corpus-based TTS, which is based on a large-capacity database, was introduced in the 1990s, TTS systems have been used to create synthesized speech closer to human speech. Further, improved performance of prosody prediction methods to which a data-driven technique is applied has resulted in the creation of more animated speech.
However, despite this technological development, a discontinuity still occurs at the transition portion between the concatenated speech units of synthesized speech. A speech synthesis system basically concatenates small speech segments according to a sequence of speech units, such as phonemes, to form a complete speech signal, thereby producing concatenative speech. Accordingly, when adjacent speech segments have different characteristics, the output speech may sound distorted. Such an audible distortion may take the form of a trembling of the speech due to rapid fluctuations and discontinuity in the spectrum, an unnatural change in the prosody (i.e., the pitch and duration) of a speech unit, or an alteration in the magnitude of the waveform.
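The three kinds of mismatch described above can be illustrated with a minimal sketch. It is not part of the disclosure: the `Boundary` record, its fields, and the choice of Euclidean distance and a decibel level difference are assumptions made only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Boundary:
    """Hypothetical features measured at the edge of a speech unit."""
    spectrum: Sequence[float]  # assumed spectral-envelope values at the edge
    pitch_hz: float            # assumed pitch estimate at the edge
    rms: float                 # assumed RMS level of the waveform at the edge


def boundary_mismatch(left: Boundary, right: Boundary) -> Dict[str, float]:
    """Quantify how strongly two adjacent units differ at their join."""
    # Spectral discontinuity: Euclidean distance between the two envelopes.
    spectral = math.sqrt(
        sum((a - b) ** 2 for a, b in zip(left.spectrum, right.spectrum))
    )
    # Prosodic discontinuity: absolute jump in pitch across the join.
    pitch_jump = abs(right.pitch_hz - left.pitch_hz)
    # Waveform-level discontinuity: absolute level change in decibels.
    level_jump_db = abs(20.0 * math.log10(right.rms / left.rms))
    return {
        "spectral": spectral,
        "pitch_jump_hz": pitch_jump,
        "level_jump_db": level_jump_db,
    }
```

Large values in any of the three entries would correspond to the audible distortions named above; the thresholds at which a listener notices them are not specified here.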
Two methods are used to remove the discontinuity that occurs at the transition portion between the concatenated speech units of synthesized speech. In the first method, the difference in characteristics between the speech units to be concatenated is measured in advance during the selection of speech units, and the speech units are then selected such that the difference is minimized. In the second method, a smoothing technique is applied to the transition portion between the concatenated speech units of the synthesized speech.
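The first method can be sketched as a search over candidate units that minimizes the accumulated difference at each join. The following is a minimal illustration under stated assumptions, not the method of any particular system: each candidate is a hypothetical dictionary with `"start"` and `"end"` feature vectors, the difference measure is a plain Euclidean distance, and the search is a standard dynamic-programming pass.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]
Unit = Dict[str, Vector]  # hypothetical: {"start": [...], "end": [...]}


def join_cost(left: Vector, right: Vector) -> float:
    """Assumed difference measure: Euclidean distance at the join."""
    return sum((a - b) ** 2 for a, b in zip(left, right)) ** 0.5


def select_units(candidates: List[List[Unit]]) -> List[int]:
    """Return one candidate index per position, minimizing total join cost."""
    # best[i][j]: minimal accumulated cost of a path ending at candidate j
    # of position i. The first position has no preceding join.
    best = [[0.0] * len(candidates[0])]
    back: List[List[int]] = []
    for i in range(1, len(candidates)):
        row_cost, row_back = [], []
        for cur in candidates[i]:
            costs = [
                best[i - 1][k] + join_cost(prev["end"], cur["start"])
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row_cost.append(costs[k_min])
            row_back.append(k_min)
        best.append(row_cost)
        back.append(row_back)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for row_back in reversed(back):
        j = row_back[j]
        path.append(j)
    return path[::-1]


# Usage: three positions with two candidates each.
example = [
    [{"start": [0.0], "end": [1.0]}, {"start": [0.0], "end": [5.0]}],
    [{"start": [4.9], "end": [2.0]}, {"start": [1.1], "end": [9.0]}],
    [{"start": [2.1], "end": [0.0]}, {"start": [9.0], "end": [0.0]}],
]
chosen = select_units(example)
```

A search of this kind only chooses among recorded units; it does not alter them, which is why a residual discontinuity can remain when no well-matched candidate exists.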
Steady research has been conducted on the first method, and recently a technique for minimizing discontinuous distortion that reflects the characteristics of human hearing has been developed and successfully applied to TTS. On the other hand, the second method has not been researched as actively as the first. The reasons are that the smoothing technique is regarded as a more important factor in speech coding technology than in speech synthesis based on signal processing technology, and that the smoothing technique itself may cause a distortion in speech signals.
Recently, the smoothing methods applied to speech synthesizers have generally been those used in speech coding.
FIG. 1 is a table illustrating the distortion results, in terms of both naturalness and intelligibility, obtained when various smoothing methods applicable to speech coding are applied to speech synthesis, wherein the applied smoothing methods include the WI-base method, the LP-pole method and the continuity effects method.
Referring to FIG. 1, the distortion values for naturalness and intelligibility are smaller when no smoothing method is applied (i.e., no smoothing) than when any of the various smoothing methods is applied, so that speech quality is superior in the case of no smoothing (see CHEN, Stanley F., “A Survey of Smoothing Techniques for ME Models,” 8 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, pp. 37-50 Vol. 8, No. 1, January 2000). Consequently, since applying no smoothing to speech synthesis is more effective than applying these smoothing methods, it is inappropriate to apply a smoothing method designed for a speech coder to a speech synthesizer.
In a speech coder, distortion occurs largely owing to quantization error and the like, and a smoothing method is used to minimize that error. In a speech synthesizer, however, the recorded speech signal itself is used, so the quantization error found in the speech coder does not exist. Instead, distortion occurs due to the erroneous selection of speech units, or rapid fluctuations and discontinuity in the spectrum between speech units. That is, since the causes of distortion differ between the speech coder and the speech synthesizer, a smoothing method applied to the speech coder is not effective in the speech synthesizer.