1. Field of the Invention
The present invention relates to a text-to-speech conversion system (hereinafter referred to as TTS) for interlocking with multimedia and a method for organizing input data of the same, and more particularly to a TTS for interlocking with multimedia and a method for organizing input data of the same that enhance the naturalness of synthesized speech and accomplish synchronization between multimedia and the TTS by defining additional prosody information, the information required to interlock the TTS with multimedia, and an interface between this information and the TTS for use in the production of the synthesized speech.
2. Description of the Related Art
Generally, the function of a speech synthesizer is to provide information in different forms to a person using a computer. To this end, the speech synthesizer should provide the user with high-quality synthesized speech from a given text. In addition, to interlock with a database produced in a multimedia environment, such as a moving picture or animation, or with a variety of media provided by a conversation counterpart, the speech synthesizer should produce synthesized speech that is synchronized with these media. In particular, the synchronization of TTS with multimedia is essential to provide the user with high-quality service.
As shown in FIG. 1, a conventional TTS typically goes through the following three steps until synthesized speech is produced from an input text.
In a first step, a language processor 1 converts the text into a series of phonemes, estimates prosody information, and symbolizes this information. The symbols of the prosody information are estimated from the boundaries of phrases and paragraphs, the location of the accent in a word, the sentence pattern, and so on, using the result of syntactic analysis.
In a second step, a prosody processor 2 calculates the values of prosody control parameters from the symbolized prosody information using rules and tables. The prosody control parameters include phoneme duration, pitch contour, energy contour, and pause interval information.
In a third step, a signal processor 3 produces synthesized speech using a synthesis unit database 4 and the prosody control parameters. In other words, the conventional TTS must estimate the information associated with naturalness and speech rate in the language processor 1 and the prosody processor 2 from the input text alone.
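For illustration only, the three steps described above can be sketched as a small pipeline. This is a toy sketch under stated assumptions, not the implementation of any actual TTS: the names `language_processor`, `prosody_processor`, `signal_processor`, `RULE_TABLE`, and `BASE_DURATION_MS` are hypothetical, each letter is treated as one phoneme, and the rule-table values are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class ProsodyParams:
    phoneme: str
    duration_ms: int
    pitch_hz: float
    energy: float
    pause_ms: int

# Hypothetical rule table: prosody symbol -> (duration scale, pitch in Hz, pause in ms).
RULE_TABLE = {
    "normal": (1.0, 120.0, 0),
    "phrase_final": (1.5, 100.0, 200),
}
BASE_DURATION_MS = 80  # assumed base phoneme duration

def language_processor(text):
    """Step 1: convert text to (phoneme, prosody symbol) pairs.

    Toy assumption: every letter is one phoneme, and the last letter of a
    word that ends in punctuation marks a phrase boundary.
    """
    symbols = []
    for word in text.split():
        at_boundary = word[-1] in ".,?!"
        letters = [c for c in word.lower() if c.isalpha()]
        for i, ch in enumerate(letters):
            is_last = i == len(letters) - 1
            symbol = "phrase_final" if at_boundary and is_last else "normal"
            symbols.append((ch, symbol))
    return symbols

def prosody_processor(symbols):
    """Step 2: look up prosody control parameters from the rule table."""
    params = []
    for phoneme, symbol in symbols:
        scale, pitch, pause = RULE_TABLE[symbol]
        duration = int(BASE_DURATION_MS * scale)
        params.append(ProsodyParams(phoneme, duration, pitch, 1.0, pause))
    return params

def signal_processor(params, unit_db):
    """Step 3: pair each phoneme with a synthesis unit from the database.

    A real signal processor would generate a waveform; this sketch only
    returns the selected units with their control parameters.
    """
    return [
        (unit_db.get(p.phoneme, "<sil>"), p.duration_ms, p.pitch_hz, p.pause_ms)
        for p in params
    ]

if __name__ == "__main__":
    unit_db = {ch: f"unit_{ch}" for ch in "abcdefghijklmnopqrstuvwxyz"}
    symbols = language_processor("Hi, you.")
    params = prosody_processor(symbols)
    for unit in signal_processor(params, unit_db):
        print(unit)
```

Note that every prosody value in this sketch is derived from the text alone, which is the limitation of the conventional arrangement pointed out above.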
Further, the conventional TTS has only the simple function of outputting data, input sentence by sentence, as synthesized speech. Accordingly, in order to output sentences stored in a file or sentences input through a communication network as synthesized speech in succession, a main control program is required that reads sentences from the input data and transmits them to the input of the TTS. Such main control programs include a method of separating the text from the input data and then outputting the synthesized speech once from beginning to end, a method of producing the synthesized speech in interlock with a text editor, a method of looking up sentences through a graphic interface and producing the synthesized speech, and so on; however, the objects to which these methods are applicable are restricted to text.
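As a minimal sketch of such a main control program, the following reads sentences from input data and passes them one at a time to a TTS. The function names `split_sentences` and `main_control` and the callable `tts` are hypothetical, and sentence boundaries are assumed to be a period, question mark, or exclamation mark followed by whitespace.

```python
import re

def split_sentences(data):
    """Split input data into sentences at '.', '?', or '!' followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", data.strip())
    return [p for p in parts if p]

def main_control(data, tts):
    """Feed each sentence to the TTS in order and collect its outputs.

    `tts` is any callable that accepts one sentence; it stands in for the
    sentence-level input of a conventional TTS.
    """
    return [tts(sentence) for sentence in split_sentences(data)]

if __name__ == "__main__":
    text = "Hello there. How are you? Fine!"
    # A stand-in TTS that only reports what it was asked to speak.
    for line in main_control(text, lambda s: f"speaking: {s}"):
        print(line)
```

Such a program handles only text, consistent with the restriction noted above.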
At present, studies on TTS have advanced considerably for the native languages of different countries, and commercial use has been achieved in some countries. However, such use is limited to the synthesis of speech from input text. In addition, with the prior organization, the information required when a moving picture is to be dubbed by use of TTS, or when natural interlock between the synthesized speech and multimedia such as animation is to be implemented, cannot be estimated from the text alone, so there is no method to realize these functions. Furthermore, there are no results of studies on the use of additional data to enhance the naturalness of synthesized speech or on the organization of such data.