1. Field of the Invention
The present invention relates to a system for synchronization between moving pictures and a text-to-speech (TTS) converter, and more particularly to a system for synchronization between moving pictures and a text-to-speech converter which can synchronize moving pictures with synthesized speech by using the movement time of the lips and the duration of speech information.
2. Description of the Related Art
In general, a speech synthesizer provides a user with various types of information in an audible form. For this purpose, the speech synthesizer should produce high-quality synthesized speech from the given input text. In addition, in order for the speech synthesizer to be operatively coupled to a database constructed in a multi-media environment, or to various media provided by a counterpart involved in a conversation, the speech synthesizer should be able to generate synthesized speech that is synchronized with these media. In particular, synchronization between moving pictures and the TTS is essential to providing a user with a high-quality service.
FIG. 1 shows a block diagram of a conventional text-to-speech converter, which generally generates synthesized speech from input text in three steps.
At step 1, a language processing unit 1 converts an input text to a phoneme string, estimates prosodic information, and symbolizes it. The symbols of the prosodic information are estimated from the phrase boundary, clause boundary, accent position, sentence pattern, etc. by analyzing the syntactic structure. At step 2, a prosody processing unit 2 calculates the values of the prosody control parameters from the symbolized prosodic information by using rules and tables. The prosody control parameters include phoneme duration and pause interval information. At step 3, a signal processing unit 3 generates synthesized speech by using a synthesis unit DB 4 and the prosody control parameters. That is, the conventional synthesizer must estimate the prosodic information related to naturalness and speaking rate from the input text alone, in the language processing unit 1 and the prosody processing unit 2.
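The three steps above can be pictured with a minimal sketch. Everything in it is an illustrative assumption rather than the converter of FIG. 1: the function names, the letter-as-phoneme simplification, the duration table, and the sample rate are all hypothetical, and step 3 only counts samples instead of producing a waveform from a synthesis unit DB.

```python
from dataclasses import dataclass

# Hypothetical values (milliseconds); a real system derives these from
# rules and tables built on speech data.
DURATION_MS = {"vowel": 90, "consonant": 60}
PAUSE_MS = 200
VOWELS = set("aeiou")
BOUNDARY = "#"  # hypothetical symbol for a phrase or clause boundary


@dataclass
class Segment:
    symbol: str       # phoneme symbol, or BOUNDARY for a pause
    duration_ms: int  # prosody control parameter for this segment


def language_processing(text):
    """Step 1 (toy): text -> phoneme string with symbolized prosodic info.

    Assumption: each letter stands in for one phoneme, and punctuation
    is symbolized as a boundary.
    """
    symbols = []
    for ch in text.lower():
        if ch.isalpha():
            symbols.append(ch)
        elif ch in ",.?!":
            symbols.append(BOUNDARY)
    return symbols


def prosody_processing(symbols):
    """Step 2 (toy): symbols -> phoneme durations and pause intervals."""
    segments = []
    for s in symbols:
        if s == BOUNDARY:
            segments.append(Segment(s, PAUSE_MS))
        elif s in VOWELS:
            segments.append(Segment(s, DURATION_MS["vowel"]))
        else:
            segments.append(Segment(s, DURATION_MS["consonant"]))
    return segments


def signal_processing(segments, sample_rate=16000):
    """Step 3 (stand-in): return the sample count each segment would occupy."""
    return [seg.duration_ms * sample_rate // 1000 for seg in segments]


def synthesize(text):
    return signal_processing(prosody_processing(language_processing(text)))
```

Note that every duration in this sketch is derived from the text alone, which is the limitation the following paragraphs discuss.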
Presently, much research on TTS has been conducted throughout the world for application to each country's native language, and some countries have already started commercial services. However, the conventional synthesizer is aimed at synthesizing speech from input text, and thus there has been no research activity on a synthesizing method which can be used in connection with multi-media. In addition, when dubbing is performed on moving pictures or animation by using the conventional TTS method, the information required to synchronize the media with the synthesized speech cannot be estimated from the text alone. Thus, it is not possible to generate synthesized speech which is smoothly and operatively coupled to moving pictures from text information only.
If the synchronization between moving pictures and synthesized speech is regarded as a kind of dubbing, there can be three implementation methods. The first method synchronizes the moving pictures with the synthesized speech on a sentence basis. This method regulates the time duration of the synthesized speech by using information on the start point and end point of the sentence. It has the advantage that it is easy to implement and requires minimal additional effort. However, smooth synchronization cannot be achieved with this method.
As an alternative, there is a method wherein the start point, end point, and phoneme symbol of every phoneme are transcribed for the interval of the moving pictures that contains a speech signal, and this information is used in generating the synthesized speech. Since the moving pictures can be synchronized with the synthesized speech phoneme by phoneme, accuracy is enhanced. However, this method has the disadvantage that additional effort is required to detect and record time duration information for every phoneme in a speech interval of the moving pictures.
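The difference between the two methods described above can be illustrated with a short sketch. The function names and the record layout are hypothetical, and uniform scaling is only one assumed way of regulating the sentence duration; the text does not specify how the regulation is performed.

```python
def fit_to_sentence(durations_ms, start_ms, end_ms):
    """Sentence-basis method (assumed uniform scaling).

    Stretch or compress the text-derived phoneme durations so that
    their sum equals the sentence interval in the moving picture.
    """
    target = end_ms - start_ms
    total = sum(durations_ms)
    if target <= 0 or total <= 0:
        raise ValueError("sentence interval and durations must be positive")
    scale = target / total
    return [d * scale for d in durations_ms]


def durations_from_transcript(records):
    """Phoneme-basis method.

    Each record is a hypothetical (symbol, start_ms, end_ms) tuple
    transcribed from the speech interval of the moving picture, so the
    duration of every phoneme is taken directly from the picture.
    """
    return [(symbol, end - start) for symbol, start, end in records]
```

With the sentence-basis function only the total length matches the picture, so individual phonemes may still drift against the lip motion. The phoneme-basis function removes that drift but needs one transcribed record per phoneme.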
As another alternative, there is a method wherein synchronization information is recorded based on patterns by which lip motion can be easily distinguished, such as the start and end points of the speech, the opening and closing of the lips, protrusion of the lips, etc. This method can enhance the efficiency of synchronization while minimizing the additional effort required to prepare the synchronization information.
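As a rough illustration of what such synchronization information might look like, the sketch below records a few lip-pattern marks and derives the time between consecutive marks. The type names, field names, and event labels are assumptions made for this example; the text does not define a record format.

```python
from dataclasses import dataclass
from enum import Enum


class LipEvent(Enum):
    # Hypothetical labels for the easily distinguished patterns.
    SPEECH_START = "speech_start"
    SPEECH_END = "speech_end"
    LIP_OPEN = "lip_open"
    LIP_CLOSE = "lip_close"
    LIP_PROTRUSION = "lip_protrusion"


@dataclass(frozen=True)
class SyncMark:
    event: LipEvent
    time_ms: int  # position of the event in the moving picture


def intervals(marks):
    """Return (from_event, to_event, duration_ms) for consecutive marks.

    Marks are sorted by time first, so they may be recorded in any order.
    """
    ordered = sorted(marks, key=lambda m: m.time_ms)
    return [
        (a.event, b.event, b.time_ms - a.time_ms)
        for a, b in zip(ordered, ordered[1:])
    ]
```

Only a handful of marks per utterance are needed here, compared with one record per phoneme in the phoneme-basis method, which is where the reduction in additional effort comes from.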