In video contents used in TV programs, etc., synthetic speech has started to take the place of speech recorded by narrators or voice actors, for example in narrations, animated cartoons, and dubbed foreign films.
In the production of video contents, a video editing operation called “post-production” is carried out, in which filmed video materials and recorded audio materials are edited and compiled into one work. At present, post-production is usually a non-linear editing operation performed on a computer. Video materials and audio materials placed on a memory device such as a hard disk are added, deleted, revised, and rearranged non-linearly, using hardware controlled by video editing software (hereinafter referred to as a “video editing system”). This enables efficient production of video contents. In an editing operation using the video editing system, a producer has to arrange video and audio at desired time positions while synchronizing the video and the audio with each other. Several methods, devices, and programs are known for synchronizing video and audio in the case where synthetic speeches are used as audio materials (see, for example, Patent Documents 1 to 3).
Patent Document 1 discloses a speech synthesis controlling device capable of easily synchronizing a synthetic speech with a video signal at a predetermined display time. This speech synthesis controlling device controls the start of speech synthesis by obtaining, as a speech start timing, a speech start position in a text to be read aloud, and a speech start time, and outputting the speech start timing to the speech synthesis device.
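For illustration only, the notion of a speech start timing described above (a start position in the text paired with a start time) can be sketched as a small data structure. This is not the implementation of Patent Document 1; the names `SpeechStartTiming` and `split_by_timings`, the use of character offsets as positions, and the use of seconds as the time unit are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechStartTiming:
    """Hypothetical pairing of a text position with a start time."""
    position: int      # character offset in the text (assumed unit)
    start_time: float  # seconds on the video timeline (assumed unit)


def split_by_timings(text, timings):
    """Return (start_time, segment_text) pairs ordered by text position.

    Each segment runs from its own start position up to the next
    start position, or to the end of the text for the last segment.
    """
    ordered = sorted(timings, key=lambda t: t.position)
    segments = []
    for i, timing in enumerate(ordered):
        end = ordered[i + 1].position if i + 1 < len(ordered) else len(text)
        segments.append((timing.start_time, text[timing.position:end].strip()))
    return segments


if __name__ == "__main__":
    script = "Hello there. Welcome back."
    marks = [SpeechStartTiming(0, 0.0), SpeechStartTiming(13, 5.0)]
    for start, segment in split_by_timings(script, marks):
        print(f"{start:5.1f}s  {segment}")
```

In such a sketch, each pair would be handed to a speech synthesizer so that synthesis of the segment begins at the given time; the synthesizer itself is outside the scope of the example.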
Patent Document 2 discloses a synchronization control device for synchronizing synthetic speeches and videos such as computer graphics with each other. This device generates prosodic data for controlling the prosody of a synthetic speech, based on input data containing text information and action designation information that designates a basic action of a video. Based on the input data and the prosodic data, it also generates video control data containing the action designation information and time information that associates a time with the action designated by the action designation information.
Patent Document 3 discloses an exemplary case where a synthetic speech is used indirectly, when audio materials for video contents are prepared from recorded speeches. In the method disclosed by Patent Document 3, text information of a speech that is to be recorded is added to the video section in which the speech is to be inserted, whereby a synthetic speech is produced. After the duration, pitch, intonation, timbre, utterance timing, etc. of the synthetic speech are processed, the processed synthetic speech is played back to the person who is to utter the speech, in order to indicate the timing for uttering the speech. The speech uttered by the person is compared with the processed synthetic speech as to the agreement therebetween, and whether or not the recorded speech is to be used is determined. A recorded speech that is determined to be used is combined with the video section.
[Patent Document 1] Japanese Laid-open Patent Publication No. 2005-309173
[Patent Document 2] Japanese Laid-open Patent Publication No. 2003-216173
[Patent Document 3] Japanese Laid-open Patent Publication No. 11 (1999)-308565
For example, when a synthetic speech is produced from an input text and is synchronized with a video in a video editing system, a user cannot intuitively know the duration of the speech to be synthesized, and therefore sometimes provides too much or too little text for the desired speech duration. As a result, the speech synthesized from the text is sometimes too long or too short, and it is difficult to synchronize the speech with the video.
In Patent Document 1, a text for which a speech start position and a speech start time are set is prepared, so that the speech is synchronized with a video. In this configuration, for example, if an excessively long text is written for the time section between one speech start position and the next speech start position, a high-speed synthetic speech is produced so that the speech of the text fits in this time section. Conversely, if a short text is written for a time section, a slow, drawn-out speech, or an unnatural speech with many pauses, is produced.
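The effect described above can be made concrete with a rough calculation. The following is a minimal sketch, not taken from any of the cited documents: the function name `required_rate`, the word-based length measure, and the default natural speaking rate of 2.5 words per second are all assumptions chosen for illustration.

```python
def required_rate(text, section_seconds, natural_wps=2.5):
    """Estimate how much a speech must be sped up to fit a time section.

    natural_wps is an assumed natural speaking rate in words per second.
    A result above 1.0 means the speech must be faster than natural;
    a result below 1.0 means it must be slowed down or padded with pauses.
    """
    if section_seconds <= 0 or natural_wps <= 0:
        raise ValueError("section_seconds and natural_wps must be positive")
    words = len(text.split())
    natural_seconds = words / natural_wps  # duration at the natural rate
    return natural_seconds / section_seconds


if __name__ == "__main__":
    narration = "one two three four five six seven eight nine ten"
    for seconds in (2.0, 4.0, 8.0):
        factor = required_rate(narration, seconds)
        print(f"{seconds:4.1f}s section -> rate factor {factor:.2f}")
```

Under these assumptions, a ten-word text needs twice the natural rate to fit a 2-second section and half the natural rate to fill an 8-second section, which corresponds to the high-speed and drawn-out speeches noted above.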
The synchronization control disclosed in Patent Document 2, in which a fixed-duration set of video data is prepared, cannot be adopted in the case where a speech that matches the duration has to be prepared.
According to the method disclosed in Patent Document 3, if a text to be synthesized is too short with respect to a video, only a slack, drawn-out speech is produced, whereas if a text to be synthesized is too long with respect to a video, a high-speed speech inevitably has to be produced. In either case, only an unnatural speech is obtained.
Thus, the conventional techniques provide no mechanism that allows a user to intuitively know the duration of a speech to be synthesized from an input text, which makes it difficult to synchronize a video and a speech with each other. It should be noted that this problem occurs not only in the case where a video and a speech are synchronized, but also, for example, in the case where a user inputs a text for a synthetic speech that is to have a desired duration.