In the field of voice synthesis, methods that produce voice waveforms of synthesized voices based on tagged text are known as effective ways of achieving desired synthesized voices, including voices that express various feelings. The tagged text is composed of text serving as a target of voice synthesis and tag information that is described in a markup language and added to the text. The tag information is used for controlling voice synthesis of the text enclosed by the tags. For example, a voice synthesis engine can obtain a desired synthesized voice by selecting a dictionary used for voice synthesis and adjusting a prosody parameter based on the tag information.
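As an illustration of how tag information could steer an engine, the following is a minimal sketch, not an actual engine implementation. It assumes SSML-style tag names (`speak`, `voice`, `prosody`) and attribute names (`name`, `pitch`, `rate`); the text above does not name a specific markup language. The function `extract_segments` and the dictionary name `happy_dictionary` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged text. The SSML-style tag and attribute names are an
# assumption made for illustration only.
TAGGED_TEXT = (
    '<speak>Hello. '
    '<voice name="happy_dictionary">'
    '<prosody pitch="+20%" rate="fast">I passed the exam!</prosody>'
    '</voice></speak>'
)


def extract_segments(tagged_text):
    """Return (text, settings) pairs for each run of text.

    `settings` holds the dictionary and prosody values that the enclosing
    tags put into effect for that run of text.
    """
    root = ET.fromstring(tagged_text)
    segments = []

    def walk(node, inherited):
        # Copy so that settings from this tag do not leak to sibling tags.
        settings = dict(inherited)
        if node.tag == "voice":
            settings["dictionary"] = node.get("name")
        elif node.tag == "prosody":
            for key in ("pitch", "rate"):
                if node.get(key) is not None:
                    settings[key] = node.get(key)

        # Text directly inside this tag, before its first child.
        if node.text and node.text.strip():
            segments.append((node.text.strip(), settings))

        for child in node:
            walk(child, settings)
            # Text after a child's closing tag is still enclosed by `node`.
            if child.tail and child.tail.strip():
                segments.append((child.tail.strip(), settings))

    walk(root, {})
    return segments


if __name__ == "__main__":
    for text, settings in extract_segments(TAGGED_TEXT):
        print(repr(text), settings)
```

In this sketch, text outside any `voice` or `prosody` tag carries no settings, while the enclosed text carries the dictionary name and the prosody values that a synthesis engine would then apply.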
A user can produce the tagged text by adding the tag information to text using an editor. This approach, however, requires the user to perform complicated work. It is therefore common to produce the tagged text by applying a template prepared in advance to the text serving as the target of voice synthesis.
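Applying a template prepared in advance can be sketched as follows. This is only one possible form, assuming that a template is a tag pattern with a single placeholder for the target text; the name `JOY_TEMPLATE`, the function `apply_template`, and the SSML-style `prosody` tag are hypothetical and not taken from the text above.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical template prepared in advance: a tag pattern with a
# placeholder ($text) for the text serving as the target of voice synthesis.
JOY_TEMPLATE = Template('<prosody pitch="+20%" rate="fast">$text</prosody>')


def apply_template(template, text):
    """Wrap plain text in the tags defined by the template.

    The text is XML-escaped first so that characters such as '&' or '<'
    do not break the resulting markup.
    """
    return template.substitute(text=escape(text))


if __name__ == "__main__":
    print(apply_template(JOY_TEMPLATE, "Tom & I won!"))
```

One template of this kind covers one combination of tag information, so each additional expression or parameter setting requires its own template.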
This conventional approach, however, requires enormous man-hours for preparation because a large number of templates need to be produced for various types of tag information. A technique is available that automatically produces templates by machine learning. This technique, however, additionally requires the preparation of training data and correct answer data for the machine learning, which again involves complicated work. It is therefore desired to establish a new mechanism for efficiently producing the tagged text.