The present invention relates to a method and apparatus for editing/creating synthetic speech messages and to a recording medium with the method recorded thereon. More particularly, the invention pertains to a speech message editing/creating method that permits easy and fast synthesis of speech messages with desired prosodic features.
Dialogue speech conveys the speaker's mental states, intentions and the like, as well as the linguistic meaning of the spoken dialogue. Such information contained in the speaker's voice, other than the linguistic meaning, is commonly referred to as non-verbal information. The hearer takes in the non-verbal information from the intonation, accents and duration of the utterance being made. As a TTS (Text-To-Speech) message synthesis method, what is called "speech synthesis-by-rule", which converts text to speech, has heretofore been researched and developed. Unlike editing and synthesizing recorded speech, this method places no particular limitations on the output speech and avoids the need for the original speaker's voice when the message is later partially modified. However, since the prosody generation rules used are based on prosodic features of speech made in a recitation tone, the synthesized speech inevitably becomes recitation-type and hence monotonous. In natural conversations, by contrast, the prosodic features of dialogue speech often vary significantly with the speaker's mental states and intentions.
With a view to making speech synthesized by rule sound more natural, attempts have been made to edit the prosodic features, but such editing operations are difficult to automate; conventionally, the user must perform the edits based on experience and knowledge. In such editing it is hard to adopt an arrangement or configuration that allows arbitrary correction of prosodic parameters such as intonation, fundamental frequency (pitch), amplitude value (power) and duration of the utterance unit to be synthesized. Accordingly, it is difficult to obtain a speech message with desired prosodic features by arbitrarily correcting the prosodic or phonological parameters of a portion of the synthesized speech that sounds monotonous and hence recitative.
To facilitate the correction of prosodic parameters, there has also been proposed a method using a GUI (graphical user interface) that displays the prosodic parameters of synthesized speech in graphic form on a display, lets the user visually correct and modify them with a mouse or similar pointing tool, and synthesizes a speech message with desired non-verbal information while the corrections and modifications are confirmed by listening to the synthesized speech output. However, since this method still has the user correct the prosodic parameters themselves, albeit visually, the actual parameter correcting operation requires experience and knowledge of phonetics, and hence is difficult for an ordinary operator.
Each of U.S. Pat. No. 4,907,279 and Japanese Patent Application Laid-Open Nos. 5-307396, 3-189697 and 5-19780 discloses a method that inserts control commands for phonological parameters, such as accents and pauses, in a text and edits synthesized speech through the use of such control commands. With this method, too, the non-verbal information editing operation is still difficult for a person who has no knowledge about the relationship between the non-verbal information and prosody control.
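By way of illustration only, the general idea of embedding accent and pause commands in a text may be sketched as follows. The command syntax (`<accent>`, `<pause N>`), the default pause length and the names `Word`, `Pause` and `parse_commands` are hypothetical and are not taken from any of the cited documents; the sketch merely separates the commands from the words they act on.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Hypothetical inline command syntax (an assumption for this sketch):
#   <accent>     accent the next word
#   <pause 300>  insert a 300 ms pause (number optional)
COMMAND = re.compile(r"<(pause|accent)(?:\s+(\d+))?>")
DEFAULT_PAUSE_MS = 200  # assumed default when no length is given


@dataclass
class Word:
    text: str
    accented: bool = False


@dataclass
class Pause:
    ms: int


def parse_commands(text: str) -> List[Union[Word, Pause]]:
    """Split command-annotated text into words and pauses."""
    items: List[Union[Word, Pause]] = []
    accent_next = False
    pos = 0

    def add_words(chunk: str) -> None:
        nonlocal accent_next
        for w in chunk.split():
            # Only the first word after <accent> is marked.
            items.append(Word(w, accent_next))
            accent_next = False

    for m in COMMAND.finditer(text):
        add_words(text[pos:m.start()])
        if m.group(1) == "pause":
            ms = int(m.group(2)) if m.group(2) else DEFAULT_PAUSE_MS
            items.append(Pause(ms))
        else:
            accent_next = True
        pos = m.end()
    add_words(text[pos:])
    return items


if __name__ == "__main__":
    for item in parse_commands("I <accent>really mean it <pause 300> honestly"):
        print(item)
```

Even in this simplified form, the writer of the text must decide where an accent or a pause belongs and how long the pause should be, which is the knowledge of prosody control that the foregoing paragraph notes an ordinary operator lacks.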