1. Field of the Invention
The present invention relates to speech synthesis, and more particularly, to a method and apparatus for generating a dialog prosody structure capable of expressing the focus of a dialog or the intention of a speaker by using information obtained through discourse analysis between user utterances and system utterances, and a speech synthesis method and system employing the method and apparatus.
2. Description of Related Art
A speech synthesizer is a text-to-speech (TTS) conversion apparatus that converts a character string, i.e., a sentence, into speech, and is used on a variety of platforms such as a personal computer (PC), a personal digital assistant (PDA), and a mobile phone. Speech synthesizers are applied in a variety of fields, including communications, for example in a unified messaging system (UMS) that reads email and text messages aloud, and information retrieval, for example in speech browsing, which outputs a web document, a database (DB) search result, or a system message in the form of speech.
A speech synthesizer generally performs three steps: language processing, prosody generation, and synthesized speech generation. Among these steps, prosody generation means generating information on an utterance phase, a silent interval, a segmental duration, a segmental amplitude, a pitch pattern, and the like, for an input sentence. Here, prosody includes intonation, rhythm, accent, etc., and is a characteristic of speech that expresses meaning, emphasis, emotion, etc., without changing the unique characteristics of the phonemes. Speech lacking even a simple prosody cannot convey meaning precisely and is, moreover, monotonous to listen to.
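The three steps described above can be pictured as a simple pipeline. The following Python sketch is illustrative only: the names `Segment`, `ProsodyInfo`, `language_processing`, `prosody_generation`, and `synthesized_speech_generation` are hypothetical, the "phonemes" are merely letters, and the numeric values are arbitrary placeholders rather than values taken from any actual synthesizer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """Prosody information for one segment (all values are placeholders)."""
    phoneme: str
    duration_ms: float   # segmental duration
    amplitude: float     # segmental amplitude
    pitch_hz: float      # one point of the pitch pattern


@dataclass
class ProsodyInfo:
    segments: List[Segment]
    pause_after_ms: float  # silent interval following the sentence


def language_processing(sentence: str) -> List[str]:
    # Assumption: each alphabetic character stands in for one phoneme.
    return [ch.lower() for ch in sentence if ch.isalpha()]


def prosody_generation(phonemes: List[str], is_question: bool) -> ProsodyInfo:
    # Assumption: flat pitch, rising over the final quarter of a question.
    n = len(phonemes)
    segments = []
    for i, phoneme in enumerate(phonemes):
        position = i / max(n - 1, 1)
        pitch = 120.0
        if is_question and position > 0.75:
            pitch += 40.0 * (position - 0.75) / 0.25
        segments.append(Segment(phoneme, 80.0, 1.0, pitch))
    return ProsodyInfo(segments, pause_after_ms=300.0)


def synthesized_speech_generation(prosody: ProsodyInfo) -> float:
    # Stand-in for waveform generation: returns the total duration in ms.
    return sum(s.duration_ms for s in prosody.segments) + prosody.pause_after_ms


def synthesize(sentence: str) -> Tuple[ProsodyInfo, float]:
    phonemes = language_processing(sentence)
    prosody = prosody_generation(phonemes, sentence.strip().endswith("?"))
    return prosody, synthesized_speech_generation(prosody)
```

In this sketch the phonemes themselves are never altered by the prosody step; only duration, amplitude, pitch, and pauses are attached to them, which mirrors the description of prosody as a characteristic that leaves the phonemes unchanged.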
Among the methods tried so far to generate more natural prosodies, those disclosed in U.S. Patent Publication Nos. 20030163314 and 20030078780, and Japanese Laid-Open Patent Publication Nos. 1995-199981 and 2002-031198 are noteworthy. In U.S. Patent Publication No. 20030163314, the theme of a document is determined according to semantic information, and one speaking style is selected from a group of predefined speaking styles corresponding to themes and is reflected in the prosody. In U.S. Patent Publication No. 20030078780, a prosody characteristic is selected from among prosody characteristics shown in spoken dialog to express the characteristic of a prosody which appears repeatedly in a predetermined part or in the speaking style. In Japanese Laid-Open Patent Publication No. 1995-199981, the speed or emphasis of the speech is considered in determining whether an accented part of a compound word is separated or unified, in order to improve the naturalness and clarity of synthesized speech. In Japanese Laid-Open Patent Publication No. 2002-031198, syntactic interpretation information is obtained not in units of polymorphemes but in units of accent phrases, and prosody information is set accordingly.
However, in the above methods, a prosody structure is set by using only information within a sentence, based on syntactic analysis or semantic analysis. As a result, sentences with identical structures always have prosody structures with identical intonation or emphasis. Accordingly, it is difficult to express the focus of a dialog or the intention of a speaker through the prosody structure, and the naturalness of the synthesized speech is limited.
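The limitation described above can be illustrated with a toy example. In the following Python sketch, the emphasis rule (always stress the last word) and the function names `sentence_only_emphasis` and `respond` are hypothetical and do not correspond to any of the cited publications; the point is only that a rule which looks at the sentence alone yields the same emphasis regardless of what the user asked.

```python
from typing import List


def sentence_only_emphasis(sentence: str) -> List[bool]:
    """Hypothetical sentence-internal rule: emphasize only the last word."""
    words = sentence.split()
    return [i == len(words) - 1 for i in range(len(words))]


def respond(user_utterance: str, system_sentence: str) -> List[bool]:
    # The user utterance is deliberately ignored, as in a synthesizer that
    # relies only on syntactic or semantic analysis of the output sentence.
    return sentence_only_emphasis(system_sentence)


# Two different questions, one identical system sentence.
a = respond("When does the train leave?", "The train leaves at nine")
b = respond("What leaves at nine?", "The train leaves at nine")
```

Here `a` and `b` are the same emphasis pattern, although a human speaker would likely stress "nine" in the first answer and "train" in the second, because the focus of the dialog differs.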