1. Field of the Invention
The present invention relates to synthesizing speech from text. In particular, the invention relates to prosodic control which controls intonation and duration of a sentence.
2. Description of the Related Art
In general, text to speech synthesis is performed by the following procedure. First, text to be synthesized is inputted and intermediate phonetic symbol sequences are produced. Then, prosodic parameters and vocal tract transfer functions are acquired on the basis of the intermediate phonetic symbol sequences. The prosodic parameter may be a fundamental frequency pattern or the duration of a phoneme. Synthetic speech is subsequently obtained by use of these parameters. For instance, a speech synthesis system is described in Keikichi Hirose, xe2x80x9cSpeech Synthesis Technologyxe2x80x9d, Speech Processing Technology and its Applications, Information Processing, pages 984-991 (November 1997).
If the procedure described above is used, the prosodic parameters determine naturalness relating to intonation, rhythm and smoothness of the speech and the vocal tract transfer functions determine the intelligibility of individual syllables that make up a word or a sentence.
Among the prosodic parameters, the xe2x80x9cadded-type modelxe2x80x9d is a typical model for generating fundamental frequency pattern parameters. The generation model of this fundamental frequency pattern adds a rising or falling accent component to the fundamental frequency, e.g. corresponding to an accent type for a sentence syllable to a phrase component where a fundamental frequency goes down smoothly in response to a phrase. Although the added-type model is easy to be understood intuitively and matches with an actual speech phenomenon because this model imitates a human vocalization structure, there is a problem that sophisticated language processing is required to make this model work.
The duration of a phoneme as a prosodic parameter, depends on the context in which the phoneme is placed, ie. the context of the syllable. There are many factors which affect the duration of the phoneme such as modulation constraints, timing, importance of a word, indication of speech boundaries, tempo within speech areas, and syntactical meaning. Statistical analysis is typically performed against actual measurements of duration time data in order to determine the degree to which each of these factors affects duration, and the rules thus obtained are applied. However, maintaining the large-scale database that is needed to construct duration modules in a variety of contexts is a problem.
Apart from these prosodic parameters, there have been proposals for a variety of control modes for power-related parameters. However, all of these models are prosodic parameter independent models, and there is a natural limit to the extent to which the performance of these independent control models can be improved. It has been pointed out that the modeling of sentence speech according to rules is difficult according to a prosodic phenomenon.
The creation of a database built from prosodic parameters selected from natural speech has been proposed. The database would be used by a prosodic parameter model to calculate prosodic parameters, as proposed, for instance, in Katae et al, xe2x80x9cA Domain Specific Text-to-Speech System Using a Prosody Database Retrieved with a Sentence Structurexe2x80x9d, Studies in Sound, pages 275-276 (March 1996); or in Saito et al, xe2x80x9cA Rule-Based Speech Synthesis Method Using Fuzokugo-Sequence Unitxe2x80x9d, Studies in Sound, pages 317-319 (June 1998). However, these publications introduce only the fundamental frequency pattern as a prosodic parameter and are insufficient for improving the naturalness of sentence speech (speaking in sentences).
The present invention relates to a speech synthesis system for synthesizing an improved speech having a natural characteristic by editing and processing each prosodic parameter (fundamental frequency pattern, the duration of phoneme, etc.) of natural speech.
The present invention provides a text speech synthesis system for synthesizing a speech having an improved natural characteristic as compared with the conventional method by: providing a speech corpus that includes a speech sentence, prosodic parameters of the speech sentence and morphological element/structured sentence analysis data; abstracting data wherein a similarity degree with an input sentence becomes largest by searching the speech corpus; creating and correcting prosodic parameters for the abstracted data; and thereby producing prosodic parameters to be used in the synthesizing.