In recent years, along with the development of a speech synthesis technique which synthesizes speech from text, natural synthetic speech close to human voice production can be obtained.
A recent speech synthesis system generally uses a method of learning prosody or voice quality statistical model from a speech corpus of recorded human speech data. For example, as a prosody statistical model, a decision tree model, hidden Markov model, and the like are known. Using these statistical models, intonation of arbitrary text which is not included in a learning corpus can be reproduced naturally to some extent.
However, since the statistical model learns average prosodic features from many utterances in the speech corpus, intonation of synthetic speech generated from the statistical model tends to be monotonic. Hence, a system which visually presents a prosodic pattern generated by the statistical model to the user, and allows the user to graphically edit the pattern using a device such as a mouse is known.