1. Field of the Invention
The present invention relates to a prosody modification device including a real voice prosody input part that receives real voice prosody information extracted from an utterance of a human and a real voice prosody modification part that modifies the real voice prosody information received by the real voice prosody input part, a prosody modification method, and a recording medium storing a prosody modification program.
2. Description of Related Art
In recent years, various systems and apparatuses have used speech synthesis technology, which converts character strings (text) into speech and outputs the obtained speech. For example, this technology is applied to IVR (Interactive Voice Response) systems, in-vehicle information terminals, and mobile phones to read aloud guidance on an operating method or mail, as well as to support systems for visually impaired persons and speech-impaired persons, and the like. However, with the current state of speech synthesis technology, it is difficult to generate synthetic speech that is as natural and expressive as a human real voice.
The prosody of synthetic speech generally is determined by performing processes such as a morphological analysis, i.e., an analysis of the reading and the part of speech of each word in a character string, an analysis of clauses and modification relations, the setting of an accent, an intonation, a pause, and a rate of speech, and the like. With the current state of processing technology, however, it is difficult to perform an analysis that takes the meaning of a sentence and its context into consideration as accurately as a human does, and the result of the analysis may contain errors. As a result, the prosody of synthetic speech generated by the speech synthesis technology, which determines a manner of speaking such as voice pitch, intonation, rhythm, and the like, may be partially unnatural as compared with a human real voice.
To solve the above-described problem, the following method for improving the quality of the prosody of synthetic speech is known. In the case where a character string to be converted into synthetic speech is predetermined, prosody information is extracted from an utterance of a human, and the synthetic speech is generated by using the extracted prosody information of a real voice as it is (for example, see JP 10(1998)-153998 A, JP 9(1997)-292897 A, JP 11(1999)-143483 A, and JP 7(1995)-140996 A). In this method, although the operation of extracting the human utterance and its prosody is required in advance, it is possible to generate synthetic speech as natural and expressive as a human real voice, since the synthetic speech is generated by using the prosody information of the real voice extracted from the human utterance.
Meanwhile, in order to extract the prosody information from the human utterance, a phoneme boundary is set for each phoneme either by a manual operation or automatically by using DP (Dynamic Programming) matching, HMM (Hidden Markov Model), or the like.
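For illustration only, the automatic setting of phoneme boundaries by DP matching can be sketched as follows. This is a minimal sketch under stated assumptions and is not the method of any of the cited documents: each utterance is assumed to be reduced to a one-dimensional sequence of per-frame feature values, a reference sequence with known boundary frames is assumed to be available, and the function names `dtw_path` and `map_boundaries` are hypothetical.

```python
def dtw_path(ref, tgt):
    """Align two 1-D feature sequences by DP matching (dynamic time warping).

    Returns the minimum-cost list of (ref_frame, tgt_frame) index pairs.
    The per-frame distance is assumed to be the absolute difference.
    """
    n, m = len(ref), len(tgt)
    inf = float("inf")
    # cost[i][j]: minimum cumulative distance aligning ref[:i] with tgt[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref[i - 1] - tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map boundary frames of the reference onto the target utterance.

    Each reference boundary is placed at the first target frame aligned to it.
    """
    first_match = {}
    for r, t in path:
        first_match.setdefault(r, t)
    return [first_match[b] for b in ref_boundaries]


if __name__ == "__main__":
    # Hypothetical feature values: the target holds each segment longer.
    ref = [0, 0, 5, 5, 0, 0]
    tgt = [0, 0, 0, 5, 5, 5, 0, 0]
    print(map_boundaries(dtw_path(ref, tgt), [2, 4]))
```

When the two sequences contain similar sounds or noise, several alignments have nearly equal cost, and the mapped boundary may land on the wrong frame; this is the kind of erroneous extraction discussed below.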
In the former case, a human is required, for example, to visually identify the boundary of each phoneme on a displayed speech waveform and to set the phoneme boundary. This operation requires expert knowledge about speech and is time-consuming and laborious.
On the other hand, in the latter case, the prosody information may be extracted erroneously, that is, an erroneous phoneme boundary may be set. Even with DP matching, HMM, or the like, it is sometimes difficult to set a correct phoneme boundary because of similar sounds and noise. When the prosody information is extracted from a real voice erroneously, prosodically unnatural synthetic speech is generated, so the erroneously extracted prosody information must be modified. To modify it, a human is ultimately required to visually confirm the automatically set phoneme boundaries and to correct those that were set erroneously. As in the former case, this operation requires expert knowledge about speech and is time-consuming and laborious.