For the purpose of generating synthetic speech that sounds natural to a listener, a speech synthesis technique employing a waveform editing and synthesizing method has been used heretofore. In this method, a speech synthesizer apparatus records human speech in advance and stores waveforms of the speech as speech waveform data in a database. Then, the speech synthesizer apparatus generates synthetic speech, also referred to as synthesized speech, by reading and connecting multiple speech waveform data pieces in accordance with an inputted text. It is preferable that the frequency and tone of speech change continuously in order to make such synthetic speech sound natural to a listener. For example, when the frequency and tone of speech change greatly in a part where speech waveform data pieces are connected to each other, the resultant synthetic speech sounds unnatural.
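The continuity requirement at the connected parts can be illustrated with a minimal sketch. It assumes each speech waveform data piece is summarized only by its fundamental frequency at its start and end. The names `Unit`, `join_jumps`, and `unnatural_joins`, and the 20 Hz threshold, are hypothetical and chosen for illustration; they are not taken from any apparatus described here.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical summary of one speech waveform data piece."""
    label: str
    start_f0: float  # fundamental frequency (Hz) at the start of the piece
    end_f0: float    # fundamental frequency (Hz) at the end of the piece


def join_jumps(units):
    """Return the absolute frequency jump (Hz) at each connection."""
    return [abs(b.start_f0 - a.end_f0) for a, b in zip(units, units[1:])]


def unnatural_joins(units, threshold_hz=20.0):
    """Return indices of connections whose jump exceeds the threshold.

    The threshold is an assumed illustrative value, not a measured one.
    """
    return [i for i, jump in enumerate(join_jumps(units)) if jump > threshold_hz]


if __name__ == "__main__":
    sequence = [
        Unit("ko", 120.0, 125.0),
        Unit("n", 126.0, 130.0),
        Unit("ni", 170.0, 165.0),  # a poorly matched piece
    ]
    print(join_jumps(sequence))
    print(unnatural_joins(sequence))
```

In this sketch, a connection is flagged when the frequency at the end of one piece differs from the frequency at the start of the next by more than the assumed threshold. A practical system would also compare tone (spectral) features, which are omitted here.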
However, the types of speech waveform data that can be recorded in advance are limited by cost and time constraints and by the storage capacity and processing performance of a computer. For this reason, in some cases, a substitute speech waveform data piece is used instead of the proper data piece to generate a certain part of the synthesized speech, since the proper data piece is not registered in the database. This may consequently cause the frequency and the like in the connected part to change so much that the synthesized speech sounds unnatural. This case is more likely to happen when the content of the inputted text differs greatly from the content of the speech recorded in advance for generating the speech waveform data pieces.
A speech output apparatus disclosed in Japanese Patent Application Laid-open Publication No. 2003-131679 makes a text more understandable to a listener by converting the text composed of phrases in a written language into a text in a spoken language, and then by reading the resultant text aloud. However, this apparatus only converts the expression of a text from the written language to the spoken language, and this conversion is performed independently of information on frequency changes and the like in speech waveform data. Accordingly, this conversion does not contribute to improving the quality of the synthetic speech itself. In a technique described in Wael Hamza, Raimo Bakis, and Ellen Eide, “RECONCILING PRONUNCIATION DIFFERENCES BETWEEN THE FRONT-END AND BACK-END IN THE IBM SPEECH SYNTHESIS SYSTEM,” Proceedings of ICSLP, Jeju, South Korea, 2004, pp. 2561-2564, multiple phoneme segments that are pronounced differently but written in the same manner are stored in advance, and an appropriate phoneme segment among the multiple phoneme segments is selected so that the synthesized speech can be improved in quality. However, even when such a selection is made, the resultant synthesized speech sounds unnatural if an appropriate phoneme segment is not included in those stored in advance.