This invention relates to a speech synthesizing apparatus having a database for managing phoneme data, in which the apparatus performs speech synthesis using the phoneme data managed by the database. The invention further relates to a method of synthesizing speech using this apparatus, and to a storage medium storing a program for implementing this method.
A method of speech synthesis that concatenates waveforms (referred to below as the “concatenative synthesis method”) is available in the prior art. The concatenative synthesis method changes prosody using the pitch-synchronous overlap-add (P-SOLA) method, which places pitch waveform units extracted from the original waveform unit in conformity with a desired pitch timing. An advantage of the concatenative synthesis method is that the synthesized speech obtained is more natural than that provided by a synthesis method based upon parameters. A disadvantage is that the allowable range for the change in prosody is narrow.
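The placement of pitch waveform units described above can be illustrated with a minimal sketch. It assumes that the pitch marks are already known as sample indices and that a fixed-width Hann window is acceptable; practical P-SOLA implementations typically adapt the window to the local pitch period. The names `hann`, `psola`, `src_marks`, `tgt_marks` and `half_width` are hypothetical and do not come from the method described here.

```python
import math

def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def psola(signal, src_marks, tgt_marks, half_width):
    """Overlap-add windowed pitch waveform units at the desired pitch timing.

    signal     : list of samples of the original waveform unit
    src_marks  : pitch marks (sample indices) in the original unit
    tgt_marks  : desired pitch marks (sample indices) in the output
    half_width : samples taken on each side of a pitch mark (assumed fixed)
    """
    out = [0.0] * len(signal)
    window = hann(2 * half_width + 1)
    for t in tgt_marks:
        # Use the unit whose original pitch mark lies closest to the target time.
        s = min(src_marks, key=lambda m: abs(m - t))
        for k in range(-half_width, half_width + 1):
            si, ti = s + k, t + k
            if 0 <= si < len(signal) and 0 <= ti < len(out):
                out[ti] += signal[si] * window[k + half_width]
    return out

# Toy input: an impulse train with a period of 100 samples.
signal = [0.0] * 1000
src_marks = list(range(100, 1000, 100))
for m in src_marks:
    signal[m] = 1.0

# Desired timing: a period of 80 samples, i.e. a higher pitch.
tgt_marks = list(range(80, 1000, 80))
out = psola(signal, src_marks, tgt_marks, half_width=100)
print(len(src_marks), "pulses in,", sum(1 for v in out if v > 0.5), "pulses out")
```

In this toy case the pulses are simply moved to the new timing. With recorded speech, the further the desired timing departs from the original, the more units must be repeated or dropped, which is one way to see why the allowable range for the change in prosody is narrow.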
Accordingly, sound quality is improved by preparing speech data covering a wide range of variations and selecting appropriate data for use. Information such as the phoneme environment (the phoneme that is the object of synthesis, or several phonemes including those on both sides of it) and the fundamental frequency F0 is used as the criterion for selecting the synthesis unit.
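The selection criteria above can be sketched as a simple filter over the database. This is an illustration only: it assumes the phoneme environment is the target phoneme plus one neighbour on each side, and that "satisfying" F0 means lying within a fixed tolerance of the desired value. The record layout (`PhonemeUnit`), the function `matching_units` and the parameter `tolerance_hz` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PhonemeUnit:
    unit_id: int
    left: str      # preceding phoneme
    phoneme: str   # phoneme that is the object of synthesis
    right: str     # following phoneme
    f0_hz: float   # fundamental frequency F0 of the recorded unit

def matching_units(database, left, phoneme, right, target_f0_hz, tolerance_hz=10.0):
    """Return every unit whose phoneme environment and F0 satisfy the request."""
    return [
        u for u in database
        if (u.left, u.phoneme, u.right) == (left, phoneme, right)
        and abs(u.f0_hz - target_f0_hz) <= tolerance_hz
    ]

# Toy database entries (illustrative values, not measured data).
database = [
    PhonemeUnit(0, "k", "a", "t", 118.0),
    PhonemeUnit(1, "k", "a", "t", 124.0),
    PhonemeUnit(2, "s", "a", "t", 120.0),  # different phoneme environment
    PhonemeUnit(3, "k", "a", "t", 180.0),  # F0 too far from the target
]

hits = matching_units(database, "k", "a", "t", target_f0_hz=120.0)
print([u.unit_id for u in hits])
```

Note that the filter can return more than one unit for the same request, since several recorded units may share a phoneme environment and a similar F0.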
However, the conventional method of synthesizing speech described above involves a number of problems.
By way of example, if a database contains a plurality of items of phoneme data that satisfy a certain phoneme environment and fundamental frequency F0, the phoneme unit used in synthesis is a single phoneme unit selected randomly from these items (e.g., the phoneme unit that appears first in the database). Since the database is a collection of speech uttered by human beings, not all of the phoneme data is necessarily stable (i.e., of good quality). The database may contain phoneme data that is the result of mumbling, a halting voice, slowness of speech or hoarseness. If one item of phoneme data is selected randomly from such a collection, there is naturally a possibility that sound quality will decline when synthesized speech is generated.
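The problem can be made concrete with a small sketch of the two selection policies mentioned above. The candidate list and its `note` field are illustrative assumptions; a conventional database would carry no such quality annotation, which is precisely why neither policy can avoid a poor unit. The function names `pick_first` and `pick_random` are hypothetical.

```python
import random

# Toy candidates that all satisfy the same phoneme environment and F0.
# The "note" field exists only to make the outcome visible in this example.
candidates = [
    {"unit_id": 0, "note": "hoarse"},
    {"unit_id": 1, "note": "clear"},
    {"unit_id": 2, "note": "mumbled"},
]

def pick_first(units):
    """Take whichever matching unit appears in the database first."""
    return units[0]

def pick_random(units, rng):
    """Take any matching unit at random."""
    return rng.choice(units)

chosen = pick_first(candidates)
print(chosen["unit_id"], chosen["note"])
```

Neither function consults the quality of the unit, so the hoarse or mumbled items are as likely to reach the synthesizer as the clear one.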