The present invention relates to speech synthesis. In particular, the present invention relates to prosody in speech synthesis.
Text-to-speech technology allows computerized systems to communicate with users through synthesized speech. The quality of these systems is typically measured by how natural or human-like the synthesized speech sounds.
Very natural sounding speech can be produced by simply replaying a recording of an entire sentence or paragraph of speech. However, the complexities of human languages and the limitations of computer storage make it impossible to store every conceivable sentence that may occur in a text. Because of this, the art has adopted a concatenative approach to speech synthesis that can be used to generate speech from any text. This concatenative approach combines stored speech samples representing small speech units such as phonemes, diphones, triphones, or syllables to form a larger speech signal.
One problem with such concatenative systems is that a stored speech sample has a pitch and a duration that are set by the context in which the sample was spoken. For example, in the sentence “Joe went to the store” the speech units associated with the word “store” have a lower pitch than in the question “Joe went to the store?” Because of this, if stored samples are simply retrieved without reference to their pitch or duration, some of the samples will have the wrong pitch and/or duration for the sentence, resulting in unnatural sounding speech.
One technique for overcoming this is to identify the proper pitch and duration for each sample. Based on this prosody information, a particular sample may be selected and/or modified to match the target pitch and duration.
Identifying the proper pitch and duration is known as prosody prediction. Typically, it involves generating a model that describes the most likely pitch and duration for each speech unit given some text. The result of this prediction is a set of numerical targets for the pitch and duration of each speech segment.
These targets can then be used to select and/or modify a stored speech segment. For example, the targets can be used to first select the speech segment that has the closest pitch and duration to the target pitch and duration. This segment can then be used directly or can be further modified to better match the target values.
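The selection step described above can be illustrated with a minimal sketch. The `Segment` structure, the `select_segment` function, and the weighted relative-error cost are assumptions made for illustration only; they are not taken from any particular prior art system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A stored speech sample with its measured prosody (hypothetical structure)."""
    name: str
    pitch_hz: float
    duration_ms: float


def select_segment(candidates: List[Segment],
                   target_pitch_hz: float,
                   target_duration_ms: float,
                   pitch_weight: float = 1.0,
                   duration_weight: float = 1.0) -> Segment:
    """Return the candidate whose pitch and duration are closest to the targets."""

    def cost(seg: Segment) -> float:
        # Relative errors are used so that Hz and ms are on a comparable scale.
        pitch_err = (seg.pitch_hz - target_pitch_hz) / target_pitch_hz
        dur_err = (seg.duration_ms - target_duration_ms) / target_duration_ms
        return pitch_weight * pitch_err ** 2 + duration_weight * dur_err ** 2

    return min(candidates, key=cost)


# Example: three stored samples of the same speech unit.
stored = [
    Segment("a", pitch_hz=100.0, duration_ms=80.0),
    Segment("b", pitch_hz=120.0, duration_ms=100.0),
    Segment("c", pitch_hz=150.0, duration_ms=90.0),
]
best = select_segment(stored, target_pitch_hz=125.0, target_duration_ms=95.0)
```

The selected segment could then be used directly or passed to a prosody modification step; the weights are a hypothetical way of trading pitch accuracy against duration accuracy.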
For example, one prior art technique for modifying the prosody of speech segments is the so-called Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) technique, which is described in “Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis using Diphones”, E. Moulines and F. Charpentier, Speech Communication, vol. 9, no. 5, pp. 453-467, 1990. Using this technique, the prior art increases the pitch of a speech segment by identifying a section of the speech segment responsible for the pitch. This section is a complex waveform that is a sum of sinusoids at multiples of a fundamental frequency F0. The pitch period is defined by the distance between two pitch peaks in the waveform.
To increase the pitch, the prior art copies a segment of the complex waveform that is as long as the pitch period. This copied segment is then shifted by some portion of the pitch period and reinserted into the waveform. For example, to double the pitch, the copied segment would be shifted by one-half the pitch period, thereby inserting a new peak half-way between two existing peaks and cutting the pitch period in half.
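The pitch-doubling example can be illustrated with a deliberately simplified sketch. It assumes an idealized waveform consisting of one impulse per pitch period and omits the windowing and pitch-mark detection that an actual TD-PSOLA implementation would use; the function names are hypothetical.

```python
def add_shifted_copies(signal, pitch_period, shift):
    """Copy each pitch-period-long segment, shift it by `shift` samples,
    and add it back into the waveform (simplified overlap-add, no windowing)."""
    out = list(signal)
    for start in range(0, len(signal), pitch_period):
        segment = signal[start:start + pitch_period]
        for offset, sample in enumerate(segment):
            dest = start + shift + offset
            if dest < len(out):
                out[dest] += sample
    return out


def peak_positions(signal, threshold=0.5):
    """Return the indices of samples that exceed `threshold`."""
    return [i for i, sample in enumerate(signal) if sample > threshold]


# Idealized waveform: one peak every 8 samples.
period = 8
waveform = [1.0 if i % period == 0 else 0.0 for i in range(32)]

# Shifting the copies by one-half the pitch period inserts a new peak
# half-way between each pair of existing peaks.
doubled = add_shifted_copies(waveform, period, period // 2)
peaks = peak_positions(doubled)
```

In this idealized case the spacing between adjacent peaks drops from 8 samples to 4, which corresponds to cutting the pitch period in half.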
To lengthen a speech segment, the prior art copies a section of the speech segment and inserts the copy into the complex waveform. As a result, the entire portion of the speech segment after the copied section is time-shifted by the length of the copied section, so that the duration of the speech unit increases.
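The lengthening operation can be sketched as follows. The sketch treats the waveform as a plain list of samples and inserts the copy immediately after the copied section without any smoothing at the joins; the function name and parameters are assumptions for illustration.

```python
def lengthen_by_repeating(signal, start, length):
    """Copy signal[start:start + length] and insert the copy right after it.

    Everything after the copied section is time-shifted by `length` samples.
    """
    end = start + length
    section = signal[start:end]
    return signal[:end] + section + signal[end:]


samples = [0, 1, 2, 3, 4, 5]
longer = lengthen_by_repeating(samples, start=2, length=2)
```

Here the samples originally at positions 4 and 5 end up at positions 6 and 7, and the total duration grows by the length of the copied section.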
Unfortunately, these techniques for modifying the prosody of a speech unit have not produced completely satisfactory results. In particular, these modification techniques tend to produce mechanical or “buzzy” sounding speech.
Thus, it would be desirable to be able to select a stored unit that provides good prosody without modification. However, because of memory limitations, samples cannot be stored for all of the possible prosodic contexts in which a speech unit may be used. Instead, a limited set of samples must be selected for storage. Because of this, the performance of a system that uses stored samples without prosody modification depends on which samples are stored.
Thus, there is an ongoing need for improving the selection of these stored samples in systems that do not modify the prosody of the stored samples. There is also an ongoing need to reduce the computational complexity associated with identifying the proper prosody for the speech units.