Text-To-Speech technology allows computerized systems to communicate with users through synthesized speech. The quality of these systems is typically measured by how natural or human-like the synthesized speech sounds.
Very natural sounding speech can be produced by simply replaying a recording of an entire sentence or paragraph of speech. However, the complexity of human communication through languages and the limitations of computer storage may make it impossible to store every conceivable sentence that may occur in a text. Because of this, the art has adopted a concatenative approach to speech synthesis that can be used to generate speech from any text. This concatenative approach combines stored speech samples representing small speech units such as phonemes, diphones, triphones, or syllables to form a larger speech signal.
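The concatenative approach described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory inventory that maps each speech unit label (here, diphone-like labels) to a list of waveform samples; the function name, the labels, and the sample values are illustrative assumptions, not taken from any particular system. Practical systems typically also smooth the joins between units, which this sketch omits.

```python
from typing import Dict, List, Sequence


def concatenate_units(
    unit_sequence: Sequence[str],
    inventory: Dict[str, List[float]],
) -> List[float]:
    """Join the stored samples for each unit, in order, into one signal.

    `inventory` is a hypothetical mapping from unit label to waveform
    samples; no smoothing is applied at the unit boundaries.
    """
    signal: List[float] = []
    for unit in unit_sequence:
        if unit not in inventory:
            raise KeyError(f"no stored sample for unit {unit!r}")
        signal.extend(inventory[unit])
    return signal


if __name__ == "__main__":
    # Toy inventory with made-up sample values.
    toy_inventory = {
        "s-t": [0.1, 0.2],
        "t-ao": [0.3],
        "ao-r": [0.4, 0.5],
    }
    print(concatenate_units(["s-t", "t-ao", "ao-r"], toy_inventory))
```

Because the samples are joined exactly as recorded, each unit keeps the pitch and duration of its original recording context.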
One problem with such concatenative systems is that a stored speech sample has a pitch and duration that is set by the context in which the sample was spoken. For example, in the sentence “Joe went to the store” the speech units associated with the word “store” have a lower pitch than in the question “Joe went to the store?” Because of this, if stored samples are simply retrieved without reference to their pitch or duration, some of the samples will have the wrong pitch and/or duration for the sentence, resulting in unnatural sounding speech.
One technique for overcoming this is to identify the proper pitch and duration for each sample. Based on this prosody information, a particular sample may be selected and/or modified to match the target pitch and duration.
Identifying the proper pitch and duration is known as prosody prediction. Typically, it involves generating a model that describes the most likely pitch and duration for each speech unit given some text. The result of this prediction is a set of numerical targets for the pitch and duration of each speech segment. An example of a prosody predictor may be found at www.bach.arts.kuleuven.be/pmertens/prosody/mingus.html and references cited therein.
These targets can then be used to select and/or modify a stored speech segment. For example, the targets can be used to first select the speech segment that has the closest pitch and duration to the target pitch and duration. This segment can then be used directly or can be further modified to better match the target values.
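The selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the `Segment` record, the relative-error distance with equal weighting of pitch and duration, and the scaling-factor helper are all hypothetical choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Segment:
    """A stored speech segment (hypothetical record layout)."""
    unit: str           # unit label, e.g. a diphone name
    pitch_hz: float     # average pitch of the recording
    duration_ms: float  # duration of the recording


def prosody_distance(seg: Segment, target_pitch_hz: float,
                     target_duration_ms: float) -> float:
    """Sum of relative pitch and duration errors (assumed equal weights)."""
    pitch_err = abs(seg.pitch_hz - target_pitch_hz) / target_pitch_hz
    dur_err = abs(seg.duration_ms - target_duration_ms) / target_duration_ms
    return pitch_err + dur_err


def select_segment(candidates: Iterable[Segment], unit: str,
                   target_pitch_hz: float,
                   target_duration_ms: float) -> Segment:
    """Pick the stored segment of `unit` closest to the prosody targets."""
    matching = [s for s in candidates if s.unit == unit]
    if not matching:
        raise KeyError(f"no stored segment for unit {unit!r}")
    return min(
        matching,
        key=lambda s: prosody_distance(s, target_pitch_hz, target_duration_ms),
    )


def modification_factors(seg: Segment, target_pitch_hz: float,
                         target_duration_ms: float) -> Tuple[float, float]:
    """Pitch and duration scaling still needed after selection.

    Factors near 1.0 mean the segment can be used almost as recorded.
    """
    return (target_pitch_hz / seg.pitch_hz,
            target_duration_ms / seg.duration_ms)


if __name__ == "__main__":
    stored = [
        Segment("s-t", 110.0, 90.0),
        Segment("s-t", 180.0, 120.0),
    ]
    best = select_segment(stored, "s-t", 170.0, 110.0)
    print(best, modification_factors(best, 170.0, 110.0))
```

The further the returned factors are from 1.0, the more the selected segment must be modified to reach the target values.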
For example, one technique for modifying the prosody of speech segments is the so-called Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) technique, which is described in “Pitch-Synchronous Waveform Processing Techniques for Text-To-Speech Synthesis using Diphones”, E. Moulines and F. Charpentier, Speech Communication, vol. 9, no. 5, pp. 453-467, 1990, the contents of which are incorporated herein by reference.
Unfortunately, existing techniques for modifying the prosody of a speech unit have not produced completely satisfactory results. In particular, these modification techniques tend to produce mechanical or “buzzy” sounding speech, especially when the difference between the required prosody and the recorded one is large.
Thus, it would be desirable to be able to select a stored unit that provides good prosody without modification or only with minimal modification.
However, because of memory limitations, samples cannot be stored for all of the possible prosodic contexts in which a speech unit may be used. Instead, a limited set of samples must be selected for storage. Because of this, the performance of a system that uses stored samples with limited prosody modification is dependent on what samples are stored.
US Patent Application Publication No. 2004/0148171, assigned to Microsoft, suggests dealing with this problem by recording a very large corpus, for instance, a corpus containing about 97 million Chinese characters, and selecting from this corpus a limited set of sentences, identified to include the most necessary ‘context vectors’. Only speech samples from the selected units are stored.
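The general idea of choosing a limited set of sentences that together cover as many prosodic contexts as possible can be illustrated with a simple greedy coverage heuristic. This sketch is not the method of the cited publication; the greedy rule, the function name, the storage budget, and the string context labels are assumptions introduced purely for illustration.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(
    sentences: Dict[str, Set[str]],
    budget: int,
) -> Tuple[List[str], Set[str]]:
    """Pick up to `budget` sentences, each adding the most new contexts.

    `sentences` maps a sentence id to the set of (hypothetical) context
    labels that the sentence contains. Ties go to the earliest sentence.
    """
    covered: Set[str] = set()
    chosen: List[str] = []
    remaining = dict(sentences)
    while remaining and len(chosen) < budget:
        # Sentence contributing the largest number of uncovered contexts.
        best_id = max(remaining, key=lambda sid: len(remaining[sid] - covered))
        gain = remaining[best_id] - covered
        if not gain:
            break  # nothing new can be covered; stop early
        chosen.append(best_id)
        covered |= gain
        del remaining[best_id]
    return chosen, covered


if __name__ == "__main__":
    corpus = {
        "sent_a": {"st/falling", "st/rising", "ao/falling"},
        "sent_b": {"ao/falling", "ao/rising"},
        "sent_c": {"st/falling", "st/rising"},
    }
    print(greedy_select(corpus, budget=2))
```

With a fixed budget, contexts that no selected sentence contains remain uncovered, so segments for those contexts would still require prosody modification at synthesis time.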
U.S. Pat. No. 6,829,581 discloses synthesizing speech by a synthesizer based on prosody prediction rules, and then asking a reader to imitate the synthesized speech. The reader is asked to preserve the nuance of the utterance as spoken by the synthesizer and to follow the location of the peaks and dips in the intonation while still trying to sound natural. The speaker sees the text of the sentence, hears it synthesized two to three times, and records it. Speech segments taken from speech recorded in this way are concatenated to synthesize speech of other sentences. The method is described in the patent as circumventing the need to concatenate dissimilar speech units to each other.
U.S. Pat. No. 5,915,237 discloses a speech encoding system for encoding a digitized speech signal into a standard digital format, such as MIDI.
US Patent Application Publication No. 2006/0069567 describes TTS systems based on voice-files, comprising speech samples taken from words spoken by a particular speaker. In one example, the speaker reads the words from a pronunciation dictionary.