The present invention relates to speech synthesis. In particular, the present invention relates to time and pitch scaling in speech synthesis.
Text-to-speech systems have been developed to allow computerized systems to communicate with users through synthesized speech. Concatenative speech synthesis systems convert input text into speech by generating small speech segments for small units of the text. These small speech segments are then concatenated to form the complete speech signal.
To create the small speech segments, a text-to-speech system accesses a database that contains samples of a human trainer's voice. The samples are generally grouped in the database according to the speech units they are taken from. In many systems, the speech units are phonemes, which are associated with the individual sounds of speech. However, other systems use diphones (two phonemes) or triphones (three phonemes) as the basis for their database.
The number of bits that can be used to describe each sample for each speech unit is limited by the memory of the system. Thus, text-to-speech systems generally cannot store values that exactly describe the training speech units. Instead, text-to-speech systems only store values that approximate the training speech units. This causes an approximation error in the stored samples, which is sometimes referred to as a compression error.
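The trade-off between bits per sample and approximation error can be illustrated with a minimal sketch. It assumes uniform quantization of samples in the range -1.0 to 1.0, which is only one possible storage scheme and is not stated in the text above; the function names and the synthetic test signal are hypothetical.

```python
import math

def quantize(samples, bits):
    """Round each sample in [-1.0, 1.0] to one of 2**bits evenly spaced levels.

    Uniform quantization is an assumption made for illustration only.
    """
    levels = 2 ** bits
    step = 2.0 / (levels - 1)  # spacing between adjacent stored values
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]

def max_error(samples, bits):
    """Largest absolute difference between the stored and the exact samples."""
    approx = quantize(samples, bits)
    return max(abs(a - s) for a, s in zip(approx, samples))

# Synthetic stand-in for a recorded speech unit: five cycles of a sine wave.
samples = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]

err_4 = max_error(samples, 4)  # few bits per sample: coarse approximation
err_8 = max_error(samples, 8)  # more bits per sample: finer approximation
print(err_4, err_8)
```

In this sketch the error with 8 bits per sample is smaller than with 4 bits, at the cost of twice the storage per sample.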
The number of examples of each speech unit that can be stored for the speech system is also limited by the memory of the computer system. Different examples of each speech unit are needed because the speech units change slightly depending on their position within a sentence and their proximity to other speech units. In particular, the pitch and duration of the speech unit, also known as the prosody of the speech unit, will change significantly depending on the speech unit's location. For example, in the sentence "Joe went to the store" the speech units associated with the word "store" have a lower pitch than in the question "Joe went to the store?"
Since the number of examples that can be stored for each speech unit is limited, a stored example may not always match the prosody of its surrounding speech units when it is combined with other units. In addition, the transition between concatenated speech units is sometimes discontinuous because the speech units have been taken from different parts of the training session.
To correct these problems, the prior art has developed techniques for changing the pitch and duration of a stored speech unit so that the speech unit better fits the context in which it is being used. An example of one such prior art technique is the so-called Time-Domain Pitch-Synchronous Overlap-and-Add (TD-PSOLA) technique, which is described in "Pitch-Synchronous Waveform Processing Techniques for Text-to-Speech Synthesis using Diphones", E. Moulines and F. Charpentier, Speech Communication, vol. 9, no. 5, pp. 453-467, 1990. Using this technique, the prior art increases the pitch of a speech unit by identifying a section of the speech unit responsible for the pitch. This section is a complex waveform that is a sum of sinusoids at multiples of a fundamental frequency F0. The pitch period is defined by the distance between two adjacent pitch peaks in the waveform. To increase the pitch, the prior art copies a segment of the complex waveform that is as long as the pitch period. This copied segment is then shifted by some portion of the pitch period and reinserted into the waveform. For example, to double the pitch, the copied segment would be shifted by one-half the pitch period, thereby inserting a new peak half-way between two existing peaks and cutting the pitch period in half.
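The pitch-doubling example can be sketched as follows. This is a simplified illustration of the copy, shift, and reinsert step described above, not the full TD-PSOLA algorithm of the cited paper. The Hann window, the synthetic pulse train, the function names, and the assumption that the peak positions are already known are all choices made for this sketch.

```python
import math

def hann(length):
    """Symmetric Hann window; its middle value is 1.0 when length is odd."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def add_shifted_copies(signal, peaks, period, shift):
    """Copy a windowed, roughly one-period segment around each pitch peak,
    shift it by `shift` samples, and add it back into the waveform."""
    out = list(signal)
    half = period // 2
    window = hann(2 * half + 1)  # centered on the peak
    for p in peaks:
        for k in range(-half, half + 1):
            src = p + k
            dst = src + shift
            if 0 <= src < len(signal) and 0 <= dst < len(out):
                out[dst] += window[k + half] * signal[src]
    return out

def find_peaks(signal, threshold=0.5):
    """Indices of strict local maxima above `threshold`."""
    return [n for n in range(1, len(signal) - 1)
            if signal[n] > threshold
            and signal[n] > signal[n - 1] and signal[n] > signal[n + 1]]

# Synthetic voiced segment: one narrow triangular pulse per pitch period.
period = 20
peaks = [10, 30, 50, 70, 90]
signal = [max(0.0, 1.0 - min(abs(n - p) for p in peaks) / 3.0)
          for n in range(100)]

# Shifting each copy by half the pitch period inserts a new peak between
# every pair of existing peaks.
doubled = add_shifted_copies(signal, peaks, period, period // 2)
new_peaks = find_peaks(doubled)
print(new_peaks)
```

In this sketch the peaks of the modified waveform are 10 samples apart instead of 20, which corresponds to a doubled pitch. The copy of the last peak falls outside the buffer and is dropped.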
To lengthen a speech unit, the prior art copies a section of the speech unit and inserts the copy into the complex waveform. In other words, the entire portion of the speech unit after the copied segment is time-shifted by the length of the copied segment so that the duration of the speech unit increases.
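The lengthening step amounts to a splice, sketched minimally below. The function name and the list representation of the waveform are assumptions for illustration; the choice of which section to copy is left to the caller.

```python
def lengthen(signal, start, length):
    """Insert a copy of signal[start:start + length] immediately after the
    original section, shifting everything that follows by `length` samples."""
    end = start + length
    segment = signal[start:end]
    return signal[:end] + segment + signal[end:]

unit = [0, 1, 2, 3, 4, 5]
longer = lengthen(unit, start=2, length=2)
print(longer)  # → [0, 1, 2, 3, 2, 3, 4, 5]
```

The duration grows by exactly the length of the copied section, and the samples after the insertion point keep their order.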
Unfortunately, these techniques for modifying the prosody of a speech unit have not produced completely satisfactory results. As such, a new technique is needed for modifying the pitch and duration of speech units during speech synthesis.