Text-to-speech (TTS) technology allows computerized systems to communicate with users through synthesized speech. The quality of these systems is typically measured by how natural or human-like the synthesized speech sounds.
Very natural-sounding speech can be produced by simply replaying a recording of an entire sentence or paragraph of speech. However, the complexity of human communication through languages and the limitations of computer storage may make it impossible to store every conceivable sentence that may occur in a text. Because of this, the art has adopted a concatenative approach to speech synthesis that can be used to generate speech from any text. This concatenative approach combines stored speech samples representing small speech units, such as phonemes, di-phones, tri-phones, or syllables, to form larger speech signals.
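The concatenative approach described above can be illustrated with a minimal sketch. The function name `concatenate_units`, the dictionary-based unit library, and the optional linear cross-fade at each join are assumptions made for illustration only; they are not taken from any particular TTS system.

```python
from typing import Dict, List, Sequence


def concatenate_units(
    units: Sequence[str],
    library: Dict[str, List[float]],
    overlap: int = 0,
) -> List[float]:
    """Join the stored waveforms for `units` in order.

    `library` maps a unit label (e.g. a di-phone name) to its samples.
    `overlap` is a hypothetical parameter: the number of samples that are
    linearly cross-faded at each join (0 means plain concatenation).
    """
    out: List[float] = []
    for name in units:
        samples = library[name]  # raises KeyError if the unit was never stored
        n = min(overlap, len(out), len(samples))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit
            out[start + i] = (1.0 - w) * out[start + i] + w * samples[i]
        out.extend(samples[n:])
    return out


# Toy library with constant-valued "waveforms" so the joins are easy to inspect.
toy_library = {"j-ow": [1.0, 1.0, 1.0], "ow-w": [0.0, 0.0, 0.0]}
plain = concatenate_units(["j-ow", "ow-w"], toy_library)
faded = concatenate_units(["j-ow", "ow-w"], toy_library, overlap=1)
```

With `overlap=0` the units are simply placed end to end; with a non-zero overlap the tail of one unit is blended into the head of the next, which shortens the output by the overlapped samples.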
Today, TTS libraries are limited to a certain number of speech samples from which speech may be generated. These TTS libraries are limited to speech samples that have high spectral similarity at the point where the speech samples may be concatenated together. Typically, spectral similarity is determined based on spectral clustering techniques, known in the existing art, that are used to cluster similar symbols. Spectral similarity may be determined by using, for example, short-time Fourier transforms (STFT) or, alternatively, Mel-frequency cepstral coefficients (MFCC).
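As a rough illustration of spectral similarity at a concatenation point, the sketch below compares the magnitude spectrum of the last frame of one sample with that of the first frame of the next. It computes a single Hann-windowed DFT frame per side (one frame of an STFT) rather than a full STFT or MFCCs, and it does not perform clustering. The function names, the frame length of 64 samples, and the Euclidean distance measure are all illustrative assumptions.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitude of the DFT of one Hann-windowed frame (non-negative bins only)."""
    n = len(frame)
    if n < 2:
        raise ValueError("frame must contain at least two samples")
    windowed = [
        x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]


def join_distance(left: Sequence[float], right: Sequence[float],
                  frame_len: int = 64) -> float:
    """Euclidean distance between the spectra on either side of a join.

    A smaller value means the end of `left` and the start of `right`
    are spectrally more alike. `frame_len` is an assumed analysis length.
    """
    if len(left) < frame_len or len(right) < frame_len:
        raise ValueError("both samples must be at least frame_len long")
    a = magnitude_spectrum(left[-frame_len:])
    b = magnitude_spectrum(right[:frame_len])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tone(cycles_per_frame: int, length: int = 128, frame_len: int = 64) -> List[float]:
    """Synthetic sine wave used only to exercise the functions above."""
    return [math.sin(2.0 * math.pi * cycles_per_frame * i / frame_len)
            for i in range(length)]


same = join_distance(tone(4), tone(4))        # identical spectra at the join
different = join_distance(tone(4), tone(12))  # energy in different bins
```

In this toy setup, joining a tone to itself yields a distance near zero, while joining tones of different frequencies yields a clearly larger distance.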
TTS synthesis techniques using the TTS libraries can face a number of difficulties, such as requiring large amounts of storage space for the speech samples needed to produce rich repositories of speech. When such space is not available, a poor speech repository is produced. Moreover, concatenating speech samples is limited to the context in which the speech samples were spoken. For example, in the sentence “Joe went to the store,” the speech units associated with the word “store” have a lower pitch than the same units in the question “Joe went to the store?” Because of this, if stored speech samples are simply retrieved without reference to their pitch or duration, some of the speech samples could have the wrong pitch and/or duration for the concatenated speech, thereby resulting in unnatural-sounding speech.
Existing solutions for producing speech typically require that speech samples be separated from each other and concatenated together in new combinations to enrich the speech repository. However, the drawback of such solutions is that the initial speech sample separation may be inaccurate, and therefore the production of speech from those speech samples may yield poor results. One way to overcome this drawback is by producing speech from a very large set of speech samples. However, maintaining all the speech samples would significantly increase the size of a TTS library and the time required for processing such speech samples.
It would therefore be advantageous to provide an efficient solution for optimally and dynamically concatenating speech samples.