Speech synthesis is a growing technology with applications in areas that include, but are not limited to, automated directory services, automated help desks and technology support infrastructure, human/computer interfaces, etc. Speech synthesis typically involves the production of electronic signals that, when broadcast, mimic human speech and are intelligible to a human listener or recipient. For example, in a typical text-to-speech application, text to be converted to speech is parsed into labeled phonemes which are then described by appropriately composed signals that drive an acoustic output, such as one or more resonators coupled to a speaker or other device capable of broadcasting sound waves.
Speech synthesis can be broadly categorized as using either concatenative or formant-based methods to generate synthesized speech. In concatenative approaches, speech is formed by appropriately concatenating pre-recorded voice fragments together, where each fragment may be a phoneme or other sound component of the target speech. One advantage of concatenative approaches is that, since it uses actual recordings of human speakers, it is relatively simple to synthesize natural sounding speech. However, the library of pre-recorded speech fragments needed to synthesize speech in a general manner requires relatively large amounts of storage, limiting application of concatenative approaches to systems that can tolerate a relatively large footprint, and/or systems that are not otherwise resource limited. In addition, there may be perceptual artifacts at transitions between speech fragments.
Formant-based approaches achieve voice synthesis by generating a model configured to build a speech signal using a relatively compact description or language that employs at least speech formants as a basis for the description. The model may, for example, consider the physical processes that occur in the human vocal tract when an individual speaks. To configure or train the model, recorded speech of known content may be parsed and analyzed to extract the speech formants in the signal. The term formant refers herein to certain resonant frequencies of speech. Speech formants are related to the physical processes of resonance in a substantially tubular vocal tract. The formants in a speech signal, and particularly the first three resonant frequencies, have been identified as being closely linked to, and characteristic of, the phonetic significance of sounds in human speech. As a result, a model may incorporate rules about how one or more formants should transition over time to mimic the desired sounds of the speech being synthesized.
Generally speaking, there are at least two phases to formant-based speech synthesis: 1) generating a speech synthesis model capable of producing a formant tract characteristic of target speech; and 2) speech production. Generating the speech synthesis model may include analyzing recorded speech signals, extracting formants from the speech signals and using knowledge gleaned from this information to train the model. Speech production generally involves using the trained speech synthesis model to generate the phonetic descriptions of the target speech, for example, generating an appropriate formant tract, and converting the description (e.g., via resonators) to an acoustic signal comprehensible to a human listener.