Re-creation or synthesis of human speech has been an objective for many years and has been discussed in serious texts as well as in science fiction writings. Human speech, like many other natural human abilities such as sight or hearing, is a fairly complicated function. Synthesizing human speech is therefore far from a simple matter.
Various approaches have been taken to synthesize human speech. One approach to human speech synthesis is known as concatenative. Concatenative synthesis of human speech is based on recording wave form data samples of real human speech of predetermined text. Concatenative speech synthesis then breaks down the pre-recorded original human speech into segments and generates speech utterances by linking these human speech segments to build syllables, words, or phrases. The size of the pre-recorded human speech segments may vary from diphones, to demi-syllables, to whole words.
Various approaches to segmenting the recorded original human voice have been used in concatenative speech synthesis. One approach is to break the real human voice down into basic units of contrastive sound. These basic units of contrastive sound are commonly known in the art of the present invention as phones or phonemes.
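By way of illustration only, the linking of pre-recorded segments described above might be sketched as follows. The segment inventory, its keys, the sample values, and the linear crossfade at each seam are assumptions made for this sketch and are not taken from any particular concatenative synthesizer.

```python
def crossfade_concat(segments, overlap):
    """Join lists of samples, linearly crossfading `overlap` samples at each seam."""
    out = list(segments[0])
    for seg in segments[1:]:
        # Never overlap more samples than either side actually has.
        n = min(overlap, len(out), len(seg))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming segment
            out[start + i] = (1 - w) * out[start + i] + w * seg[i]
        out.extend(seg[n:])
    return out


# Hypothetical inventory of recorded segments (toy sample values, not real speech).
inventory = {
    "h-e": [0.0, 0.2, 0.4, 0.4],
    "e-l": [0.4, 0.3, 0.1, 0.0],
}

utterance = crossfade_concat([inventory["h-e"], inventory["e-l"]], overlap=2)
```

Each seam shortens the result by the number of overlapped samples, so two four-sample segments joined with an overlap of two yield six samples.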
Another approach to human speech synthesis is known as parametric. Parametric synthesis of human speech uses mathematical models to recreate a desired speech sound. For each desired sound, a mathematical model or function is used to generate that sound. Thus, other than possibly in the creation or determination of the underlying mathematical models, parametric synthesis of human speech is generally devoid of any original human speech input.
There are two general categories of parametric speech synthesizers. One type of parametric speech synthesizer is known as an articulatory synthesizer which mathematically models the physical aspects of the human lungs, larynx, and vocal and nasal tracts. The other type of parametric speech synthesizer is known as a formant synthesizer which mathematically models the acoustic aspects of the human vocal tract.
Referring now to FIG. 1, a typical prior art Text-To-Speech (TTS) System 100 can be seen. The input to TTS System 100 is a text string which may comprise the standard alphabetical characters spelling out the desired text, a phonetic translation of the desired text, or some other form representative of the desired text. The first module of TTS System 100 is Language Processor 101 which receives the input text string or other text representation. The primary function of Language Processor 101, as is well known in the art, is to specify the correct pronunciation of the incoming text by converting it into a sequence of phonemes. By pre-processing symbols, numbers, abbreviations, etc., the input text is first normalized into standard input. The normalized text is then converted to its phonetic representation by applying lexicon table look-up, morphological analysis, letter-to-sound rules, etc.
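As a rough sketch of the two stages attributed to Language Processor 101 above (normalization, then conversion to phonemes), the following may be helpful. All tables, phoneme symbols, and function names here are hypothetical and deliberately tiny. Morphological analysis is omitted, and the letter-to-sound fallback is a naive one-letter-per-phoneme mapping used only to show where such rules would apply.

```python
import re

# Hypothetical toy tables; a real system would use far larger resources.
ABBREVIATIONS = {"dr.": "doctor"}
NUMBERS = {"1": "one", "2": "two"}
LEXICON = {
    "doctor": ["D", "AA", "K", "T", "ER"],
    "two": ["T", "UW"],
}
LETTER_TO_SOUND = {"c": "K", "a": "AE", "t": "T", "s": "S"}


def normalize(text):
    """Expand abbreviations and numbers, strip punctuation, lower-case."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif token in NUMBERS:
            words.append(NUMBERS[token])
        else:
            words.append(re.sub(r"[^a-z]", "", token))
    return [w for w in words if w]


def to_phonemes(text):
    """Lexicon look-up first, letter-to-sound rules as a fallback."""
    phonemes = []
    for word in normalize(text):
        if word in LEXICON:
            phonemes.extend(LEXICON[word])
        else:
            phonemes.extend(LETTER_TO_SOUND.get(ch, ch.upper()) for ch in word)
    return phonemes


sequence = to_phonemes("Dr. 2 cats!")
```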
The second module of TTS System 100 is Acoustic Processor 103 which receives as input the phoneme sequence from Language Processor 101. The primary function of Acoustic Processor 103, as is well known in the art, is to convert the phoneme sequence into various synthesizer controls which specify the acoustic parameters of the output speech. The phoneme sequence may be further refined and modified by Acoustic Processor 103 to reflect contextual interactions. Controls for parameters such as prosody (e.g., pitch contours and phoneme duration), voicing source (e.g., voiced or noise), transitional segmentation (e.g., formants, amplitude envelopes) and/or voice color (e.g., timbre variations) may be calculated, depending upon the specific synthesizer type Acoustic Processor 103 will control.
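The conversion from a phoneme sequence to synthesizer controls might, purely as an assumption-laden sketch, look like the following. The `ControlFrame` record, the per-phoneme table, and the linearly falling pitch contour are all hypothetical; the numeric values are illustrative only and are not drawn from the system of FIG. 1.

```python
from dataclasses import dataclass


@dataclass
class ControlFrame:
    phoneme: str
    duration_ms: int
    f0_hz: float        # pitch; 0.0 is used here to mean "no voicing"
    voiced: bool        # voicing source: voiced (True) or noise (False)
    formants_hz: tuple  # target formant frequencies


# Hypothetical table: (duration in ms, voiced?, formant targets in Hz).
PHONEME_TABLE = {
    "AA": (120, True, (730, 1090, 2440)),
    "IY": (100, True, (270, 2290, 3010)),
    "S": (90, False, (0, 0, 0)),  # unvoiced: no formant targets in this toy table
}


def acoustic_controls(phonemes, start_f0=120.0, end_f0=90.0):
    """Attach duration, voicing, formants and a falling pitch contour."""
    steps = max(len(phonemes) - 1, 1)  # avoid dividing by zero for one phoneme
    frames = []
    for i, ph in enumerate(phonemes):
        duration_ms, voiced, formants = PHONEME_TABLE[ph]
        f0 = start_f0 + (end_f0 - start_f0) * i / steps if voiced else 0.0
        frames.append(ControlFrame(ph, duration_ms, f0, voiced, formants))
    return frames


frames = acoustic_controls(["S", "IY", "AA"])
```

A contextual refinement step, as mentioned above, would adjust these frames based on neighboring phonemes; that step is left out of the sketch.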
The third module of TTS System 100 is Speech Synthesizer 105 which receives as input the control parameters of the desired text from Acoustic Processor 103. Speech Synthesizer 105, as is well known in the art, converts the control parameters of the desired text into output wave forms representative of the desired spoken text. Loudspeaker 107 receives as input the output wave forms from Speech Synthesizer 105 and outputs the resulting synthesized speech of the desired text.
Referring now to FIG. 2, a typical configuration of Speech Synthesizer 105 of FIG. 1 for the formant type of parametric speech synthesizer can be seen. With a formant type speech synthesizer, Speech Synthesizer 105 is typically comprised of a voice source 201 and a noise source 203. Voice source 201 is used to simulate the glottal excitation of the vocal tract while noise source 203 is used to simulate some of the other features of the human vocal tract such as the tongue, teeth, lips, etc. As is common in the art, the voice or sound source (after first being passed through a low pass filter, as will be explained more fully below) and the noise source are passed, either singly or in combination through sum circuitry or processing 205, through a resonator and filter network.
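For illustration, a voice source, a noise source, and their summation might be sketched as below. The impulse-train excitation, the one-pole low pass filter, the uniform noise, and all function and parameter names are assumptions of this sketch; actual formant synthesizers use a variety of source models.

```python
import random


def voice_source(n, f0_hz, sample_rate, smoothing=0.9):
    """Impulse train at the pitch period, smoothed by a one-pole low pass filter."""
    period = max(1, int(round(sample_rate / f0_hz)))
    out, prev = [], 0.0
    for i in range(n):
        pulse = 1.0 if i % period == 0 else 0.0
        prev = (1.0 - smoothing) * pulse + smoothing * prev
        out.append(prev)
    return out


def noise_source(n, seed=0):
    """Uniform noise in [-1, 1]; seeded so the sketch is repeatable."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


def mix(voice, noise, voice_gain, noise_gain):
    """Sum the two sources; a gain of zero passes the other source singly."""
    return [voice_gain * v + noise_gain * x for v, x in zip(voice, noise)]


sample_rate = 10000
voice = voice_source(200, 100.0, sample_rate)
noise = noise_source(200)
excitation = mix(voice, noise, voice_gain=1.0, noise_gain=0.1)
```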
The resonator and filter network is typically comprised of a complex network of filters and resonators coupled in parallel and/or cascade fashion whose sole purpose is to create the desired formants for the text to be synthesized. For example, resonators 207, 209 and 211 comprise a cascade resonator configuration while resonators 213, 215 and 217 comprise a parallel resonator configuration. Note that the filter and resonator network of FIG. 2 is merely representative of the type of networks commonly utilized for formant type parametric speech synthesizers. Many combinations and variations of filter and resonator networks have been used in the past.
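The difference between cascade and parallel coupling can be sketched as follows. The sketch assumes a common second-order digital resonator of the form y[n] = A·x[n] + B·y[n-1] + C·y[n-2], with coefficients derived from a center frequency and bandwidth; this particular form, the function names, and the formant values are assumptions and are not stated in the description of FIG. 2.

```python
import math


def resonator(signal, freq_hz, bw_hz, sample_rate):
    """Second-order resonator with unity gain at 0 Hz (assumed form)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(signal, formants, sample_rate):
    """Each resonator feeds the next, as in a series connection."""
    for freq_hz, bw_hz in formants:
        signal = resonator(signal, freq_hz, bw_hz, sample_rate)
    return signal


def parallel(signal, formants, gains, sample_rate):
    """Each resonator sees the same input; the weighted outputs are summed."""
    branches = [resonator(signal, f, bw, sample_rate) for f, bw in formants]
    return [
        sum(g * branch[i] for g, branch in zip(gains, branches))
        for i in range(len(signal))
    ]


# Illustrative (frequency, bandwidth) pairs in Hz.
formants = [(730.0, 60.0), (1090.0, 90.0), (2440.0, 120.0)]
step = [1.0] * 5000
series_out = cascade(step, formants, 10000)
parallel_out = parallel(step, formants, [0.5, 0.3, 0.2], 10000)
```

In the cascade form the relative formant amplitudes follow from the resonator settings themselves, whereas the parallel form exposes an explicit gain per branch.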
Finally, the output of the resonator and filter network is combined by sum circuitry or processing 217 and is then modified by some output processing 219 to resolve any impedance mismatch which would typically occur between the mouth and the outside air.
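The output processing 219 is not specified further in the text above. One approximation sometimes used in formant synthesizers for the radiation of sound at the lips is a first difference of the signal, and the following sketch assumes that approximation; the function name is hypothetical.

```python
def radiate(signal):
    """First-difference filter: y[n] = x[n] - x[n-1], with x[-1] taken as 0."""
    out, prev = [], 0.0
    for x in signal:
        out.append(x - prev)
        prev = x
    return out


shaped = radiate([1.0, 1.0, 3.0])
```

A first difference emphasizes higher frequencies and removes any constant offset after the first sample.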
Although numerous variations and combinations of resonators, filters, and other forms of signal processing have been applied to the output of the voice and/or noise sources 201 and 203 in the past, the resulting output from the various resonator and filter networks has typically lacked the naturalness and flexibility desired. Again, the re-creation or synthesis of human speech is a complex function which is further compounded by the sensitivity of the standard measuring device, the human ear. If the resulting synthesized speech contains any flat, wooden, static or robotic qualities, the human ear often readily perceives this. The listener's reaction to these imperfections in synthesized speech ranges from minor annoyance to lack of comprehension of the synthesized spoken words.
The present invention overcomes some of the limitations in the prior art speech synthesizers by utilizing a multiplicity of voice sources, one or more of which may comprise a recorded sound wave sample, which thus produces a more natural and flexible synthesized speech sound.
Further, in the prior art parametric speech synthesizers, the source of the synthetic speech has been limited to the voice and noise sources modulated and processed by the resonator and filter networks as discussed above. And while concatenative speech synthesizers have utilized recorded human speech segments, the objective there was essentially to use real human speech to generate the desired synthetic speech of the same sound. However, utilization of recorded wave samples of real human speech in place of the voice source in parametric synthesizers is new to the art, presumably because of the likely disparity between the recorded human speech segments and the desired spectral characteristics of a voice source.
The present invention takes a different approach from the prior art by also being capable of utilizing one or more recorded sound samples as the voice source in a parametric speech synthesizer. Utilization of such sound sources provides entirely new, essentially limitless spectral qualities to the voice source of a speech synthesizer. Not only can a wider range of synthetic speech be generated due to the wider variety of voice sources, but further, a wide range of interesting and entertaining speech effects can be achieved. For example, a recorded sound wave sample of a teakettle can be used to create a talking teakettle, thus providing an entertaining way to communicate with children who otherwise might lack the interest or attention span to listen to the possibly educational information imparted thereby.
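As a hedged sketch of the idea only, and not of the actual implementation of the invention, a recorded sample can be looped to the required length and passed through formant resonators in place of a conventional voice source. The synthetic 2 kHz tone below merely stands in for a recorded teakettle sample; the resonator form, the formant values, and all names are assumptions of this sketch.

```python
import math


def looped_source(sample, n):
    """Repeat a recorded sample end to end until n samples are produced."""
    return [sample[i % len(sample)] for i in range(n)]


def resonator(signal, freq_hz, bw_hz, sample_rate):
    """Generic second-order resonator (assumed form, not from the source text)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


sample_rate = 10000
# Stand-in for a recorded sound wave sample; a real system would load recorded data.
kettle = [math.sin(2.0 * math.pi * 2000.0 * i / sample_rate) for i in range(50)]

source = looped_source(kettle, 1000)
talking_kettle = source
for freq_hz, bw_hz in [(730.0, 60.0), (1090.0, 90.0), (2440.0, 120.0)]:
    talking_kettle = resonator(talking_kettle, freq_hz, bw_hz, sample_rate)
```

Because the resonators impose the formant structure regardless of what excites them, the spectral character of the recorded sample carries through into the shaped output.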