Re-creation or synthesis of human speech has been an objective for many years and has been discussed in serious texts as well as in science fiction writings. Human speech, like many other natural human abilities such as sight or hearing, is a fairly complicated function. Synthesizing human speech is therefore far from a simple matter.
Various approaches have been taken to synthesize human speech. One approach is known as parametric. Parametric synthesis of human speech uses mathematical models to recreate a desired sound. For each desired sound, a mathematical model or function is used to generate that sound. Thus, other than possibly in the creation of the underlying mathematical models, parametric synthesis of human speech is completely devoid of any original human speech input.
Another approach to human speech synthesis is known as concatenative. Concatenative synthesis of human speech is based on recording samples of real human speech. Concatenative speech synthesis then breaks down the pre-recorded original human speech into segments and generates novel speech utterances by linking these speech segments to build syllables, words, or phrases. The size of the pre-recorded speech segments may vary from diphones, to demi-syllables, to whole words.
Various approaches to segmenting the recorded original human voice have been used in concatenative speech synthesis. One approach is to break the real human voice down into basic units of contrastive sound. These basic units of contrastive sound are commonly known in the art of the present invention as phones or phonemes.
It is generally agreed that in General American English (a variety of American English that has no strong regional accent, and is typified by Californian, or West Coast American English), there are approximately 40 phones. Note that this number may vary slightly, depending upon one's theoretical orientation, and according to the quality level of synthesis desired. Thus, to synthesize high quality speech, a few sounds may be added to the basic set of 40 phones. In the preferred embodiment of the present invention, there are a total of 50 phones (see Appendix A) used. Again, these 50 phones consist of real human speech pitch-period waveform data samples.
However, generating human speech of a quality acceptable to the human ear requires more than merely concatenating together again the phones which have been excised from real human speech. Such a technique would produce unacceptably choppy speech because the areas of most sensitive acoustic information have been sliced, and rule-based recombination at these points will not preserve the fine structure of the acoustic patterns, in the time and frequency domains, with adequate fidelity.
A better, and commonly used, approach is therefore to slice up the real original human speech at areas of relative constancy. These areas of relative constancy occur, for example, during the steady state (middle) portion of a vowel, at the midway point of a nasal, before the burst portion of a stop consonant, etc. In order to concatenate human speech phones at these points or areas of relative constancy, segments known as diphones have been created that are composed of the transition between one sound and an adjacent sound. In other words, a diphone is comprised of a sound that starts in the center of one phone and ends in the center of a neighboring phone. Thus, diphones preserve the transition between sounds.
Note that the second half of one diphone and the first half of a following diphone (each known as a `demi-diphone`), taken together, are therefore frequently the physical equivalent of a phone.
To produce a diphone, two successive phones or sounds are sliced at their approximate midpoints and appended together. For example, the four different phones within the word `cat` are [SIL], [k], [AE], and [t]. Therefore, the four sets of two demi-diphones (each comprising roughly one half of a phone), or diphones, used for the word `cat` are: 1. [SIL] to [k]; 2. [k] to [AE]; 3. [AE] to [t]; and 4. [t] to [SIL].
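The decomposition of `cat` described above can be sketched in a few lines of Python. This is only an illustration of the pairing step: the function name `phones_to_diphones` is hypothetical, the phone labels are those used in the text, and no audio data is involved.

```python
def phones_to_diphones(phones):
    """Return the diphone sequence for a list of phones.

    Silence ([SIL]) is added at both ends, and each diphone is the
    pair formed by one phone and the phone that follows it.
    """
    padded = ["SIL"] + list(phones) + ["SIL"]
    return list(zip(padded, padded[1:]))


# The word `cat`: [k], [AE], [t], bounded by silence.
cat_diphones = phones_to_diphones(["k", "AE", "t"])
print(cat_diphones)
# [('SIL', 'k'), ('k', 'AE'), ('AE', 't'), ('t', 'SIL')]
```

A word of N phones therefore needs N + 1 diphones when it is spoken in isolation.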
In human speech it is possible, generally speaking, to make a transition from any phone to any other phone. Having 50 possible phones for General American English yields a matrix or table of 2500 possible diphone samples. Again, each of these diphone samples is thus comprised of the ending portion of one phone and the beginning portion of another phone.
Of course, there are many diphones that never occur in General American English. Two such diphones are: 1) SIL-NG, because no English word begins with a velar nasal, such as occurs at the end of `sing` (sIHNG); and 2) UH-EH, because no English word or syllable ends with the lax vowel UH, such as occurs in `put` (pUHt). Thus, if all the diphone data needed to handle all possible transitions from one General American English sound to another were sampled, the actual number of required samples would only be approximately 1800.
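As a rough illustration of the counting above, the following Python sketch builds the full 50 by 50 matrix of phone pairs and removes pairs that never occur. Only the four phone labels named in the text are real; the remaining 46 are placeholders standing in for the inventory of Appendix A, and only the two exclusions given as examples are applied, so the sketch does not reproduce the figure of approximately 1800.

```python
from itertools import product

# Assumption: placeholder labels pad the four named phones out to 50.
named = ["SIL", "NG", "UH", "EH"]
phones = named + [f"P{i:02d}" for i in range(46)]

# Every ordered pair of phones, including a phone followed by itself.
all_pairs = set(product(phones, repeat=2))

# Only the two non-occurring diphones cited in the text.
non_occurring = {("SIL", "NG"), ("UH", "EH")}

required = all_pairs - non_occurring
print(len(all_pairs), len(required))  # 2500 2498
```

Applying the complete set of phonotactic exclusions for General American English in place of `non_occurring` is what would reduce the count toward the approximately 1800 samples stated above.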
Of course, accurately recording 1800 different diphones requires a concerted effort. Situations have occurred where real human speech samples were taken only to later find out that some of the necessary diphones were missed. This lack of all necessary diphones results in less than acceptable sound synthesis quality.
What has been done in the prior art is to replace missing diphones with recorded diphones that are somewhat similar in sound (referred to in the art as `aliasing`). Take the case of the missing diphone [k] to [AE] (again, as occurs in the word `cat`). The ending portion of the phone [k] from the demi-diphone which begins the diphone [k] to [EH] (as occurs in the word `kettle`) could possibly be used as the beginning portion for the missing diphone. Likewise, the beginning portion of the phone [AE] from the demi-diphone which ends the diphone [KX] to [AE] (as occurs in the word `scat`) could possibly be used as the ending portion for the missing diphone. Then, the combination of these two demi-diphone portions could be used to fill in for the missing [k] to [AE] diphone. Thus, what has been done in the prior art is to alias demi-diphones for each half of a missing diphone. However, in the prior art, replacing missing diphones with existing sampled diphones (or two demi-diphones) was done in a haphazard, non-scientific way. The prior art aliasing thus usually resulted in the missing diphones (which were subsequently aliased to stored diphones or demi-diphones) lacking the natural sound of a real human voice, an obviously undesirable result in a human speech synthesis system.
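The demi-diphone substitution in the [k] to [AE] example can be sketched as a table lookup with a fallback. The Python below is a minimal illustration of that prior art style of aliasing only, not of any formalized approach; the inventory contents, the two alias tables, and the function name `fetch_diphone` are all hypothetical, and strings stand in for waveform data.

```python
# Stored diphones: (first phone, second phone) -> (left demi, right demi).
# Assumption: placeholder strings stand in for sampled waveform data.
inventory = {
    ("k", "EH"): ("k>EH:left", "k>EH:right"),
    ("KX", "AE"): ("KX>AE:left", "KX>AE:right"),
}

# Hypothetical hand-picked aliases for each half of a missing diphone.
left_alias = {("k", "AE"): ("k", "EH")}    # borrow the [k] half from `kettle`
right_alias = {("k", "AE"): ("KX", "AE")}  # borrow the [AE] half from `scat`


def fetch_diphone(pair):
    """Return (left demi, right demi) for a diphone, aliasing if missing."""
    if pair in inventory:
        return inventory[pair]
    left_src = left_alias.get(pair)
    right_src = right_alias.get(pair)
    if left_src not in inventory or right_src not in inventory:
        raise KeyError(f"no sample or alias for diphone {pair}")
    # Left half from one stored diphone, right half from another.
    return (inventory[left_src][0], inventory[right_src][1])


print(fetch_diphone(("k", "AE")))  # ('k>EH:left', 'KX>AE:right')
```

The quality of the result depends entirely on how the alias tables are filled in, which is the step the text describes as having been done in an ad hoc manner.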
Because no formalized aliasing approach is known to exist in the art, prior art text-to-speech or speech sound synthesis systems which did not include samples of all necessary diphones lacked the natural sound of a real human voice. The present invention overcomes this limitation in the prior art by setting forth such a formalized aliasing approach.
The formalized aliasing approach of the present invention thus overcomes the ad hoc aliasing approach of the prior art which oftentimes generated less than satisfactory speech synthesis sound output. Further, storing 1800 different diphone samples can consume a considerable amount of memory (approximately 3 megabytes). In memory limited situations, it may not be feasible or desirable to store all of the needed diphones. Therefore, the formalized aliasing approach of the present invention can also be used to lessen storage requirements for speech sound samples by only storing as many sound samples as memory capacity can support and utilizing the structured aliasing approach of the present invention to provide the needed sounds which are not stored.
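The storage trade-off can be made concrete with a small calculation. The Python sketch below assumes that `3 megabytes` means 3 x 1024 x 1024 bytes and that every diphone sample occupies the same amount of memory; neither assumption comes from the text, and real samples would vary in length. The function name `storable_diphones` is likewise hypothetical.

```python
TOTAL_DIPHONES = 1800
FULL_SET_BYTES = 3 * 1024 * 1024  # assumption: 3 megabytes read as 3 MiB


def storable_diphones(budget_bytes):
    """Number of diphones that fit in a budget, assuming uniform size."""
    count = budget_bytes * TOTAL_DIPHONES // FULL_SET_BYTES
    return min(count, TOTAL_DIPHONES)


budget = 1 * 1024 * 1024  # a hypothetical 1 MiB memory budget
stored = storable_diphones(budget)
aliased = TOTAL_DIPHONES - stored
print(stored, aliased)  # 600 1200
```

Under these assumptions, a system with one third of the memory would store 600 diphones and would need aliasing to supply the remaining 1200.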
Further, the uses of synthetic speech range from simple sound output to animation and `intelligent` assistants which appear on a display device to instruct the user or to tell the user about some event. In order to make the animation seem life-like, the sound output and the facial movements must be synchronized. The prior art technique for creating synchronized lip animation, so that facial images appear to `speak,` i.e. articulate their lips, tongue and teeth in synchrony with a recorded sound track, has been to use a limited set of `visemes.` A viseme is a minimal contrastive unit of visible articulation of speech sounds, i.e. a distinctive, isolated, and stationary articulatory position typically associated with a specific phone. Of course, for certain visemes, tongue and teeth image position is also relevant. An example set of visemes, along with a line drawing highlighting the most salient features of each, can be seen in FIG. 3.
In the prior art, when using visemes in conjunction with General American English, the number of visemes typically ranged from 9 to 32. This is in contrast to the approximately 40 (or 50, as explained herein) basic units of contrastive sounds, or phones, used in General American English. Phones (or phonemes) are the units in the speech domain which may be thought to parallel visemes in the visual domain, because both are minimal contrastive units, and both represent distinctive, isolated units in a theoretical set.
Further, in the prior art, in order to synchronize the phones to the visemes in a synthetic speech system, a mapping was made between the sound being generated and the image being displayed. This was done by mapping one viseme to each of the 40 or 50 phones and then, as the sound transitioned between phones, the displayed image transitioned between the associated visemes.
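The one-viseme-per-phone mapping described above amounts to a simple lookup table. The Python sketch below illustrates it for the word `cat`; the viseme labels are invented for the example and do not correspond to the set shown in FIG. 3, and the function name `viseme_track` is hypothetical.

```python
# Assumption: illustrative viseme labels, one per phone.
phone_to_viseme = {
    "SIL": "rest",
    "k": "open_back",
    "AE": "open_wide",
    "t": "tongue_alveolar",
}


def viseme_track(phones):
    """Return the viseme displayed for each phone, in order."""
    return [phone_to_viseme[p] for p in phones]


track = viseme_track(["SIL", "k", "AE", "t", "SIL"])
print(track)
# ['rest', 'open_back', 'open_wide', 'tongue_alveolar', 'rest']
```

Because each phone selects a single stationary image, nothing in this mapping describes how the face moves between one viseme and the next.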
However, as has already been explained herein, phones have not been found to be the best approach in producing high-quality synthesized speech from concatenative units. This is, again, due to the unacceptably choppy speech caused by trying to recombine phones at the areas of most sensitive acoustic information. Instead, diphones (made up of portions of phones which have been combined at their areas of relative constancy) have been used in the prior art. A similar problem results from merely trying to animate from one viseme to another viseme. The resulting image does not accurately reflect the facial imaging which occurs when a human speaker makes the same vocal or sound transition. Thus, what is needed is a mapping between synthetic speech and facial imaging which more accurately reflects the speech transitional movements for a realistic speaker image.