Speech is a primary form of communication, capable of conveying both information and emotion. Information is conveyed by words, while emotion is typically expressed by inflections in a speaker's voice. In humans, speech waveforms are created by vocal cords, located in the speaker's larynx. The waveforms then propagate through a vocal cavity, consisting of a series of flexible, irregularly shaped tubes, including the speaker's throat, mouth, and nasal passages. At the speaker's lips and various other structures, parts of the waveforms are further transmitted, while other parts are reflected. Flow of the waveforms may be significantly constricted or even completely interrupted by the speaker's uvula, teeth, tongue or lips.
Voiced sounds, such as vowels, occur when the vocal cords produce a regular waveform. Unvoiced sounds, such as certain consonants (for example, "s" and "f"), occur when some part of the vocal cavity is tightened, restricting transmission of the waveforms.
The waveforms produced may be characterized by many parameters, including frequency and amplitude. Using Fourier analysis, speech waveforms may be represented in a frequency domain as a spectral frame, consisting of spectral components. The spectral frame contains the waveform's lowest, or fundamental, frequency, along with its harmonics (spectral components which occur at multiples of the fundamental frequency). Spectral components from string instruments and from vowels in speech typically occur at close to whole number multiples of the fundamental frequency, while spectral components from percussion instruments often occur at non-integral multiples of the fundamental frequency.
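As a rough illustration of the frequency-domain representation described above, the following Python sketch builds a short waveform from a fundamental and two harmonics and recovers those components with a naive discrete Fourier transform. The sample rate, frame length, fundamental frequency, and harmonic amplitudes are illustrative assumptions, not values taken from the text.

```python
import cmath
import math

SAMPLE_RATE = 8000                 # Hz (assumed for illustration)
FRAME_LENGTH = 400                 # samples, giving 20 Hz between DFT bins
FUNDAMENTAL_HZ = 200.0             # assumed fundamental frequency
HARMONIC_AMPS = [1.0, 0.5, 0.25]   # assumed amplitudes of harmonics 1, 2, 3


def dft_magnitudes(samples):
    """Single-sided magnitude spectrum of a real signal via a naive DFT."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        acc = sum(
            samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(2.0 * abs(acc) / n)
    return mags


# Time-domain waveform: a sum of sinusoids at whole-number multiples of the fundamental.
frame = [
    sum(
        amp * math.sin(2.0 * math.pi * FUNDAMENTAL_HZ * (h + 1) * t / SAMPLE_RATE)
        for h, amp in enumerate(HARMONIC_AMPS)
    )
    for t in range(FRAME_LENGTH)
]

spectrum = dft_magnitudes(frame)
bin_hz = SAMPLE_RATE / FRAME_LENGTH

# Keep only the bins that carry significant energy (the spectral components).
peaks = [
    (round(k * bin_hz), round(mag, 3))
    for k, mag in enumerate(spectrum)
    if mag > 0.1
]
print(peaks)  # [(200, 1.0), (400, 0.5), (600, 0.25)]
```

Because the frame here holds a whole number of periods, each harmonic falls exactly on one bin. Real speech frames generally do not, and a practical analysis would apply a window function and a fast Fourier transform instead of this naive loop.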
Humans are particularly sensitive to peaks and valleys in an overall shape of the spectral frame. Viewed in the frequency domain, the shape of the spectral frame is characterized by a number of formants. A formant, for purposes of the present discussion, is defined as a frequency region, spanning two or more harmonics, in which the amplitudes of the spectral components are significantly raised or lowered. In musical instruments, formants are formed by the shape of a resonating body. As different notes are played, the fundamental frequency changes, while the formants remain fixed. This fixed formant pattern allows a listener to identify different musical instruments easily and even to distinguish otherwise identical instruments (such as Stradivarius violins) from one another.
In speech, formants are created by the shape of the speaker's vocal cavity, including a position of the speaker's tongue and jaw. A basic unit of speech differentiation is a phoneme, defined as a sound at the level of consonants and vowels. A phoneme may be represented in the frequency domain as a single spectral frame, having a particular formant pattern. By changing the vocal cavity, a speaker can form different formants, and therefore, different phonemes, diphthongs, syllables and words.
With the widespread availability of computers with multimedia capability, it is desirable to enable computers to reproduce or synthesize both human speech and musical sounds. Computers use a number of different technologies to create sounds. Two widely used techniques are frequency modulation (FM) synthesis and wavetable synthesis.
Used extensively in digital musical and multimedia devices, FM synthesis techniques generally use one or more periodic modulator signals to modulate a frequency of a sinusoidal carrier signal. Though useful for creating expressive new synthesized sounds, FM synthesis techniques have proven disappointing at accurately recreating natural sounds.
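The basic FM operation described above can be sketched in a few lines of Python. This is a minimal single-modulator sketch written in the commonly used phase-modulation form; the function name, the parameter names, and the numeric values are assumptions made for illustration and are not drawn from any particular device.

```python
import math


def fm_samples(carrier_hz, modulator_hz, index, sample_rate, count):
    """Generate samples of y(t) = sin(2*pi*fc*t + index * sin(2*pi*fm*t)).

    `index` (the modulation index) controls how strongly the modulator
    deviates the carrier, and therefore how many sidebands appear.
    """
    out = []
    for n in range(count):
        t = n / sample_rate
        modulator = math.sin(2.0 * math.pi * modulator_hz * t)
        out.append(math.sin(2.0 * math.pi * carrier_hz * t + index * modulator))
    return out


# Assumed example values: 440 Hz carrier, 110 Hz modulator, index of 2.
tone = fm_samples(440.0, 110.0, 2.0, 8000, 256)

# With an index of zero the modulator has no effect and a plain sine remains.
plain = fm_samples(440.0, 110.0, 0.0, 8000, 256)
```

A single index value changes the entire sideband structure at once, which helps explain why FM is expressive yet hard to steer toward a specific natural spectrum.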
An important factor in the utility of any synthesis technique is a degree of control that a user can exercise over the sounds produced. Wavetable synthesis systems, for example, can store high quality sound samples digitally and then replay these sounds on demand. Waveshaping synthesis is another approach that provides the user with a high degree of control over the spectral frame of an output signal. Sampled sounds are digitized and represented in the frequency domain as a spectral frame, containing a distinctive formant pattern. Using conventional techniques, the spectral frame can then be represented as a non-linear transfer function. Waveshaping synthesis is performed by driving the non-linear transfer function with a sinusoidal signal at a fundamental frequency. Waveshaping synthesis techniques were used in a few early digital music synthesizers such as the Buchla 400 series and, more recently, in the Korg 01/W.
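The text does not name the conventional technique used to turn a spectral frame into a non-linear transfer function. One well-known option, assumed here purely for illustration, is a weighted sum of Chebyshev polynomials of the first kind, which satisfy T_k(cos x) = cos(kx). Driving such a function with a unit-amplitude cosine at the fundamental yields harmonic k with exactly the weight assigned to T_k. The function names and amplitudes below are hypothetical.

```python
import math


def chebyshev_transfer(harmonic_amps):
    """Build f(x) = sum_k a_k * T_k(x) for k = 1..len(harmonic_amps)."""
    def transfer(x):
        t_prev, t_curr = 1.0, x  # T_0(x) and T_1(x)
        total = 0.0
        for amp in harmonic_amps:
            total += amp * t_curr
            # Chebyshev recurrence: T_{k+1}(x) = 2x * T_k(x) - T_{k-1}(x)
            t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
        return total
    return transfer


def waveshape(transfer, f0_hz, sample_rate, count):
    """Drive the transfer function with a cosine at the fundamental frequency."""
    return [
        transfer(math.cos(2.0 * math.pi * f0_hz * n / sample_rate))
        for n in range(count)
    ]


def additive_reference(harmonic_amps, f0_hz, sample_rate, count):
    """Direct sum of harmonics, used only to check the waveshaper output."""
    return [
        sum(
            amp * math.cos(2.0 * math.pi * f0_hz * (k + 1) * n / sample_rate)
            for k, amp in enumerate(harmonic_amps)
        )
        for n in range(count)
    ]


AMPS = [1.0, 0.5, 0.25]  # assumed spectral frame: harmonics 1, 2, 3
shaped = waveshape(chebyshev_transfer(AMPS), 220.0, 8000, 64)
reference = additive_reference(AMPS, 220.0, 8000, 64)
max_error = max(abs(s - r) for s, r in zip(shaped, reference))
```

In this sketch the weights are attached to harmonic numbers rather than to absolute frequencies, so the same transfer function produces the same relative spectrum at any fundamental.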
FM and wavetable synthesis are the predominant multimedia synthesis methods. Waveshaping synthesis is an alternative technique that can also be used in applications involving the reproduction of human speech. To produce a sound having a particular tonal quality, the user must first select the appropriate transfer function containing the spectral frame and formant pattern information. Musical tones are then produced by driving the transfer function with the appropriate fundamental frequency.
Human speech relies heavily on inflection to carry emotional content, so synthesized speech that lacks inflection is at a disadvantage. Adding inflection to speech necessarily involves shifting the fundamental frequency of the speech. In waveshaping synthesis, however, the transfer function fixes the amplitude of each harmonic relative to the fundamental, so any shift in the fundamental frequency results in a corresponding shift in the formant pattern. The formant pattern must be reproduced without substantive change for the resulting speech to be understandable. Shifts in the formant pattern therefore result in a loss of speech intelligibility and realism.
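The formant shift described above can be seen with simple arithmetic. In the following Python sketch, a fixed set of per-harmonic amplitudes stands in for a synthesizer whose spectral shape is tied to harmonic number. The amplitudes and frequencies are assumed values chosen only to make the effect visible.

```python
def harmonic_spectrum(f0_hz, harmonic_amps):
    """Pair each harmonic's frequency (k * f0) with its fixed amplitude."""
    return [(f0_hz * (k + 1), amp) for k, amp in enumerate(harmonic_amps)]


def peak_frequency(spectrum):
    """Frequency of the strongest component, a crude stand-in for a formant center."""
    return max(spectrum, key=lambda pair: pair[1])[0]


# Assumed amplitudes with a single peak at the third harmonic.
AMPS = [0.3, 0.6, 1.0, 0.6, 0.3]

flat = peak_frequency(harmonic_spectrum(100.0, AMPS))    # 300.0 Hz
raised = peak_frequency(harmonic_spectrum(150.0, AMPS))  # 450.0 Hz
print(flat, raised)  # 300.0 450.0
```

Raising the fundamental from 100 Hz to 150 Hz drags the spectral peak from 300 Hz to 450 Hz, whereas a natural vocal cavity would hold the formant near its original frequency while the pitch changed.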
One approach to speech synthesis that allows incorporation of inflection while retaining intelligibility is linear predictive coding (LPC), a mathematically intensive process that models the vocal cavity as a series of filters. LPC calculates the coefficients of the filters independently of the fundamental frequency. Shifts in the fundamental frequency due to inflection therefore do not affect the formant patterns produced by the filters. While LPC is capable of providing inflected speech from a general model, its computational costs are prohibitive when using filters of the complexity necessary to reproduce the speech of a specific speaker. As a result, most existing speech synthesis techniques have used less complex filters, resulting in comically mechanical speech that is robotic, artificial, and devoid of emotional content.
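To make the separation between filter and fundamental frequency concrete, the following Python sketch shows one common way LPC coefficients can be estimated: the autocorrelation method solved with the Levinson-Durbin recursion. The text does not specify which formulation is meant, so this choice, the function names, and the two-coefficient test filter are all assumptions for illustration.

```python
def all_pole_filter(excitation, coeffs):
    """y[n] = e[n] + sum_k coeffs[k-1] * y[n-k]: a simple vocal-cavity stand-in."""
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k, c in enumerate(coeffs, start=1):
            if n - k >= 0:
                y += c * out[n - k]
        out.append(y)
    return out


def pulse_train(period, count):
    """Excitation whose period sets the fundamental; the filter is untouched by it."""
    return [1.0 if n % period == 0 else 0.0 for n in range(count)]


def autocorrelation(samples, max_lag):
    n = len(samples)
    return [
        sum(samples[i] * samples[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] with x[n] ~ sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                      # reflection coefficient
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k                 # remaining prediction error
    return a[1:], error


# Assumed second-order resonator standing in for a single formant.
TRUE_COEFFS = [1.3, -0.8]

# Analysis: recover the filter from its (decaying) impulse response.
impulse_response = all_pole_filter([1.0] + [0.0] * 399, TRUE_COEFFS)
estimated, residual = levinson_durbin(autocorrelation(impulse_response, 2), 2)

# Synthesis: the same filter driven at two different pitch periods.
low_pitch = all_pole_filter(pulse_train(80, 400), estimated)
high_pitch = all_pole_filter(pulse_train(50, 400), estimated)
```

Only the excitation period differs between the two synthesized signals, so the resonance set by the coefficients stays where it is. Reproducing a specific speaker would call for far more coefficients per frame, which is where the computational cost noted above arises.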
Accordingly, what is needed in the art is a system and method for incorporating inflection into speech synthesis while avoiding a corresponding shift in the formant pattern and a resulting loss of intelligibility and realism.