1. Field of the Invention
The present disclosure relates generally to speech synthesis from symbolic input, such as text or phonetic transcription.
2. Background Information
In the past, a variety of systems have been developed that are able to synthesize audible speech from unconstrained symbolic input, such as user-provided text, phonetic transcription, and other input. When text is used as the symbolic input, these systems are commonly referred to as text-to-speech systems.
Such systems generally include a linguistic analysis component (a front end module) that converts the symbolic input into an abstract linguistic representation (ALR). An ALR depicts the linguistic structure of an utterance, which may include phrase, word, syllable, syllable nucleus, phone, and other information. (In some systems, the ALR may also include certain quantitative information, such as durations and fundamental frequency values.) The ALR is passed to a speech generation component (a back end module) that uses the information in the ALR to produce waveforms approximating human speech. A variety of back end approaches have been developed, yet most follow one of two predominant strategies.
The first strategy is often referred to as Rule-Based Speech Synthesis (RBSS). In this strategy, a set of context-sensitive rules is applied to the ALR to yield perceptually appropriate parameter values, such as formant (i.e., vocal tract resonance) frequencies. From these parameter values, a speech synthesizer produces a speech waveform. As used herein, the term speech synthesizer refers only to the specific back end component that produces a waveform from the parameter values, and does not include other components of a speech synthesis system, such as rules. The most widely used RBSS strategy is Rule-Based Formant Synthesis (RBFS), in which the rules directly produce formant frequencies, formant bandwidths, and other acoustic parameter values. Formants appear in speech spectrograms as frequency regions of relatively great intensity, and are important to human perception of speech. Vowels, for example, can often be identified by characteristics of their two or three lowest frequency formants, and the trajectories of formant frequencies at the edges of vowels are often perceptually important cues to the place and manner of articulation of adjacent consonants.
The parameter values produced by an RBFS system are passed to a formant-based speech synthesizer, or formant synthesizer, which uses them to produce a speech waveform. An example of a commonly used formant synthesizer is described in Dennis H. Klatt & Laura C. Klatt, Analysis, Synthesis, and Perception of Voice Quality Variations Among Female and Male Talkers, 87(2) Journal of the Acoustical Society of America, 820-857 (1990), which is herein incorporated by reference.
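By way of illustration only, the following sketch shows how a formant frequency and bandwidth may be turned into a filter of the kind formant synthesizers are commonly built from, namely a second-order digital resonator with unity gain at zero frequency. It is not taken from the synthesizer cited above. The function names and the example values (500 Hz, 60 Hz, 10 kHz) are assumptions chosen for the illustration.

```python
import math
from typing import List, Sequence, Tuple


def resonator_coefficients(
    freq_hz: float, bandwidth_hz: float, sample_rate_hz: float
) -> Tuple[float, float, float]:
    """Coefficients (a, b, c) of y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    period = 1.0 / sample_rate_hz
    # The pole radius is set by the bandwidth, the pole angle by the frequency.
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * period)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * period)
         * math.cos(2.0 * math.pi * freq_hz * period))
    # Choosing a = 1 - b - c normalizes the gain at 0 Hz to one.
    a = 1.0 - b - c
    return a, b, c


def resonate(
    signal: Sequence[float],
    freq_hz: float,
    bandwidth_hz: float,
    sample_rate_hz: float,
) -> List[float]:
    """Filter `signal` through a single formant resonator."""
    a, b, c = resonator_coefficients(freq_hz, bandwidth_hz, sample_rate_hz)
    y1 = y2 = 0.0  # the two previous output samples
    out: List[float] = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


# Hypothetical example: one resonator at 500 Hz with a 60 Hz bandwidth.
response = resonate([1.0] + [0.0] * 99, 500.0, 60.0, 10000.0)
```

In a complete formant synthesizer, several such resonators (one per formant) would typically be combined and driven by a voicing or noise source, with the rules supplying new frequency and bandwidth values over time.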
RBFS systems have a number of advantages. For example, given appropriate rules, they produce smooth, readily intelligible speech. They also generally have a small memory footprint, are highly predictable (i.e., the characteristics and quality of speech output vary little from one utterance to the next), and can easily generate different voices, voice characteristics (e.g., different degrees of breathiness), pitch patterns, rates of speech, and other properties of speech output “on the fly.”
Unfortunately, offsetting these positive aspects are certain prominent shortcomings. Foremost among these is that speech generated by RBFS systems generally sounds distinctly non-human, having a machine-like timbre, or voice quality. Such speech, while often highly intelligible, would not generally be mistaken for natural human speech. The non-human voice quality of RBFS speech is often particularly pronounced with voices that are intended to mimic female or child speakers. A related shortcoming of RBFS systems is that they are generally poorly suited to producing voices that mimic particular human speakers.
The second back end strategy, Concatenative Speech Synthesis (CSS), offers its own set of advantages and disadvantages. In CSS, speech segments originally derived from recorded human speech (henceforth speech units) are extracted from a database and concatenated to produce the desired utterance.
CSS systems differ as to the number, size, and types of speech units that are employed. Early systems generally employed short, fixed-length speech units. Rather than being stored directly as waveforms, the units in these early systems were generally stored in a more compact parameterized form obtained through signal processing, for example in terms of Linear Predictive Coding (LPC) coefficients. A speech synthesizer was then used to construct waveforms from the parameter values. One particularly common type of unit, still in use today, was the diphone (i.e., the second half of one phone followed by the first half of the next, including the transitional portion between the phones). In early diphone systems, usually only a single predetermined unit was stored for a given pair of phonemes (i.e., the vowels and consonants of the language). For example, for each pair of phonemes, such as /b-a/, /d-a/, /b-i/, or /d-i/, a diphone system would generally store a single corresponding speech unit. Such systems, however, while simple, had a number of problems, not the least of which was that due to both the nature of the units themselves and the limited number of them, these systems could not produce many of the required contextual variants of phonemes necessary for natural-sounding speech.
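As a minimal sketch of the early diphone approach described above, the following example converts a phone sequence into the diphone names that would be looked up, with exactly one stored unit per name. The boundary symbol "#", the function names, and the placeholder inventory contents are hypothetical and are used only for illustration.

```python
from typing import Dict, List, Sequence

SILENCE = "#"  # hypothetical symbol for the silence at utterance edges


def to_diphones(phones: Sequence[str]) -> List[str]:
    """Name each diphone spanning two neighboring phones."""
    padded = [SILENCE] + list(phones) + [SILENCE]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def lookup_units(phones: Sequence[str], inventory: Dict[str, str]) -> List[str]:
    """Fetch the single stored unit for each diphone; KeyError if one is missing."""
    return [inventory[name] for name in to_diphones(phones)]


# Hypothetical inventory: one entry per diphone, regardless of context.
inventory = {"#-b": "unit_017", "b-a": "unit_102", "a-d": "unit_230", "d-#": "unit_311"}
units = lookup_units(["b", "a", "d"], inventory)
```

Because the inventory maps each diphone name to one unit, the same stored /b-a/ unit is used in every context, which is the limitation noted above.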
To overcome these problems, more recent CSS systems have employed a much larger number of speech units, often of varying sizes, which are stored directly as waveforms. In fact, modern unit selection synthesis systems often store in their speech databases large numbers of entire phrases or sentences, which are segmented, or labeled, into more basic components, or basic speech units, such as diphones. The precise type of the basic speech units differs depending on the system, with examples including diphones, half-phones, demisyllables, and triphones. Note that in a unit selection synthesis system, in contrast to the early CSS systems discussed above, for a given sequence of phones, there may be many different variants of basic speech units and sequences thereof that could be selected from the database. Regardless of the precise nature of the units, however, the goal of a unit selection system generally remains the same: since there are often many possible units that can be selected to construct a given utterance, the goal is to realize the utterance represented by the ALR by selecting the most appropriate sequence of units from the speech database.
In order to minimize the number of concatenation points, where audible discontinuities and other problems resulting in speech quality degradations may occur, unit selection synthesis systems often attempt to select the longest sequences of adjacent basic speech units possible that will meet the constraints imposed by the unit selection algorithms. In some situations, basic unit sequences encompassing entire words or phrases may be selected. When necessary, however, unit selection synthesis systems must resort to constructing the phoneme sequences in question out of the basic speech units, such as the diphones or half-phones, selected from non-adjacent portions of the stored utterances.
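The preference for long runs of adjacent units can be sketched as a shortest-path search in which joining two units costs nothing when they were adjacent in the same stored utterance and costs one otherwise. This is a simplified illustration under stated assumptions, not the algorithm of any particular system: practical unit selection systems generally also weigh how well each unit fits its target context and how well units match acoustically at the join. The names, the database layout, and the assumption that every target has at least one candidate are hypothetical.

```python
from typing import Dict, List, Tuple

Unit = Tuple[str, int]  # (stored utterance id, position of the unit within it)


def join_cost(prev: Unit, cur: Unit) -> int:
    """0 if the units were adjacent in the same recording, else 1."""
    return 0 if prev[0] == cur[0] and cur[1] == prev[1] + 1 else 1


def select_units(
    targets: List[str], database: Dict[str, List[Unit]]
) -> Tuple[List[Unit], int]:
    """Pick one candidate per target so that concatenation points are fewest."""
    # best[u] holds (cost, path) for the cheapest path ending in candidate u.
    best = {u: (0, [u]) for u in database[targets[0]]}
    for label in targets[1:]:
        step = {}
        for cur in database[label]:
            cost, path = min(
                ((c + join_cost(p, cur), pth) for p, (c, pth) in best.items()),
                key=lambda item: item[0],
            )
            step[cur] = (cost, path + [cur])
        best = step
    cost, path = min(best.values(), key=lambda item: item[0])
    return path, cost


# Hypothetical database: candidates for each basic unit of the word "bad".
database = {
    "#-b": [("u1", 0), ("u3", 0)],
    "b-a": [("u1", 1)],
    "a-d": [("u2", 2)],
    "d-#": [("u2", 3), ("u4", 5)],
}
path, joins = select_units(["#-b", "b-a", "a-d", "d-#"], database)
```

In this example the search keeps two adjacent units from utterance u1 and two from utterance u2, so only one concatenation point falls between non-adjacent units.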
Unit selection CSS systems have the potential to produce reasonably natural-sounding speech, especially in select situations where long sequences of contextually appropriate adjacent basic speech units from a stored utterance can be utilized. However, this potential is offset by a variety of shortcomings. For example, with existing methods, it has proved difficult to produce speech that is at the same time natural-sounding, intelligible, and of consistent quality from utterance to utterance and from voice to voice. Further, higher-quality CSS systems often introduce extensive memory and processing requirements, which render them suitable only for implementation on high-powered computer systems and for applications that can accommodate these requirements. Furthermore, even when the necessary processing power and storage are available, large speech databases are still problematic. The more speech that is recorded and stored, the more labor-intensive database preparation becomes. For example, it becomes more difficult to accurately label the speech recordings in terms of their basic speech units and other information required by the back end speech generation components. For this and other reasons, it also becomes more time-consuming and expensive to add new voices to the system.
One challenge facing the developer of a speech synthesis system designed to produce speech from unconstrained input stems from the fact that although there are a limited number of speech sounds, or phonemes, that humans perceive for any given dialect, these phonemes are realized differently in different contexts. Among the factors that influence the acoustic realizations (variants) of a phoneme are the neighboring segments of the phoneme, the amount of stress of the syllable containing the phoneme, the phoneme's syllable position, word position, and phrase position, and the rate of speech.
Consider, for example, the words dad and bat. These words each have the same vowel phoneme /æ/. However, when these words are spoken, the directions and other characteristics of the formant transitions at the beginning of the vowel (reflecting the movement of the articulators from the initial consonant [d] or [b] into the vowel) differ in each case. The particular characteristics of the formant transitions are important perceptual cues to the place of articulation of the word-initial consonant. Thus the words dad and bat could not be created using the same vowel units. In fact, the important perceptual function of different formant transitions is one of the main motivating factors behind the use of diphones and other common basic units underlying CSS synthesis, which are generally designed to preserve these transitions.
However, it is not only the transitions at the edges of vowels that may differ in different contexts, but other portions of vowels as well. For example, another important perceptual difference between the vowels in dad and bat in many dialects of English is that the vowel of dad is considerably longer than that of bat (provided that both words occur in otherwise similar contexts), since the vowel precedes a voiced consonant ([d]) in the same syllable as opposed to a voiceless one ([t]). The different vowel durations in the two words are perceptually important cues to the voicing characteristics of the post-vocalic consonants. To complicate matters further, transition and non-transition portions of vowels may lengthen and shorten non-uniformly (e.g., transitions at the edges of vowels may remain relatively stable in duration while the remaining portion of the vowel lengthens). Formant values and other characteristics of vowels may also be influenced by a variety of contextual factors. Thus in a system that constructs vowels from separate units (e.g., separate diphones) originally spoken in different utterances and/or contexts, it is a challenge to select the units not only such that they produce appropriate transitions for the context, but also appropriate overall durations, formant patterns, and the like. The difficulty of producing appropriate acoustic patterns is compounded by the fact that what are linguistically single vowels are often split across the basic units underlying CSS systems.
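The non-uniform lengthening described above can be illustrated with simple arithmetic. The sketch below contrasts scaling all portions of a vowel by the same factor with holding the transition durations fixed and assigning the whole change to the remaining portion. The three-part division of the vowel, the function names, and the millisecond values are assumptions made only for the example and are not measurements.

```python
from typing import Tuple

Segments = Tuple[float, float, float]  # (onset transition, middle, offset transition) in ms


def stretch_uniform(vowel: Segments, target_ms: float) -> Segments:
    """Scale every portion of the vowel by the same factor."""
    factor = target_ms / sum(vowel)
    onset, middle, offset = vowel
    return (onset * factor, middle * factor, offset * factor)


def stretch_nonuniform(vowel: Segments, target_ms: float) -> Segments:
    """Keep both transitions fixed; lengthen or shorten only the middle."""
    onset, middle, offset = vowel
    new_middle = target_ms - onset - offset
    if new_middle < 0:
        raise ValueError("target is shorter than the two transitions combined")
    return (onset, new_middle, offset)


# Hypothetical vowel of 140 ms lengthened to 210 ms.
short_vowel = (40.0, 60.0, 40.0)
uniform = stretch_uniform(short_vowel, 210.0)        # transitions grow to 60 ms each
nonuniform = stretch_nonuniform(short_vowel, 210.0)  # transitions stay at 40 ms each
```

Both results have the same overall duration, but only the second preserves the original transition durations, which is the pattern described above for some contexts.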
There is a need, then, for new techniques that improve upon both the existing RBSS and CSS techniques used in the back end of speech synthesis systems. While RBSS techniques, at least in principle, have the flexibility to produce virtually any contextual variant that is perceptually appropriate in terms of duration, fundamental frequency, formant values, and certain other important acoustic parameters, the production of human-sounding voice quality or speech that mimics a particular speaker has remained elusive, as mentioned above. While certain CSS techniques at least in principle can mimic particular voices and create natural-sounding speech in cases where appropriate units are selected, excessively large databases are required for applications in which the input is unconstrained, and further, the unit selection techniques themselves have been less than adequate.
Specifically, synthesis techniques are needed that can be used in a single synthesis system that combines the best features of RBSS and CSS systems, rather than trading one feature for another. Such techniques should provide for human-sounding speech, the ability to mimic particular voices, cost-efficient development of voices, dialects, and languages, consistent speech output, and use of the system on a large range of hardware and software configurations including those with minimal memory and/or processing power.