The present invention deals with speech properties. More specifically, the present invention deals with unit inventories in text-to-speech systems.
Speech signal generators or synthesizers in a text-to-speech (TTS) system can be classified into three distinct categories: articulatory synthesizers; formant synthesizers; and concatenative synthesizers. Articulatory synthesizers are based on the physics of sound generation in the vocal apparatus. Individual parameters related to the position and movement of the vocal cords are provided. The sound generated therefrom is determined according to physics. In view of the complexity of the physics, practical applications of this type of synthesizer are considered to be far off.
Formant synthesizers do not use equations of physics to generate speech, but rather, model acoustic features or the spectra of the speech signal, and use a set of rules to generate speech. In a formant synthesizer, a phoneme is modeled with formants wherein each formant has a distinct frequency “trajectory” and a distinct bandwidth which varies over the duration of the phoneme. An audio signal is synthesized by using the frequency and bandwidth trajectories to control a formant synthesizer. While the formant synthesizer can achieve high intelligibility, its “naturalness” is typically low, since it is very difficult to accurately describe the process of speech generation in a set of rules. In some systems, in order to mimic natural speech, the synthetic pronunciation of each phoneme is determined by a set of rules which analyzes the phonetic context of the phoneme. U.S. Pat. No. 4,979,216 issued to Malsheen et al. describes a text-to-speech synthesis system and method using context dependent vowel allophones.
Concatenation systems and methods for generating text-to-speech operate under an entirely different principle. Concatenative synthesis uses pre-recorded actual speech forming a large database or corpus. The corpus is segmented based on phonological features of a language. Commonly, the phonological features include transitions from one phoneme to at least one other phoneme. For instance, the corpus can be segmented into diphone units, syllables or even words. Diphone concatenation systems are particularly prominent. A diphone is an acoustic unit which extends from the middle of one phoneme to the middle of the next phoneme. In other words, the diphone includes the transition between the two partial phonemes. It is believed that synthesis using concatenation of diphones provides good voice quality since each diphone is concatenated with adjoining diphones where the beginning and the ending phonemes have reached steady state, and since each diphone records the actual transition from phoneme to phoneme.
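The diphone definition above can be illustrated with a minimal sketch. It assumes that an utterance is padded with a silence label at both ends, which is a common convention but is not stated in the text; the function name `to_diphones` and the label `sil` are hypothetical.

```python
def to_diphones(phonemes, silence="sil"):
    """Return diphone labels for a phoneme sequence.

    Each diphone spans from the middle of one phoneme to the middle of
    the next, so it is named here by the pair of phonemes it bridges.
    Padding with a silence label is an assumption made for illustration.
    """
    padded = [silence] + list(phonemes) + [silence]
    # Pair every phoneme with its right-hand neighbour.
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


# Three phonemes give four diphones, one per phoneme-to-phoneme transition.
print(to_diphones(["j", "e", "s"]))
```

Under these assumptions, a sequence of n phonemes yields n + 1 diphones, each carrying one recorded transition.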
In a concatenative text-to-speech (TTS) system, speech output is generated by concatenating small pre-stored speech segments one by one. Most state-of-the-art TTS systems adopt corpus-driven approaches, called unit selection, due to their capability to generate highly natural speech. In these systems, a set of “atom units” is defined, that is, the smallest constituents in the concatenation procedure, which cannot be segmented further. Typically, many instances of each unit, with phonetic and prosodic variations, are kept in a very large unit inventory, and a unit selection algorithm is used to select the most suitable unit sequence by minimizing a cost function.
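Selecting a unit sequence by minimizing a cost function can be sketched as a dynamic-programming search. The example below is an illustration only, not the method of any particular system: it assumes the total cost is the sum of a per-unit target cost and a pairwise concatenation cost, and the names `select_units`, `target_cost` and `concat_cost`, as well as the pitch values, are hypothetical.

```python
def select_units(candidates, target_cost, concat_cost):
    """Pick one candidate per position so that the summed cost is minimal.

    candidates  : list of lists; candidates[i] holds the inventory
                  instances available for position i of the utterance.
    target_cost : function (position, unit) -> cost of using that unit.
    concat_cost : function (previous_unit, unit) -> cost of joining them.
    Returns (total_cost, chosen_units).
    """
    # best[i][k] = (cheapest cost of a path ending in candidates[i][k],
    #               index of the predecessor on that path)
    best = [[(target_cost(0, u), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for unit in candidates[i]:
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(prev, unit), k)
                for k, prev in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(i, unit), back))
        best.append(row)

    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=lambda j: best[-1][j][0])
    total = best[-1][k][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k])
        k = best[i][k][1]
    return total, path[::-1]


# Hypothetical inventory: each instance is (label, pitch in Hz).
candidates = [
    [("j", 90), ("j", 105)],
    [("e", 100), ("e", 130)],
    [("s", 118), ("s", 150)],
]
desired_pitch = [100, 110, 120]  # hypothetical prosodic targets

total, units = select_units(
    candidates,
    target_cost=lambda i, u: abs(u[1] - desired_pitch[i]),
    concat_cost=lambda a, b: 0.5 * abs(a[1] - b[1]),
)
print(total, units)
```

The search visits every pair of adjacent candidates once, so its running time grows with both the number of units in the utterance and the number of instances kept for each unit.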
Defining a suitable set of atom units is very important for such systems. There is always a balance between two conflicting requirements for the unit inventory. On the one hand, in order to get natural prosody, smaller units are preferred so that a pre-recorded unit inventory can cover as many prosodic variations of each unit as possible. On the other hand, in order to make concatenated utterances smooth, larger units are preferred because they reduce the likelihood of an unsmooth concatenation in the synthesized utterances. Strategies for defining the atom unit differ among languages due to the different phonological characteristics of languages. For languages that have a relatively small syllable set, such as Chinese, which contains fewer than 2000 syllables, syllables are often used as the atom units. However, using syllables as atom units becomes somewhat impractical for languages that have too many syllables to enumerate effectively. For example, English contains more than 20,000 possible syllables. This makes it difficult to generate a closed list of syllables for English. In such a language, smaller atom units such as the phoneme, the diphone or a mixture of the two are often adopted. However, using such small units has many shortcomings.
Using smaller units means more units per utterance and more instances per unit. The result is a much larger search space for unit selection, so more search time is required during speech generation.
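The growth of the search space can be made concrete with a rough calculation. The sketch below assumes, for simplicity, that every unit has the same number of instances and that the search is a dynamic-programming pass over adjacent candidate pairs; the function name and all figures are hypothetical and are not taken from any real inventory.

```python
def search_effort(units_per_utterance, instances_per_unit):
    """Return (number of possible unit sequences, pairwise join evaluations).

    The first value is the size of the full search space; the second is
    the work of a dynamic-programming search that scores every pair of
    adjacent candidates once. Uniform instance counts are an assumption.
    """
    sequences = instances_per_unit ** units_per_utterance
    joins = (units_per_utterance - 1) * instances_per_unit ** 2
    return sequences, joins


# Hypothetical figures for one utterance with larger and smaller units.
_, joins_large = search_effort(units_per_utterance=10, instances_per_unit=50)
_, joins_small = search_effort(units_per_utterance=30, instances_per_unit=500)
print(joins_large, joins_small)
```

With these illustrative numbers, tripling the units per utterance and keeping ten times as many instances per unit raises the join evaluations from 22,500 to 7,250,000.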
Smaller units also make precise unit segmentation more difficult, and precise segmentation is crucial to the quality of the synthesized speech. For example, in English, the word ‘yes’ consists of three phones, /j/, /e/ and /s/, where the boundary between /e/ and /s/ can be labeled easily, yet it is difficult to separate /j/ from /e/ due to the flat transition between their formant tracks. Moreover, experimentation shows that if the co-articulation between two phones is strong, it is difficult to smoothly concatenate two segments selected from different locations during the synthesis phase.
Therefore, there is a need for a method of defining a set of atom units having a size between that of a phone and that of a syllable, in order to increase the overall efficiency of the text-to-speech system in large-syllable languages such as English.