Machine-generated speech can be produced in many different ways and for many different applications, but two basic methods for synthesizing speech signals are currently in widespread use. One method constructs speech signals using a model, while the other concatenates pre-stored speech segments. Model-based approaches tend to be flexible and efficient in storage, but produce rather synthetic-sounding speech. An example of model-based speech synthesis is Hidden-Markov-Model (HMM) based speech synthesis described, for example, in T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, “Simultaneous Modeling of Spectrum, Pitch And Duration In HMM-Based Speech Synthesis,” in Proc. of Eurospeech, 1999, pp. 2347-2350, incorporated herein by reference.
The other method of speech synthesis, segment-concatenation, can produce very natural speech at its best, but is rather inflexible and requires large amounts of storage. A large corpus of speech data needs to be recorded and accurately labeled for use in a commercially viable text-to-speech system. An example of a segment-concatenation based approach is the Realspeak TTS system described, for example, in G. Coorman, J. Fackrell, P. Rutten & B. Van Coile, “Segment Selection In The L&H Realspeak Laboratory TTS System”, Proceedings of ICSLP 2000, pp. 395-398, incorporated herein by reference.
Table 1 establishes a typology of both TTS methods according to various characteristics:
| Category | Segment concatenation | Model-based |
| --- | --- | --- |
| Speech quality | Uneven quality, highly natural at best. Typically offers good segmental quality, but may suffer from poor prosody. | Consistent speech quality, but with a synthetic “processed” characteristic. |
| Corpus-size | Quality is critically dependent upon the size of the sound inventory. | Works well on a small corpus. |
| Signal manipulation | Minimal to none. | Signal manipulation by default. |
| Basic unit topology | Waveforms. | Speech parameters. |
| System footprint | Simple coding of the speech inventory leads to a large system footprint. | Heavy modelling of the speech signal results in a small system footprint. Systems are resilient to reduction in system footprint. |
| Generation quality | Quality is variable depending upon the length of continuous speech selected from the unit inventory. For example, limited domain systems, which tend to return long stretches of stored speech during selection, typically produce very natural synthesis. | Smooth and stable, more predictable behaviour with respect to previously unseen contexts. |
| Corpus-quality | Need accurately labelled data. | Tolerant towards labelling mistakes. |

As seen in Table 1, one significant difference between these two approaches is that model-based methods can construct previously unseen sounds (e.g., for a given prosodic context), whereas segment-based systems are constrained by their segment coverage. The dynamic construction of “unseen” sounds by falling back on other sub-segment model properties is a feature that enables generalization.
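The coverage difference described above can be illustrated with a toy sketch. It is not the method of either cited system: the inventories, the context labels, the parameter names (`f0_hz`, `duration_ms`), and the single-level fallback are all hypothetical simplifications chosen for illustration. The segment lookup has nothing to return for a context that was never recorded, while the model-based lookup falls back on a more general, context-independent entry.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical segment inventory: stored waveform samples keyed by
# (phone, prosodic context). Values are made-up placeholders.
SEGMENT_INVENTORY: Dict[Tuple[str, str], List[float]] = {
    ("a", "stressed"): [0.10, 0.30, 0.20],
    ("a", "unstressed"): [0.05, 0.10, 0.05],
}

# Hypothetical model parameters: context-dependent entries plus one
# context-independent entry per phone (context None) used as a fallback.
MODEL_PARAMS: Dict[Tuple[str, Optional[str]], Dict[str, float]] = {
    ("a", "stressed"): {"f0_hz": 130.0, "duration_ms": 120.0},
    ("a", None): {"f0_hz": 115.0, "duration_ms": 90.0},
}


def select_segment(phone: str, context: str) -> Optional[List[float]]:
    """Segment concatenation: return a stored waveform, or None when the
    requested context is not covered by the inventory."""
    return SEGMENT_INVENTORY.get((phone, context))


def generate_params(phone: str, context: str) -> Dict[str, float]:
    """Model-based: use context-dependent parameters when available,
    otherwise fall back to the context-independent entry for the phone."""
    exact = MODEL_PARAMS.get((phone, context))
    if exact is not None:
        return exact
    return MODEL_PARAMS[(phone, None)]


if __name__ == "__main__":
    # "phrase-final" was never recorded or trained in this toy setup.
    print(select_segment("a", "phrase-final"))
    print(generate_params("a", "phrase-final"))
```

In this sketch the unseen context yields no segment at all, but still yields usable (if generic) speech parameters, which mirrors the generalization property noted in the text.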