A hybrid approach being explored recently in Text-to-Speech Synthesis (TTS) includes concatenating natural speech segments and artificial segments generated from a statistical model. Herein, this approach is referred to as Multi-Form Segment (MFS) synthesis, the natural segments are referred to as template segments or templates, and the artificial segments generated from statistical models are referred to as model segments. A voice dataset of an MFS TTS system contains a templates database and a set of statistical models typically represented by states of Hidden Markov Models (HMM). Each statistical model corresponds to a distinct context-dependent phonetic element. A many-to-one mapping exists that establishes an association between the templates and the statistical models. In synthesis time, input text is converted to a sequence of the context-dependent phonetic elements. Then, each element can be represented by either a template or a model segment generated from the corresponding statistical model.
The motivation behind the MFS approach is to combine the advantages of unit selection or the concatenative TTS paradigm, which operates purely on template segments, and the statistical TTS paradigm to build a flexible system that produces natural sounding speech with stable quality for a wide range of system footprints. However, if the voice character differs significantly between the concatenated template and model segments, the switching between the template and model segments deteriorates human perception. The perceptual quality of the MFS synthesis output strongly depends on the representation type (template versus model) selected for each segment comprising the synthesized sentence. If the representation type decision is made off-line prior to synthesis for all of the segments available within the voice dataset then the templates database can be pruned, resulting in system footprint reduction as model segments can be stored more compactly compared to template segments.
In another context, there is the problem of how to select a speaker for building a statistical TTS system. Voice dataset preparation for a statistical TTS model training is an intensive human labor and time consuming process. It typically includes the recording of several hours (e.g., 5-10 hours) of speech in a studio environment that is done in several sessions, and several person-weeks are required afterwards for manual error correction in speech transcripts and in phonetic alignment. Characteristics of the recorded voice significantly influence the final quality of the generated speech. The models produced from one speaker perform better than those built from another, while the gender, recording conditions, and the build process are very similar.