1. Technical Field
The present disclosure relates to speech synthesis and more specifically to a more efficient approach to unit-selection-based speech synthesis.
2. Introduction
A number of practical questions must be addressed for unit selection to operate efficiently. Building and using a unit selection database requires storage and rapid manipulation of large quantities of data such as speech units and their associated metadata. Existing unit selection algorithms can be too slow for real-time synthesis based on such large quantities of data. High-quality speech databases have tens of thousands or more speech units of different sounds, pitches, speeds, durations, and so forth. The computational complexity of unit selection is O(n²) because each unit in a list of n speech units is compared to n other speech units.
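By way of illustration only, the quadratic behavior described above can be sketched in Python as a simple dynamic-programming search. The sketch assumes, hypothetically, that each unit is represented by a single pitch value and that the target and join costs are supplied as plain functions; the name viterbi_select and the cost functions are illustrative and are not taken from any particular synthesizer. For every pair of adjacent candidate lists of size n, the inner loops evaluate the join cost n × n times.

```python
from typing import Callable, List, Sequence, Tuple


def viterbi_select(
    candidates: Sequence[Sequence[int]],
    target_cost: Callable[[int, int], int],
    join_cost: Callable[[int, int], int],
) -> Tuple[List[int], int, int]:
    """Return (cheapest path, its total cost, number of join-cost evaluations).

    candidates[t] is the list of candidate units for target position t.
    Units are toy pitch values here (an assumption made for illustration).
    """
    join_evals = 0
    # best[j] = (cost of the cheapest path ending at candidate j, that path)
    best = [(target_cost(0, u), [u]) for u in candidates[0]]
    for t in range(1, len(candidates)):
        new_best = []
        for u in candidates[t]:              # n candidates at position t
            options = []
            for prev_cost, prev_path in best:  # n candidates at position t - 1
                join_evals += 1
                options.append((prev_cost + join_cost(prev_path[-1], u), prev_path))
            cost, path = min(options, key=lambda o: o[0])
            new_best.append((cost + target_cost(t, u), path + [u]))
        best = new_best
    total, path = min(best, key=lambda b: b[0])
    return path, total, join_evals


if __name__ == "__main__":
    targets = [100, 110, 120]  # hypothetical desired pitch per position
    cands = [[90, 100, 130], [105, 115, 140], [100, 118, 125]]
    path, total, evals = viterbi_select(
        cands,
        target_cost=lambda t, u: abs(u - targets[t]),
        join_cost=lambda a, b: abs(a - b),
    )
    print(path, total, evals)
```

With three positions and three candidates per position, the search performs 2 × 3 × 3 = 18 join-cost evaluations; with several thousand candidates per position, the same loops would perform millions of evaluations per pair of adjacent positions.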
The basic approach can be unworkable with high-quality speech databases and is inefficient with lower-quality speech databases, leading to extra processing, storage, and memory requirements for speech synthesis systems and/or reduced-quality synthesized speech. One approach to accelerating the runtime calculation of a path through the unit selection network is join cost caching. For example, a large body of text can be synthesized and the join costs associated with the units used can be cached to speed up later synthesis, without an enormous space penalty. Another approach to this problem is preselection. Preselection assigns a context-based cost to individual units prior to calculating the complete target cost. The context-based cost is used to prune the possible candidates, which may number several thousand for a particular phone type, down to a number that can be used efficiently in the network, perhaps tens or low hundreds.
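By way of illustration only, the two accelerations described above can be sketched in Python as follows. The sketch makes several assumptions that are not part of the description: a unit is a hypothetical (identifier, left context phone, right context phone) tuple, the context-based cost is a simple count of context mismatches, and the cache is filled on demand rather than offline from a large synthesized corpus. The names context_cost, preselect, and JoinCostCache are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical unit representation: (unit id, left context phone, right context phone)
Unit = Tuple[str, str, str]


def context_cost(unit: Unit, left: str, right: str) -> int:
    """Toy context-based cost: number of neighboring phones that do not match."""
    return int(unit[1] != left) + int(unit[2] != right)


def preselect(units: List[Unit], left: str, right: str, keep: int) -> List[Unit]:
    """Keep the `keep` units with the lowest context-based cost.

    sorted() is stable, so units with equal cost stay in database order.
    """
    return sorted(units, key=lambda u: context_cost(u, left, right))[:keep]


class JoinCostCache:
    """Memoize join costs by (unit id, unit id) so each pair is computed once."""

    def __init__(self, join_cost: Callable[[str, str], float]) -> None:
        self._join_cost = join_cost
        self._cache: Dict[Tuple[str, str], float] = {}
        self.misses = 0  # number of times the underlying join cost was computed

    def __call__(self, a: str, b: str) -> float:
        key = (a, b)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._join_cost(a, b)
        return self._cache[key]


if __name__ == "__main__":
    database = [("u1", "a", "b"), ("u2", "x", "b"), ("u3", "x", "y"), ("u4", "a", "b")]
    shortlist = preselect(database, left="a", right="b", keep=2)
    cached_join = JoinCostCache(lambda a, b: float(a != b))  # placeholder join cost
    cached_join(shortlist[0][0], shortlist[1][0])
    cached_join(shortlist[0][0], shortlist[1][0])  # served from the cache
    print(shortlist, cached_join.misses)
```

Preselection shrinks n before the quadratic search runs, while caching lowers the cost of each join-cost lookup; neither changes the quadratic growth of the search over the candidates that remain.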
Even with join cost caching or preselection, the number of candidate units for synthesis is often very large. Accordingly, what is needed in the art is a more efficient way to perform unit selection in speech synthesis systems.