1. Technical Field
The present disclosure relates to speech synthesis and more specifically to preselecting units in unit selection synthesis.
2. Introduction
Many speech synthesis approaches exist, such as concatenative synthesis, formant synthesis, and synthesis based on hidden Markov models. Unit selection synthesis is a sub-type of concatenative synthesis. Unit selection synthesis generally uses a large database of speech. A unit selection algorithm selects units from a database that correspond to the desired units and obey the constraint that adjacent units form a good match. Expressed in mathematical terms, a network of candidate units is constructed and target costs are given to each unit in the network on the basis of some appropriateness measure. A concatenation or join cost represents the quality of concatenation of two speech segments. After constructing the network and assigning costs, the network is examined to determine the lowest cost path through the network. The algorithm then selects and concatenates together units that form the lowest cost path to produce the synthetic speech for the requested text or symbolic input.
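As an illustration only, the lowest cost path search described above can be sketched as a dynamic programming (Viterbi-style) pass over the candidate network. The function name `lowest_cost_path` and the `target_cost` and `join_cost` callables are hypothetical stand-ins for whatever appropriateness and concatenation measures a given synthesizer uses; they are assumptions, not part of the disclosure.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # a candidate unit; its concrete type is left open here


def lowest_cost_path(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],  # hypothetical: cost of unit at position i
    join_cost: Callable[[U, U], float],      # hypothetical: cost of joining two units
) -> Tuple[float, List[U]]:
    """Return (total cost, units) for the cheapest path through the network."""
    if not candidates or any(len(column) == 0 for column in candidates):
        raise ValueError("every position needs at least one candidate unit")

    # costs[j]: cost of the cheapest partial path ending at candidate j
    # of the position currently being processed.
    costs = [target_cost(0, u) for u in candidates[0]]
    back: List[List[int]] = []  # back[i][j]: best predecessor of candidate j at position i+1

    for i in range(1, len(candidates)):
        prev = candidates[i - 1]
        new_costs, pointers = [], []
        for u in candidates[i]:
            # Try every predecessor; this inner loop is the quadratic step.
            best = min(range(len(prev)), key=lambda k: costs[k] + join_cost(prev[k], u))
            new_costs.append(costs[best] + join_cost(prev[best], u) + target_cost(i, u))
            pointers.append(best)
        costs = new_costs
        back.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(costs)), key=costs.__getitem__)
    total = costs[j]
    path = [candidates[-1][j]]
    for i in range(len(back) - 1, -1, -1):
        j = back[i][j]
        path.append(candidates[i][j])
    path.reverse()
    return total, path
```

In this sketch each position holds its own list of candidates, so the work per pair of adjacent positions grows with the product of their candidate counts.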
A preselection phase cursorily examines candidate units for a synthetic utterance and passes only the most promising candidates to the network calculation phase. This approach can dramatically improve the performance of the system. So long as the preselection is done wisely, it does not greatly impact the overall quality of the system. A typical limit might be 50 candidates. The running time of such a system is expressed in Big O notation as O(n^2), where n is the number of candidates.
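To make the effect of the candidate limit concrete, the following minimal sketch keeps only the best-scoring candidates and counts how many join cost evaluations the network search would then need. The names `preselect`, `cheap_cost`, and `join_evaluations` are hypothetical, and the counting assumes that every candidate at one position is compared with every candidate at the next.

```python
import heapq
from typing import Callable, Iterable, List, Sequence, TypeVar

U = TypeVar("U")


def preselect(units: Iterable[U], cheap_cost: Callable[[U], float], limit: int = 50) -> List[U]:
    """Keep the `limit` units with the lowest cheap (cursory) cost."""
    return heapq.nsmallest(limit, units, key=cheap_cost)


def join_evaluations(candidate_counts: Sequence[int]) -> int:
    """Number of join costs computed when adjacent positions are fully connected."""
    return sum(a * b for a, b in zip(candidate_counts, candidate_counts[1:]))
```

Under these assumptions, three adjacent positions with 1000 candidates each need 2,000,000 join evaluations, while the same positions limited to 50 candidates each need 5,000.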
To be effective, unit preselection should be computationally cheap and performed on the basis of context. The fitness of a unit is determined by comparing the original context of the unit in the voice database with the proposed position of the unit in the context to be synthesized. For example, when a speech synthesizer preselects a vowel V that will occur in a t-V-r context, the synthesizer favors examples of that vowel that also occur in t-V-r contexts, because they are more likely to result in high quality synthesis. This approach works, but it does not perform at an optimal level with regard to accuracy and efficiency.
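The t-V-r example can be sketched as a simple neighbor comparison. The `Unit` record, its field names, and the mismatch count used as a score are all assumptions made for illustration; the sketch compares whole phoneme labels only, which is the coarse kind of comparison discussed here rather than a refined one.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Unit:
    """Hypothetical database entry: a phone with its original neighbors."""
    phone: str
    left: str   # phone that preceded this unit in the recorded speech
    right: str  # phone that followed this unit in the recorded speech
    index: int  # position in the voice database


def context_mismatch(unit: Unit, left: str, right: str) -> int:
    """Count how many neighbors differ from the target context (0, 1, or 2)."""
    return (unit.left != left) + (unit.right != right)


def rank_by_context(units: Iterable[Unit], phone: str, left: str, right: str) -> List[Unit]:
    """Order units of the requested phone from best to worst context match."""
    matching = [u for u in units if u.phone == phone]
    # sorted() is stable, so equally scored units keep their database order.
    return sorted(matching, key=lambda u: context_mismatch(u, left, right))
```

Because the sort is stable, ties in this sketch are resolved by database order, so a fixed candidate limit applied afterward would consistently favor earlier entries among equally scored units.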
Existing approaches are approximate and inflexible, tied to the phoneset used for recognition. They compare broad classes, that is, phonemes rather than allophones. Because of this, preselected candidate units may be only somewhat appropriate, while some very appropriate units fail to make the cut and are not considered further.
Existing approaches are also inefficient. System architectures cause a notable bias toward units that occur near one end of the database, so that some units in the database are underutilized. In effect, such systems work with a reduced-size database.
Previous work has introduced the concept of a pre- and post-vocalic distinction for some of the units in the database. While this has produced candidate lists that consist of generally more appropriate units, one negative effect is a need to replace existing standard phonesets with new specially designed phonesets as part of the solution, hindering synthesizer interoperability. Older work also added code to deal on an ad hoc basis with some other limitations of the preselection system concerned with word boundaries.