The voice quality of a concatenative speech synthesis system depends on how well the speech database used for training covers speech units in various contexts. Experiments and evaluations have shown that concatenative Text-to-Speech (TTS) systems with large training datasets outperform those with small training datasets. However, a large speech database can make the footprint of the TTS system too large to install on embedded systems such as mobile phones or automobiles. In current implementations, server or cloud-based TTS systems can reach 1 GB in size, while the maximum footprint allowed for embedded TTS systems is around 300 MB.
Conventionally, there are two categories of redundant unit pruning approaches for TTS systems. One is referred to as bottom-up: redundant units are identified and pruned solely by examining the database itself. Here, the similarity of units is calculated by objective measures that are independent of the unit-selection strategy. This approach has a drawback: the units regarded as “similar” and “representative”, and therefore retained during unit pruning, are not guaranteed to be chosen as replacements for the pruned units in the speech synthesis process, even if they sound similar to the pruned units in subjective perception. This is simply because the criterion for unit reduction is unrelated to the criterion for unit selection.
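As a rough illustration of the bottom-up idea, the sketch below prunes units using only the database contents. It assumes, purely for illustration, that each unit carries a numeric feature vector and that Euclidean distance stands in for the objective similarity measure; the function name `prune_bottom_up`, the tuple layout, and the threshold are hypothetical and are not taken from any particular TTS system.

```python
import math

def prune_bottom_up(units, threshold):
    """Greedy similarity pruning over the database alone.

    units: list of (unit_id, unit_type, feature_vector) tuples (assumed layout).
    A unit is dropped when an already kept unit of the same type lies within
    `threshold` Euclidean distance of it. No unit-selection cost is consulted.
    """
    kept = []
    for unit_id, unit_type, features in units:
        redundant = any(
            kept_type == unit_type and math.dist(kept_features, features) < threshold
            for _, kept_type, kept_features in kept
        )
        if not redundant:
            kept.append((unit_id, unit_type, features))
    return [unit_id for unit_id, _, _ in kept]

# Hypothetical toy database: three instances of unit type "a", one of type "b".
database = [
    ("u1", "a", (0.0, 0.0)),
    ("u2", "a", (0.1, 0.0)),   # close to u1, so treated as redundant
    ("u3", "a", (5.0, 5.0)),
    ("u4", "b", (0.0, 0.0)),   # different unit type, never compared with u1
]
retained = prune_bottom_up(database, threshold=0.5)
```

Nothing in this sketch consults the unit-selection criterion, which is the source of the drawback noted above: a retained unit such as `u1` may still not be the one the synthesizer picks in place of the pruned `u2`.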
The other approach is referred to as top-down. In this method, units are pruned based on an analysis of the units actually selected by the TTS system. Redundancy is judged by unit appearance frequency (“UAF”), which indicates how often a unit is selected in large-scale synthesis. The UAF is generated from the statistical results of synthesizing a very large set of test text scripts. A high UAF indicates that the unit is frequently selected in the synthesis process, while a low UAF means the unit is chosen less often. This method prunes away the units with lower UAF.
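The UAF-based pruning described above can be sketched as follows. This is a minimal illustration under stated assumptions: UAF is taken to be the raw count of how often each unit was selected across all synthesized test scripts, and pruning keeps a fixed fraction of the highest-UAF units. The names `compute_uaf`, `prune_by_uaf`, and `keep_ratio` are hypothetical, and the source does not specify how the cut-off is chosen.

```python
import math
from collections import Counter

def compute_uaf(selection_log, all_unit_ids):
    """Count how often each database unit was selected during synthesis.

    selection_log: iterable of unit ids, one entry per selection (assumed format).
    Units that were never selected get a UAF of 0.
    """
    counts = Counter(selection_log)
    return {unit_id: counts.get(unit_id, 0) for unit_id in all_unit_ids}

def prune_by_uaf(uaf, keep_ratio):
    """Keep the top `keep_ratio` fraction of units ranked by UAF (assumed policy)."""
    n_keep = math.ceil(len(uaf) * keep_ratio)
    # Sort by descending UAF; ties are broken by unit id for determinism.
    ranked = sorted(uaf, key=lambda unit_id: (-uaf[unit_id], unit_id))
    return set(ranked[:n_keep])

# Hypothetical log of units chosen while synthesizing the test scripts.
log = ["a", "b", "a", "c", "a", "b"]
uaf = compute_uaf(log, all_unit_ids=["a", "b", "c", "d"])
retained = prune_by_uaf(uaf, keep_ratio=0.5)
```

In this toy run, units `a` and `b` are retained, while `c` (selected once) and `d` (never selected) are pruned.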