With the rapid development of mobile internet and artificial intelligence technologies, speech synthesis applications (such as voice broadcasting, listening to novels or news, and intelligent interaction) have become increasingly popular.
At present, when a speech synthesis system synthesizes speech from text, the input text is first normalized. Operations such as word segmentation, part-of-speech tagging, and phonetic notation are then performed on the normalized text. Next, the prosodic hierarchy and the acoustic parameters of the text are predicted. Finally, the speech output is obtained.
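The sequence of stages described above can be pictured as a chain of functions, each consuming the output of the previous one. The following is only a minimal sketch: every function name and every stage body is a hypothetical placeholder chosen for illustration (for example, letters stand in for phonemes), not the behavior of any actual speech synthesis system.

```python
# Illustrative sketch only: each stage is a toy stand-in, not a real model.

def normalize(state):
    # Text normalization (assumed here: collapse whitespace, lowercase).
    state["text"] = " ".join(state["text"].split()).lower()
    return state

def segment_words(state):
    # Word segmentation (assumed here: split on whitespace).
    state["words"] = state["text"].split()
    return state

def tag_parts_of_speech(state):
    # Part-of-speech tagging (placeholder tag "X" for every word).
    state["pos"] = ["X" for _ in state["words"]]
    return state

def annotate_phonetics(state):
    # Phonetic notation (placeholder: letters stand in for phonemes).
    state["phones"] = [list(word) for word in state["words"]]
    return state

def predict_prosody(state):
    # Prosodic hierarchy (placeholder: break level 1 between words,
    # level 3 after the final word).
    n = len(state["words"])
    state["prosody"] = ([1] * (n - 1) + [3]) if n else []
    return state

def predict_acoustics(state):
    # Acoustic parameters (placeholder: one dummy value per phoneme).
    state["acoustic"] = [[0.0 for _ in p] for p in state["phones"]]
    return state

def generate_speech(state):
    # Speech output (placeholder: one zero byte per acoustic value).
    total = sum(len(values) for values in state["acoustic"])
    state["audio"] = bytes(total)
    return state

# The stages run strictly in the order given in the text.
PIPELINE = [
    normalize,
    segment_words,
    tag_parts_of_speech,
    annotate_phonetics,
    predict_prosody,
    predict_acoustics,
    generate_speech,
]

def synthesize(text):
    state = {"text": text}
    for stage in PIPELINE:
        state = stage(state)
    return state
```

Calling `synthesize("  Hello   World ")` passes the text through all seven stages in order; because the stage list is fixed, every request goes through the same processing regardless of the current load.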
However, the configuration of a speech synthesis system is usually fixed and cannot be set flexibly according to the actual scenario and the load condition, so the system cannot adapt to speech synthesis requests under different environments. For example, when the speech synthesis system receives a large number of speech synthesis requests in a short period of time, its load capacity is likely to be exceeded, which can lead to an accumulation of pending speech synthesis requests. As a result, users cannot receive feedback in time, and the user experience is degraded.
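The accumulation of requests under a fixed configuration can be illustrated with a simple backlog calculation. The sketch below assumes a system that can complete a fixed number of requests per second; the function name, the arrival counts, and the capacity value are all hypothetical and are used only to show the effect.

```python
def simulate_backlog(arrivals_per_second, capacity_per_second):
    """Return the number of pending requests at the end of each second.

    Assumes a fixed processing capacity that never changes with load.
    """
    backlog = 0
    history = []
    for arrived in arrivals_per_second:
        # Requests that cannot be served within this second stay queued.
        backlog = max(0, backlog + arrived - capacity_per_second)
        history.append(backlog)
    return history

# Hypothetical burst: 40 requests per second for two seconds,
# against a fixed capacity of 10 requests per second.
history = simulate_backlog([5, 5, 40, 40, 5, 5], capacity_per_second=10)
print(history)  # [0, 0, 30, 60, 55, 50]
```

In this hypothetical run, the backlog is still 50 requests two seconds after the burst has ended, which corresponds to the delayed feedback described above.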