Speech synthesis, also known as text-to-speech (TTS) technology, converts arbitrary text into standard, fluent speech that is read out in real time, which is equivalent to fitting a machine with an artificial mouth. In speech synthesis, the input text is first processed, including pre-processing, word segmentation, part-of-speech tagging, phonetic notation, prosodic hierarchy prediction, and the like; acoustic parameters are then generated by an acoustic model; and finally, speech is synthesized either by a vocoder using the acoustic parameters or by selecting units from a recording corpus and splicing them together.
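For illustration only, the three stages described above (text processing, acoustic-parameter generation, and vocoder synthesis) can be sketched as a toy pipeline. All names here (`front_end`, `acoustic_model`, `vocoder`, `Frame`) are hypothetical, and the pitch and duration rules are arbitrary placeholders rather than a trained model; a real system would replace each stage with the corresponding linguistic analysis, statistical or neural acoustic model, and signal-processing vocoder.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """Toy acoustic parameters for one token (assumed representation)."""
    f0_hz: float       # fundamental frequency in Hz
    duration_s: float  # duration in seconds


def front_end(text: str) -> List[str]:
    """Stage 1 (toy): normalise the text and segment it into word tokens."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def acoustic_model(tokens: List[str]) -> List[Frame]:
    """Stage 2 (toy): map each token to placeholder acoustic parameters."""
    frames = []
    for tok in tokens:
        f0 = 120.0 + 10.0 * (len(tok) % 5)   # arbitrary pitch rule
        duration = 0.05 * len(tok)           # arbitrary duration rule
        frames.append(Frame(f0_hz=f0, duration_s=duration))
    return frames


def vocoder(frames: List[Frame], sample_rate: int = 16000) -> List[float]:
    """Stage 3 (toy): render each frame as a sine tone at its pitch."""
    samples: List[float] = []
    for fr in frames:
        n = int(round(fr.duration_s * sample_rate))
        for k in range(n):
            samples.append(math.sin(2.0 * math.pi * fr.f0_hz * k / sample_rate))
    return samples


def synthesize(text: str, sample_rate: int = 16000) -> List[float]:
    """Chain the three stages: text -> tokens -> parameters -> waveform."""
    return vocoder(acoustic_model(front_end(text)), sample_rate)
```

Calling `synthesize("Hello, world!")` returns a list of waveform samples in the range [-1, 1]. The unit-selection alternative mentioned above would replace the `vocoder` stage with a lookup and concatenation of recorded units, which is not shown in this sketch.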
In the related art, generating the acoustic model takes a long time and cannot meet individualized demands.