Speech synthesis, also known as text-to-speech (TTS), is a technology that transforms text information into speech and reads it aloud. Speech synthesis is a frontier technology in the field of Chinese information processing, involving many disciplines such as acoustics, linguistics, digital signal processing, and computer science; its central problem is how to transform text information into audible sound information.
In a speech synthesis system, the process of transforming text information into sound information is as follows. First, the input text is processed, including pre-processing, word segmentation, part-of-speech tagging, polyphone prediction, prosodic hierarchy prediction, and the like. Then, the acoustic features of each unit are predicted via an acoustic model. Finally, a voice is synthesized directly from the acoustic parameters via a vocoder, or units are selected from a recording corpus base and spliced together to generate the sound information corresponding to the text.
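The pipeline above can be sketched end to end as follows. All components here (the lexicon, the "acoustic model", the sinusoidal "vocoder") are toy stand-ins invented for illustration, not a real TTS system:

```python
import numpy as np

# Toy lexicon standing in for the text front-end's pronunciation dictionary.
LEXICON = {"hello": ["HH", "EH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def front_end(text):
    """Pre-process and segment text, then map words to phone units."""
    words = text.lower().strip().split()          # trivial word segmentation
    return [p for w in words for p in LEXICON.get(w, [])]

def acoustic_model(phones):
    """Predict per-unit acoustic features (here: a fake F0 and duration)."""
    # A real model predicts spectral and prosodic parameters frame by frame.
    return [(120.0 + 5.0 * i, 0.1) for i, _ in enumerate(phones)]

def vocoder(features, sr=16000):
    """Render acoustic parameters to a waveform (toy sinusoid per unit)."""
    chunks = []
    for f0, dur in features:
        t = np.arange(int(dur * sr)) / sr
        chunks.append(np.sin(2 * np.pi * f0 * t))
    return np.concatenate(chunks)

phones = front_end("hello world")
feats = acoustic_model(phones)
wave = vocoder(feats)
```

The unit-selection alternative mentioned above would replace `acoustic_model` and `vocoder` with a search over recorded units followed by waveform splicing.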
The acoustic model is one of the foundations of the whole speech synthesis system, and is usually obtained by training on large-scale speech data. The process of training the acoustic model is as follows. First, a certain number of recording texts are designed to meet requirements of phone coverage and prosody coverage. Second, suitable speakers are selected, and the speakers record the speech data accordingly. Then, the text, Chinese phonetic transcription, prosody, and unit boundaries are annotated, and the annotated data is used for model training and speech-base generation. It can be seen that training an acoustic model is a complex and time-consuming process, and since training is based on the speech data of fixed speakers, the timbre of the synthesized speech is fixed when that acoustic model is used for speech synthesis.
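The first step above, designing recording texts that meet phone-coverage requirements, is often approached as a set-cover problem. A minimal greedy sketch, with invented sentence IDs and a toy phone inventory:

```python
def greedy_cover(candidates, targets):
    """Greedily pick sentences until every target phone is covered.

    candidates: dict mapping sentence ID -> set of phones it contains.
    targets: set of phones the recording corpus must cover.
    Returns the chosen sentence IDs and any phones left uncovered.
    """
    remaining = set(targets)
    chosen = []
    while remaining:
        # Pick the sentence covering the most still-uncovered phones.
        best = max(candidates, key=lambda s: len(candidates[s] & remaining))
        gain = candidates[best] & remaining
        if not gain:
            break  # remaining phones appear in no candidate sentence
        chosen.append(best)
        remaining -= gain
    return chosen, remaining

# Toy candidate sentences and their phone sets (invented for illustration).
candidates = {
    "s1": {"a", "b", "c"},
    "s2": {"c", "d"},
    "s3": {"d", "e"},
    "s4": {"a", "e"},
}
chosen, uncovered = greedy_cover(candidates, {"a", "b", "c", "d", "e"})
```

Real corpus design additionally weighs prosodic contexts and phone frequency, but the greedy coverage idea is the same.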
However, in many cases it is desired to use one's own voice, the voice of a family member or friend, or the voice of a celebrity for speech synthesis; that is, users wish the speech synthesized by the speech synthesis system to have personalized speech features. In order to meet this requirement for personalized speech, there are mainly the following two modes for obtaining a personalized acoustic model in the related art.
The first mode is to train the personalized acoustic model that the user needs at the acoustic parameter level, using either parallel corpus or non-parallel corpus.
The second mode is to use a mapping between models to realize transformation between a reference acoustic model and a personalized acoustic model. In detail, hidden Markov models and Gaussian mixture models (HMM-GMM for short) are used for modeling, and mappings between decision trees are performed, so as to generate the personalized acoustic model.
However, in a process of realizing the present disclosure, the inventors find there are at least following problems in the related art.
For the first mode, there are two branches as follows. (1) When the personalized acoustic model is trained at the acoustic parameter level using the parallel corpus, it requires two speakers to generate original speech according to a same text, which is sometimes impractical. When using the parallel corpus, a requirement for a scale of corpus may be high, required time is long, and the processing volume is large, such that it is difficult to obtain the personalized acoustic model rapidly. (2) When the personalized acoustic model is trained at the acoustic parameter level using the non-parallel corpus, the two speakers generate the original speech according to different text, and there is obvious difference between pronunciations in different sentence environments for a same syllable. Therefore, if a mapping is performed on a certain same phone in different sentences of different speakers, it is likely to cause the trained personalized acoustic model is not accurate, thus causing that the synthesized speech is not natural enough.
For the second mode, the decision tree is a shallow model, with limited description ability, especially, when the amount of the speech data of the user is small, such that accuracy of the generated personalized acoustic model is not high, thus resulting in incoherent situations in the predicted parameters, and then making the synthesized speech appear phenomena such as a jumping change, unstable timbre, and the like, and resulting in unnaturalness of the speech.