Synthesizing realistic-sounding human speech in real time from linguistic features and F0 is a challenging problem. The application of deep learning to speech synthesis, as in the WaveNet project, has produced promising results, with applications including the classical text-to-speech ("TTS") problem. While WaveNet and other early work addressed TTS starting from linguistic features, subsequent work showed that speech could be synthesized directly from input text. The approach has also been adapted to other problems, including voice conversion, speech enhancement, and musical instrument synthesis.
Despite the impressive quality of the synthesized waveform, deep-learning techniques such as WaveNet still suffer from several drawbacks. In particular, these approaches require a substantial training corpus (roughly 30 hours), synthesis is slow (about 40 minutes to produce one second of audio), and the result contains audible noise.
More recent work showed that WaveNet can also be used as a vocoder, generating a waveform from acoustic features rather than linguistic features. Working from acoustic features, training is effective with a substantially smaller corpus (roughly one hour) while still producing higher-quality speech than baseline vocoders such as the mel-log spectrum approximation (MLSA) vocoder. Several research efforts have addressed the problem of computational cost, including algorithmic improvements to the same architecture, referred to as Fast WaveNet, which can synthesize one second of audio in roughly one minute. Other efforts have achieved real-time synthesis by significantly reducing the WaveNet model size, but at the expense of noticeably worse voice quality. Still other efforts have parallelized WaveNet for GPU computing, allowing real-time operation on certain GPU clusters; however, this approach does not reduce the actual computational cost, but instead demands a far costlier hardware solution.
In general, deep-learning techniques for speech synthesis such as WaveNet suffer from significant drawbacks, namely requiring a large training corpus and having slow synthesis times, and therefore new approaches are necessary. Further, known methods such as the WaveNet model suffer from high computational complexity due to their use of dilated convolutions and a gated filter structure. Thus, what is required are deep-learning techniques for speech synthesis that achieve a large receptive field, correlating audio samples far in the past with the current input sample, without imposing significant computational penalties.
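To illustrate why the dilated convolution and gated filter structure described above yields both a large receptive field and a high computational cost, the following is a minimal NumPy sketch of a WaveNet-style gated, dilated causal convolution layer. It is a simplified illustration, not the actual WaveNet implementation: the kernel size of 2, the single-channel signals, and the specific layer count are assumptions chosen for clarity.

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """Causal 1-D convolution with kernel size 2 and the given dilation.

    y[t] = w[0] * x[t - dilation] + w[1] * x[t], with zero padding for
    t - dilation < 0, so no future samples are used (causality).
    """
    padded = np.concatenate([np.zeros(dilation), x])
    return w[0] * padded[:len(x)] + w[1] * x

def gated_unit(x, w_filter, w_gate, dilation):
    """Gated activation: tanh(filter output) * sigmoid(gate output).

    Each layer requires two dilated convolutions (filter and gate),
    which contributes to the computational cost noted above.
    """
    f = dilated_causal_conv(x, w_filter, dilation)
    g = dilated_causal_conv(x, w_gate, dilation)
    return np.tanh(f) * (1.0 / (1.0 + np.exp(-g)))

# With kernel size 2 and dilations doubling each layer (1, 2, 4, ...),
# each layer extends the receptive field by its dilation, so the field
# grows exponentially with depth while the per-layer cost stays fixed.
dilations = [2 ** i for i in range(10)]   # 10 layers: 1, 2, ..., 512
receptive_field = 1 + sum(dilations)      # 1024 samples
```

Stacking such layers is how the architecture correlates audio samples far in the past with the current sample; the trade-off motivating the text above is that every output sample still requires evaluating every layer.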