1. Field of the Disclosure
The present disclosure relates to speech synthesis and more specifically to caching and intelligently fetching parts of voice models for use in speech synthesis.
2. Introduction
Text-to-speech (TTS) synthesis is a valuable technology for hands-free or eyes-free natural interactions with applications running on mobile devices and other small form factor devices, such as smart phones, tablets, in-car infotainment systems, digital home components, and so forth. A TTS engine can run “embedded” on a device, or in the “cloud,” depending on network availability and device capabilities. Both on-device and network-based speech synthesis have advantages and disadvantages. Network-based speech synthesis, in particular, can provide access to large amounts of storage to support very large voice models with good coverage of realistic prosody and phonemic contexts, and to store many different such voice models, supporting varying “personalities” for applications and many different languages. On-device TTS engines, on the other hand, offer reliably low latency responses independent of network conditions or latency, can operate when a network connection is not available, and avoid the costs and overhead associated with deploying and maintaining cloud-based servers.
Existing solutions attempt to reduce the downsides of these approaches by switching between a local and a network-based TTS engine on demand. However, these solutions have downsides of their own: switching exposes sharp differences between the TTS engines, and the solutions remain subject to network latency.