1. Technical Field
The present disclosure relates to speech processing and more specifically to predicting and managing speech and language processing models on embedded devices.
2. Introduction
Interactive speech technologies (“IST”), including Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), Text-to-Speech Synthesis (TTS), and Machine Translation (MT), are valuable for hands-free/eyes-free interactions with applications running on mobile and other small devices, such as smartphones, tablets, in-car infotainment systems, home automation systems, and so forth. In many implementations, a network-based server performs IST tasks and exchanges data and results with a local device via a network. However, performing IST tasks on the local device can offer many benefits, such as reliably low-latency responses; the ability to operate under poor network conditions or when no network connection is available; the ability to tap into information available only on the local device, such as calendar appointments, contact lists, personal information, and so forth, to build local models that perform better than generic network models; and the reduced cost of building and maintaining large server networks.
Mobile devices, however, often have limited disk space, especially compared to network-based servers. Therefore, mobile devices cannot store as wide a range of the speech processing models that define the functionality of ISTs. This is particularly true across languages, each of which requires a separate set of TTS voice models, ASR acoustic and language models, and NLU models. Further, separate ASR and NLU models are usually trained for each task domain, such as dictation, SMS, web search, and so forth. Many of these models can be one to several gigabytes in size, so storing them would quickly consume an unacceptably large fraction of local storage on a typical mobile device and would compete for space with the operating system, apps, photos, music, videos, and other digital content. Furthermore, a user may need speech processing models for applications or languages that the user had not anticipated when setting up the mobile device.