Cloud-based speech services and other voice-recognition services generally include numerous models and databases, e.g., language and acoustic models, as well as models and databases for different information domains, end-user devices, and so on, to allow for the performance of speech recognition tasks. Each model may be directed to specific aspects of audio processing, such as identifying the acoustic environment of the captured audio, e.g., for noise-cancellation purposes, or determining characteristics of the speech content, such as the particular language of the speaker. Other models may be utilized to identify the particular type of voice task the user desires. Such tasks may include, for example, navigational commands, calendar/scheduling functions, Internet searches, and other general questions/tasks that may be satisfied through one or more artificial intelligence engines in combination with additional knowledge sources. A user is often unaware of the number of models and application engines/modules utilized for even the simplest voice commands, as cloud-based speech services hide much of this complexity from view.
However, the storage size of models can be relatively large, particularly for language models. Language models directed to specific clients/end-users may require gigabytes of storage. As systems scale from hundreds to thousands of clients/end-users, and beyond, the storage costs increase proportionally. Some approaches to scaling voice services include cluster configurations that achieve efficiencies by distributing models and application engines/modules among various storage and application nodes, respectively. However, such distribution requires that each request to perform speech recognition be handled by a server-side manager (sometimes referred to as a grid manager) that is able to retrieve the necessary models from the storage nodes and interact with the application nodes as needed. The response time for each voice request is thus the sum of the delay to acquire voice samples at the speech services system, the time to analyze the received audio samples to determine the appropriate models/applications, the time to load those models for further processing, and the time to process the received audio samples using the loaded language models and applicable applications to determine a final result.
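The additive nature of the request path described above can be sketched as follows. This is a purely illustrative model, not an actual speech-services implementation; all stage names and timing figures are assumptions chosen to mirror the steps in the text (acquiring audio, analyzing samples, loading models, and running recognition).

```python
# Hypothetical sketch of a grid manager's sequential request path.
# Stage names and delay values are illustrative assumptions only.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    delay_ms: float  # assumed latency contribution of this stage


def grid_manager_latency(stages):
    """Total response time is the sum of the sequential stage delays,
    since each step must complete before the next can begin."""
    return sum(stage.delay_ms for stage in stages)


# Illustrative pipeline matching the steps described in the text.
pipeline = [
    Stage("acquire_audio", 40.0),
    Stage("analyze_samples", 15.0),
    Stage("load_models", 120.0),    # model retrieval often dominates
    Stage("run_recognition", 60.0),
]

total = grid_manager_latency(pipeline)
print(f"end-to-end latency: {total:.0f} ms")
```

The sketch makes the structural point of the passage concrete: because the grid manager serializes these steps, any per-stage delay, particularly model loading, adds directly to the end-to-end response time.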
The grid manager required by such speech services systems thus acts as a middleman that unfortunately introduces latencies, increasing the overall amount of time between a user issuing a voice request/command to a device and the speech services system ultimately providing a final response back to the requesting device. Even relatively minor delays, e.g., in the tens of milliseconds, serve to decrease the usability of such systems and impede continued adoption of speech-enabled services/devices by users.
These and other features of the present embodiments will be understood better by reading the following detailed description, taken together with the figures herein described. The accompanying drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing.