1. Field
Embodiments of the present invention generally relate to speech recognition. More particularly, embodiments of the present invention relate to processing multiple input speech streams using one or more acoustic models.
2. Background
Real-time data pattern recognition is increasingly used to analyze data streams in electronic systems. Speech recognition systems have achieved improved accuracy on vocabularies of tens of thousands of words or more, making speech recognition an attractive feature for electronic systems. For example, speech recognition systems are increasingly common in consumer markets that target data pattern recognition applications, such as the mobile device, server, automobile, and PC markets.
Despite the improved accuracy in speech recognition systems, the speech recognition process consumes significant computing resources, which in turn places a significant load on parts of the computing system such as the memory environment.
The memory environment stores data from multiple frames that are being analyzed. This requires a large memory array, both in terms of the number of entries in the array and the width of each entry. Such large memory arrays can be slow and can require significant power to read from and write to. The size of the memory and the load placed on the memory by the speech recognition process affect the speed at which the computing system can process incoming voice signals while executing other applications. Further, for handheld devices that typically include limited memory resources (as compared to desktop computing systems, for example), speech recognition applications not only place significant load on the handheld device's computing resources but also consume a significant portion of the handheld device's memory resources.
The above issues of processing capability, speed, and memory resources are further exacerbated by the need to process incoming voice signals in real-time or substantially close to real-time. To reduce the resource load placed on user devices, while still providing real-time or substantially close to real-time speech recognition, the speech processing computation can be done by central servers. An advantage, among others, of using a centralized server to process speech is that the resource-intensive operations can be executed by a more powerful system, leaving the resources on user devices available for other applications. This is especially useful for mobile applications, such as car systems or phone systems, where resources can be limited.
In the server environment, however, memory bandwidth has become a significant point of contention. As discussed above, vocabularies can have tens of thousands of words. In addition, an acoustic model associated with a vocabulary can have thousands of senones, which represent the sounds that make up the vocabulary. Speech recognition processing of such vocabularies can be computationally intensive for the server environment, especially if it must be done in real-time or substantially close to real-time.