The invention relates generally to speech recognition. More specifically, the invention relates to a method of performing automatic speech recognition for multiple users on a heterogeneous CPU-GPU platform.
Modern multi-user applications are often challenged by the need to scale to a potentially large number of users while minimizing the degradation in service response, even under peak load conditions. In particular, large vocabulary continuous speech recognition (LVCSR) applications present an additional hurdle because of the disparity between the number of potentially active users and the system's limited ability to provide computationally intensive automatic speech recognition (ASR) services.
In one previous ASR framework, a distributed speech recognition (DSR) system assigned one thread or process per client, accepting clients only until their number approached the server's peak capacity in order to prevent performance degradation. This approach generally works well on modern homogeneous multi-core CPU platforms. However, many ASR systems incorporate a GPU-accelerated speech recognition engine to overcome the capacity limitation of conventional multi-core ASR systems, because GPUs can accelerate decoding speed significantly. The GPU-accelerated framework presents challenges to scaling because GPUs can only process one GPU kernel at a time and the number of GPU devices per server is limited. As a result, GPU processes can become a serial bottleneck in the overall ASR system, even though the GPUs significantly accelerate the computationally intensive phases. Therefore, a multi-user ASR engine architecture needs to be specially optimized for a CPU-GPU heterogeneous platform to efficiently support many users.
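By way of illustration only, the thread-per-client arrangement and the resulting GPU serialization described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular prior system: the names `MAX_CLIENTS`, `GpuDevice`, `decode_utterance`, and `serve` are hypothetical, the GPU is modeled as a lock-protected stand-in that runs one kernel at a time, and the per-frame arithmetic is placeholder work rather than actual decoding.

```python
import threading

MAX_CLIENTS = 4  # hypothetical server peak capacity (assumption)


class GpuDevice:
    """Hypothetical stand-in for one GPU that runs a single kernel at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0        # kernels currently executing
        self.max_active = 0    # highest concurrency ever observed
        self.kernels_run = 0   # total kernels executed

    def run_kernel(self, work):
        # The lock models the serial bottleneck: all clients queue here.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            result = work()
            self.kernels_run += 1
            self.active -= 1
            return result


def decode_utterance(gpu, client_id, n_frames):
    """Placeholder decode loop: CPU phase per thread, GPU phase serialized."""
    total = 0
    for frame in range(n_frames):
        feature = client_id * 100 + frame                 # CPU-side phase (placeholder)
        total += gpu.run_kernel(lambda f=feature: f * 2)  # GPU-side phase (placeholder)
    return total


def serve(client_ids, n_frames=3):
    """Admit clients up to MAX_CLIENTS, one thread each; reject the rest."""
    admitted = client_ids[:MAX_CLIENTS]
    rejected = client_ids[MAX_CLIENTS:]
    gpu = GpuDevice()
    results = {}

    def worker(cid):
        results[cid] = decode_utterance(gpu, cid, n_frames)

    threads = [threading.Thread(target=worker, args=(cid,)) for cid in admitted]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, rejected, gpu
```

In this sketch, calling `serve([0, 1, 2, 3, 4, 5])` admits four clients and turns away two, and `gpu.max_active` never exceeds one regardless of how many client threads are running, which reflects why adding CPU threads alone does not relieve the GPU-side bottleneck.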
The server arrangement itself poses another issue when scaling for multiple users. Researchers have proposed server arrangements to improve the capacity and efficiency of the overall speech recognition system. One arrangement uses an event-driven, non-blocking input-output server framework in which a dispatcher, which routes all system events, buffers client queries on a decoder proxy server, and the proxy server redirects the requests to a free ASR engine. Another arrangement presents an alternative architecture in which the entire ASR system is decomposed into 11 functional blocks that are interconnected via a hub to allow more efficient parallel use of the ASR system. However, these works investigate only the optimal server arrangement, assuming that an ASR engine can support multiple users efficiently. As described above, this is not the case when GPU acceleration is utilized.
Another work proposed a GPU-accelerated ASR engine architecture, investigating optimal task scheduling to minimize task wait time and sharing the acoustic model parameters among users in order to process more users. However, this work did not propose an ASR engine architecture that leveraged multicore CPUs and GPUs at the same time. It would therefore be advantageous to develop a GPU-accelerated ASR engine architecture capable of serving multiple users.