In recent years it has become more common for computers to have multicore and manycore processors, which significantly increase the speed at which complex computing tasks can be performed by executing such tasks in parallel across the various processing cores. For some complex computing tasks, however, such processors are limited by their memory capacity. For example, manycore graphics processing units (GPUs) have a limit of 2-8 gigabytes (GB) of memory. This memory limit constrains tasks such as computing a large-scale graph traversal, where graph structures consist of many millions of arcs and models can be on the order of hundreds of GB in size or larger.
Therefore there is a need for methods that will effectively perform large-scale graph traversal on parallel processor platforms, by efficiently leveraging heterogeneous parallel computing cores.
One field in which such improved methods are needed is large vocabulary continuous speech recognition (LVCSR). As one example of such a need, voice user interfaces are emerging as a core technology for next-generation smart devices. To ensure a captivating user experience, it is critical that the speech recognition engines used within these systems are robust and fast, have low latency, and provide sufficient coverage over the extremely large vocabularies that the system may encounter. In order to obtain high recognition accuracy, state-of-the-art speech recognition systems for tasks such as broadcast news transcription [1, 2] or voice search [3, 4] may perform recognition with large vocabularies (>1 million words), large acoustic models (millions of model parameters), and extremely large language models (billions of n-gram entries). While these models can be applied in offline speech recognition tasks, they are impractical for real-time speech recognition due to the large computational cost required during decoding.
The use of statically compiled Weighted Finite State Transducer (WFST) networks, in which the WFSTs representing the Hidden Markov Model (HMM) acoustic model H, context model C, pronunciation lexicon L, and language model G are composed into one single network, commonly known as an H-level WFST, makes it possible to perform speech recognition very efficiently [5]. However, the composition and optimization of such search networks become infeasible when large models are used.
On-the-fly composition is a practical alternative to performing speech recognition with a single fully composed WFST. On-the-fly composition involves applying groups of two or more sub-WFSTs in sequence, composing them as required during decoding. One common approach is to precompose H ∘ C ∘ L before decoding and then compose this with the grammar network G on-the-fly. On-the-fly composition has been shown to be economical in terms of memory, but decoding is significantly slower than with a statically compiled WFST [6].
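The idea of composing sub-networks only as decoding requires can be illustrated with a minimal Python sketch. It is not the method of [6]; it assumes epsilon-free transducers with tropical-semiring costs, and the class `LazyComposition`, the helper `decode`, and the toy transducers `A` and `B` are hypothetical names introduced only for illustration.

```python
# A transducer is a dict: state -> list of (input, output, cost, next_state).
# Costs are added along a path (tropical semiring); lower is better.

class LazyComposition:
    """Epsilon-free composition A o B whose states are expanded only on demand."""

    def __init__(self, a_arcs, b_arcs, a_start, b_start):
        self.a_arcs = a_arcs
        self.b_arcs = b_arcs
        self.start = (a_start, b_start)
        self.cache = {}  # composed state -> arcs already expanded

    def arcs(self, state):
        # Expand a composed state the first time the decoder visits it.
        if state not in self.cache:
            qa, qb = state
            out = []
            for (i, mid, w1, na) in self.a_arcs.get(qa, []):
                for (mid2, o, w2, nb) in self.b_arcs.get(qb, []):
                    if mid == mid2:  # output of A must match input of B
                        out.append((i, o, w1 + w2, (na, nb)))
            self.cache[state] = out
        return self.cache[state]


def decode(comp, inputs):
    """Follow an input sequence, keeping the cheapest hypothesis per composed state."""
    frontier = {comp.start: (0.0, [])}
    for sym in inputs:
        nxt = {}
        for state, (cost, words) in frontier.items():
            for (i, o, w, dest) in comp.arcs(state):
                if i == sym:
                    cand = (cost + w, words + [o])
                    if dest not in nxt or cand[0] < nxt[dest][0]:
                        nxt[dest] = cand
        frontier = nxt
    return frontier


# Toy stand-in for a precomposed network: one input symbol per word.
A = {0: [("x", "hello", 0.5, 0), ("y", "world", 0.7, 0), ("z", "there", 0.9, 0)]}
# Toy stand-in for the grammar G: an acceptor over words.
B = {0: [("hello", "hello", 1.0, 1)],
     1: [("world", "world", 0.2, 2), ("there", "there", 1.5, 2)],
     2: []}

comp = LazyComposition(A, B, 0, 0)
result = decode(comp, ["x", "y"])
best_cost, best_words = result[(0, 2)]
print(best_words, round(best_cost, 3))
print("composed states expanded:", len(comp.cache))
```

In this toy case only the composed states actually reached during decoding are built and cached, which is where the memory saving comes from; the matching work done inside `arcs` at decode time is the corresponding source of the slowdown relative to a statically compiled network.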
An alternative approach for efficient WFST decoding is to perform hypothesis rescoring [3] rather than composition during search. In this approach, Viterbi search is performed using H ∘ C ∘ G_uni, and another WFST network, G_uni/tri, is used solely for rescoring, in an on-the-fly fashion, the hypotheses generated by the Viterbi search process. Since this algorithm allows all knowledge sources to be available from the beginning of the search, it is effective for both selecting correct paths and pruning hypotheses.
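The rescoring idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of [3]: a small beam search over word slots stands in for the Viterbi search, a bigram table stands in for the larger rescoring model, and all names and costs (`UNIGRAM`, `BIGRAM`, `BACKOFF_PENALTY`, `SLOTS`, `larger_model_cost`, `search`) are hypothetical. Costs are negative log probabilities, so lower is better.

```python
# Hypothetical toy models. The unigram costs play the role of the small model
# compiled into the search network; the bigram table plays the larger model.
UNIGRAM = {"recognize": 2.0, "wreck": 1.0, "speech": 1.5, "beach": 1.0}
BIGRAM = {("<s>", "recognize"): 1.5, ("recognize", "speech"): 0.25}
BACKOFF_PENALTY = 0.5  # assumed back-off cost for unseen bigrams

# Each slot lists competing words with their (assumed) acoustic costs.
SLOTS = [
    [("recognize", 1.0), ("wreck", 1.0)],
    [("speech", 1.0), ("beach", 1.0)],
]


def larger_model_cost(history, word):
    """Look up the larger model, backing off to the unigram cost if unseen."""
    prev = history[-1] if history else "<s>"
    if (prev, word) in BIGRAM:
        return BIGRAM[(prev, word)]
    return BACKOFF_PENALTY + UNIGRAM[word]


def search(slots, beam=10.0, rescore=True):
    hyps = [(0.0, [])]  # (accumulated cost, word history)
    for slot in slots:
        expanded = []
        for cost, hist in hyps:
            for word, acoustic in slot:
                # First-pass score: acoustic cost plus the unigram cost.
                new_cost = cost + acoustic + UNIGRAM[word]
                if rescore:
                    # On-the-fly correction: replace the unigram cost with
                    # the larger model's cost for this word and history.
                    new_cost += larger_model_cost(hist, word) - UNIGRAM[word]
                expanded.append((new_cost, hist + [word]))
        # Prune against the corrected scores.
        best = min(c for c, _ in expanded)
        hyps = [(c, h) for c, h in expanded if c <= best + beam]
    return min(hyps, key=lambda h: h[0])


print(search(SLOTS, rescore=False))  # (4.0, ['wreck', 'beach'])
print(search(SLOTS, rescore=True))   # (3.75, ['recognize', 'speech'])
```

Because the correction is applied as each word is hypothesized, pruning already reflects the larger model's scores, which is why the approach can help with both path selection and pruning.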
With manycore graphics processing units (GPUs) now a commodity resource, hybrid GPU/CPU computational architectures are a practical solution for many computing tasks. By leveraging the most appropriate architecture for each computational sub-task, significantly higher throughput can be achieved than by using either platform alone. Prior works [7, 8] have demonstrated the efficiency of using GPU processors for speech recognition and obtained significant improvements in throughput for limited vocabulary tasks [7]. The limited memory on these architectures, however, becomes a significant bottleneck when large acoustic and language models are applied during recognition, and also for other large-scale graph traversal computations. The most significant challenge for such computations is handling the extremely large language models used in modern broad-domain speech recognition systems [1, 2, 4]. These models can contain millions of unique vocabulary entries and billions of n-gram contexts, and can easily require 20 GB or more to store in memory. Even when significantly pruned, these models cannot fit within the limited memory available on GPU platforms. To efficiently perform speech recognition with large acoustic and language models, we have developed a hybrid GPU/CPU architecture that combines the large memory and local cache of the CPU with the computational throughput of GPU architectures.