There is growing consumer demand for embedded automatic speech recognition (ASR) in mobile electronic devices, such as mobile phones, dictation machines, PDAs (personal digital assistants), mobile game consoles, etc. For example, email and text message dictation, note taking, form filling, and command and control applications are all potential applications of embedded ASR. However, when a medium to large vocabulary is required, effective speech recognition on mobile electronic devices faces many difficulties not associated with speech recognition systems on hardware such as personal computers or workstations. Firstly, the available power in mobile systems is often supplied by battery, and may be severely limited. Secondly, mobile electronic devices are frequently designed to be as small as practically possible. Thus, the memory and resources of such mobile embedded systems tend to be very limited, due to power and space restrictions. The cost of providing extra memory and resources in a mobile electronic device is typically much higher than that for a less portable device without this space restriction. Thirdly, mobile hardware is typically used in noisier environments than a fixed computer, e.g. on public transport, near a busy road, etc. Thus, a more complex speech model and more intensive computation may be required to obtain adequate speech recognition results.
Recognizing an utterance in a speech signal requires searching a two-dimensional grid of frame indexes (time) and potential models (words) to determine the best path matching the utterance to the models. The best path, i.e., the path with the highest probability, determines the result of the recognition. The Viterbi time-synchronous search algorithm, based on dynamic programming, is widely used in ASR. However, if this search algorithm is used without modification, the search space is very large and thus requires a lot of memory, because at every frame the path information associated with nearly all possible models/states must be stored. Further, the large search space requires a lot of Gaussian computation to evaluate the likelihood of each frame against all active hypotheses on all active paths, and performance may be unacceptable.
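The time-synchronous search described above can be sketched as follows. This is a minimal illustrative implementation, not the method of any particular system: the toy log-likelihood table and transition matrix are assumed values standing in for the Gaussian scores and word/state models a real recognizer would use.

```python
import math

# Assumed toy data: log observation likelihoods per frame (rows) and state
# (columns), and log transition probabilities between three states.
log_likelihood = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.6), math.log(0.3)],
    [math.log(0.2), math.log(0.1), math.log(0.7)],
]
log_trans = [[math.log(p) for p in row] for row in
             [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.1, 0.2, 0.7]]]

def viterbi(log_likelihood, log_trans):
    n_states = len(log_trans)
    # Initialize with a uniform prior plus the first frame's likelihoods.
    scores = [log_likelihood[0][s] - math.log(n_states) for s in range(n_states)]
    backptr = []  # per frame, the best predecessor of each state
    for frame in log_likelihood[1:]:
        prev, scores, bp = scores, [], []
        for s in range(n_states):
            # Dynamic programming step: best predecessor for state s.
            best = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            bp.append(best)
            scores.append(prev[best] + log_trans[best][s] + frame[s])
        backptr.append(bp)
    # Trace the best path back from the final frame.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for bp in reversed(backptr):
        state = bp[state]
        path.append(state)
    return list(reversed(path)), max(scores)

path, score = viterbi(log_likelihood, log_trans)
print(path)  # one winning state per frame
```

Note that without pruning, every state's score and back-pointer is computed and stored at every frame, which is the memory and computation cost the pruning techniques below aim to reduce.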
Pruning is introduced to this process to reduce the size of the search space by constraining the search to models with high probability. Pruning approaches typically result in significant reductions in memory requirements and CPU time. The challenge for pruning is how to significantly reduce the search space without pruning the correct search path out of it. When resources are not critical, a safe pruning setting with a large margin is usually applied. Currently, the two primary pruning techniques used in ASR are beam pruning and rank pruning. Both techniques try to retain the most likely hypotheses, i.e., hypothesized models, at each frame. With beam pruning, the best hypothesis is first identified. Each of the other hypotheses is then compared with this best hypothesis, and if the difference between the two is below a predetermined threshold, TH, i.e., the beam width, the hypothesis survives the pruning; otherwise, it is pruned out of the search process. With rank pruning, a predetermined number, N, of surviving hypotheses is chosen, and at every frame only the top N hypotheses are retained.
While beam pruning and rank pruning both reduce the memory requirements and computation needed to perform the search, better pruning techniques for ASR are desirable to make efficient use of the limited memory and CPU resources of mobile electronic devices.