1. Technical Field
Exemplary embodiments of the present invention relate to speech recognition, and more particularly to a system and method that reduce the number of Gaussian calculations needed, thereby increasing computational efficiency in multi-stream speech recognition tasks.
2. Description of the Related Art
Recently, there has been significant interest in the use of multi-stream hidden Markov models (HMMs) for automatic speech recognition (ASR). For example, such models have been applied successfully to multi-band ASR, to separate modeling of static and dynamic acoustic features, and to audio-visual ASR.
When applied to audio-visual speech recognition, the multi-stream approach provides an effective paradigm for fusing and modeling the two separate information sources carried in the audio and visual observations. Specifically, it has been demonstrated that multi-stream decision fusion attains a significant improvement in recognition accuracy over state-of-the-art single-stream fusion methods, e.g., hierarchical linear discriminant analysis (HiLDA).
However, the gain in recognition performance is achieved at the cost of higher computational complexity due to the separate statistical modeling of the two observation streams. For instance, in the audio-visual ASR system described in Potamianos et al., “Recent advances in the automatic recognition of audio-visual speech,” Proc. IEEE, 91(9): 1306-1326, 2003, the signal processing front end produces audio and visual observation vectors with 60 and 41 dimensions, respectively. In HiLDA fusion, the joint audio-visual observations of 101 dimensions are projected to a 60-dimensional audio-visual feature space, which can be modeled by single-stream HMMs with a number of Gaussian densities similar to that of the audio-only system.
On the other hand, multi-stream HMMs model each of the two modalities in its original feature space. Hence, the number of Gaussian components required is roughly doubled in order to preserve the same modeling resolution in the output densities. For a typical decoding algorithm, the time complexity is roughly linear in the total number of Gaussians in the system. Therefore, without special treatment, an audio-visual system based on two-stream HMMs will require approximately twice the computational load of a comparable single-stream system in the recognition stage.
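The doubling described above can be illustrated with a minimal Python sketch. It assumes diagonal-covariance Gaussian mixtures and combines the per-stream log-likelihoods as a weighted sum, which is the form commonly used for multi-stream HMMs. The function names, the stream exponents lam_audio and lam_visual, and the mixture layout are hypothetical choices made for illustration; they are not taken from the system cited above.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, gmm):
    """Log-likelihood of x under a mixture given as [(weight, mean, var), ...]."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # subtract the peak before exponentiating, for stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def two_stream_loglik(obs_audio, obs_visual, gmm_audio, gmm_visual,
                      lam_audio=0.7, lam_visual=0.3):
    """Weighted sum of per-stream log-likelihoods (exponents are assumed values)."""
    return (lam_audio * gmm_loglik(obs_audio, gmm_audio)
            + lam_visual * gmm_loglik(obs_visual, gmm_visual))


def gaussians_evaluated(*gmms):
    """Number of Gaussian densities evaluated for one frame and one state."""
    return sum(len(gmm) for gmm in gmms)
```

With the same number of components per stream, gaussians_evaluated(gmm_audio, gmm_visual) is twice gaussians_evaluated(gmm_audio), which mirrors the roughly doubled recognition-time load of a two-stream system.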
Effectively managing the computational load is needed for the development of real-time audio-visual ASR systems. Because visual processing is expected to take a sizeable portion of the available computing power, it becomes even more imperative to improve the efficiency of algorithms involved in the decoding process, which include likelihood computation and search.
Algorithms exist for fast evaluation of Gaussians in single-stream HMMs. One class of algorithms, which includes the roadmap algorithm and the hierarchical labeling algorithm, exploits the fact that at a given frame only a small subset of the Gaussian components in the total Gaussian pool is significant to the likelihood computations. Naturally, these algorithms may be applied directly to each individual stream in the multi-stream HMM. Moreover, the synchronized and parallel nature of the observation streams in multi-stream HMMs provides a fresh dimension along which to formulate new approaches that further improve computational efficiency.
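The observation that only a few components matter at a given frame can be made concrete with a short Python sketch. It is neither the roadmap algorithm nor the hierarchical labeling algorithm; it only shows, under the assumption that per-component log scores are already available, how little a mixture log-likelihood changes when all but the largest terms are dropped. The function names and the top_n parameter are hypothetical.

```python
import heapq
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def shortlist_loglik(component_scores, top_n):
    """Approximate the mixture log-likelihood using only the top_n largest terms."""
    kept = heapq.nlargest(top_n, component_scores)
    return log_sum_exp(kept)
```

A fast-evaluation method still has to identify the significant components cheaply, without scoring the whole pool first; that step is not shown here.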