Recently, deep neural networks (DNNs) and convolutional neural networks (CNNs) have shown significant improvements in automatic speech recognition (ASR) performance over Gaussian mixture models (GMMs), which were previously the state of the art in acoustic modeling. Further performance improvements can be obtained by combining two or more neural networks that differ in type (e.g., DNN, CNN, recurrent neural network, bi-directional network), input features (e.g., perceptual linear prediction (PLP), Mel-frequency cepstral coefficients (MFCC), log-mel spectra) or modality (e.g., audio features, visual features). This combination can be done in several ways.
Feature fusion: here the feature vectors from the different feature representations are concatenated to form a single feature vector, which is fed into a single neural network. The drawback is that only one network is trained, which can be suboptimal if the different feature representations or modalities are best modeled by different neural network architectures (e.g., image features are best modeled with convolutional networks whereas speech features such as PLP are best modeled with DNNs).
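As a minimal sketch of the concatenation step only, the following assumes that the feature streams are frame-synchronous (same number of frames, one vector per frame). The function name `fuse_features` and the toy PLP and MFCC values are hypothetical and chosen purely for illustration; the single network that would consume the fused vectors is not shown.

```python
from typing import List, Sequence


def fuse_features(streams: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """Concatenate frame-synchronous feature streams frame by frame.

    streams[i][t] is the feature vector of stream i at frame t.
    Assumption: all streams have the same number of frames.
    """
    num_frames = len(streams[0])
    if any(len(stream) != num_frames for stream in streams):
        raise ValueError("all streams must have the same number of frames")

    fused = []
    for t in range(num_frames):
        frame: List[float] = []
        for stream in streams:
            frame.extend(stream[t])  # append this stream's vector for frame t
        fused.append(frame)
    return fused


# Toy example: 2 frames, a 2-dimensional "PLP" stream and a 3-dimensional "MFCC" stream.
plp = [[0.1, 0.2], [0.3, 0.4]]
mfcc = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
fused = fuse_features([plp, mfcc])
print(len(fused), len(fused[0]))  # number of frames, fused dimensionality
```

The fused dimensionality is the sum of the per-stream dimensionalities, so every stream is forced through the same network architecture, which is the drawback noted above.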
Model fusion: the outputs of two or more neural networks that are trained separately on the same or on different feature representations are combined to form a single output. The combination can be done either with simple weighted linear or log-linear rules or by training a new classifier on top of the networks to perform the score combination. This classifier can also be a neural network. The drawback is that the component neural networks whose scores are combined are trained in isolation rather than jointly to optimize a common objective function.
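The weighted linear and log-linear rules can be sketched as follows for a single frame. This is an illustration under stated assumptions: each network is assumed to output a posterior distribution over the same set of classes, the weights are assumed to be fixed in advance and to sum to one, and the names `linear_fusion` and `log_linear_fusion` as well as the toy posteriors are hypothetical.

```python
import math
from typing import List, Sequence


def linear_fusion(posteriors: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Weighted sum of per-network posteriors: p(k) = sum_i w_i * p_i(k)."""
    num_classes = len(posteriors[0])
    return [sum(w * p[k] for w, p in zip(weights, posteriors))
            for k in range(num_classes)]


def log_linear_fusion(posteriors: Sequence[Sequence[float]],
                      weights: Sequence[float],
                      eps: float = 1e-12) -> List[float]:
    """Weighted sum of log posteriors, renormalized: p(k) ∝ prod_i p_i(k)^w_i."""
    num_classes = len(posteriors[0])
    log_scores = [sum(w * math.log(p[k] + eps) for w, p in zip(weights, posteriors))
                  for k in range(num_classes)]
    max_score = max(log_scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - max_score) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Toy posteriors over three classes from two separately trained networks.
dnn_post = [0.7, 0.2, 0.1]
cnn_post = [0.5, 0.4, 0.1]
weights = [0.5, 0.5]

linear = linear_fusion([dnn_post, cnn_post], weights)
log_linear = log_linear_fusion([dnn_post, cnn_post], weights)
print(linear)
print(log_linear)
```

In both rules the component networks are treated as fixed; only the combination weights are chosen, which reflects the drawback that the networks themselves are not trained toward a common objective.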
Word-level fusion: here separate ASR systems are trained with separate neural networks and separate feature streams, and their outputs are combined at the word-sequence level by aligning the individual word sequences and selecting the word with the most votes within each alignment bin. As before, the drawback is that the separate networks are trained in isolation and do not optimize a common objective function.
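The voting step of word-level fusion can be sketched as below. The sketch assumes that the alignment has already been performed, so that every hypothesis has the same number of bins and a placeholder token marks bins in which a system produced no word. The placeholder `NULL`, the function name `vote`, and the example hypotheses are hypothetical, and the alignment procedure itself is not shown.

```python
from collections import Counter
from typing import List, Sequence

NULL = "<eps>"  # hypothetical placeholder for "no word" in an alignment bin


def vote(aligned_hyps: Sequence[Sequence[str]]) -> List[str]:
    """Select the most frequent word in each alignment bin.

    aligned_hyps[i][b] is the word (or NULL) from system i in bin b.
    Ties go to the word seen first, i.e. the earliest-listed system,
    because Counter.most_common orders equal counts by first occurrence.
    """
    num_bins = len(aligned_hyps[0])
    if any(len(hyp) != num_bins for hyp in aligned_hyps):
        raise ValueError("all hypotheses must have the same number of bins")

    result = []
    for bin_words in zip(*aligned_hyps):
        word, _count = Counter(bin_words).most_common(1)[0]
        if word != NULL:  # a winning NULL means no word is emitted for this bin
            result.append(word)
    return result


# Three pre-aligned hypotheses from three separately trained systems.
hyps = [
    ["the", "cat", "sat", NULL],
    ["the", "cap", "sat", "down"],
    ["a",   "cat", "sat", "down"],
]
combined = vote(hyps)
print(" ".join(combined))
```

Each system contributes one equal vote per bin in this sketch; weighting votes by system confidence would be a possible variation, but it is not described in the text above.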