1. Field of the Disclosure
The present disclosure relates to processing multi-modal inputs and more specifically to a tiered approach to incorporating outputs from multiple classifiers, such as detecting voice activity via an audio classifier and a visual classifier.
2. Introduction
Voice activity detection (VAD) attempts to detect human voice activity. Detecting human voice activity has multiple applications; one specific example is knowing when to engage a speech recognizer. Given a high acoustic signal-to-noise ratio (SNR), the information carried by an acoustic signal provides excellent data on which to base voice activity detection. However, audio-only VAD (A-VAD) performance decreases rapidly as acoustic SNR decreases.
Much in the same way that humans detect voice activity in each other, VAD can rely on multiple modalities, such as acoustic and visual information, an approach known as audio-visual voice activity detection (AV-VAD). However, when computer AV-VAD systems process multiple modalities, one large question is how to fuse the information that those modalities provide. Existing AV-VAD systems address this problem via feature fusion or decision fusion, but fail to incorporate or consider features together with classifier output.
Some existing approaches for fusing features extracted from multiple modalities are naïve approaches, like feature concatenation and majority voting, while others are more sophisticated and blend the responses based on acoustic SNR or feature uncertainty. However, all of these approaches assume either prior knowledge of the acoustic SNR or a predetermined model of feature uncertainty. Furthermore, all of the approaches consider only a few multimodal features, do not utilize the broad set of available information, and fail to consider interactions between features.
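By way of illustration only, the fusion strategies named above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from any existing AV-VAD system: the function names, the score conventions, and the linear SNR weighting with its 0 dB and 20 dB bounds are all hypothetical and chosen for clarity.

```python
from typing import List, Sequence


def concatenate_features(audio: Sequence[float], visual: Sequence[float]) -> List[float]:
    """Feature fusion: join the per-modality feature vectors into one vector."""
    return list(audio) + list(visual)


def majority_vote(decisions: Sequence[bool]) -> bool:
    """Decision fusion: report voice activity only when strictly more than
    half of the classifiers report it (a tie counts as no activity)."""
    return sum(decisions) * 2 > len(decisions)


def snr_weighted_score(audio_score: float, visual_score: float, snr_db: float,
                       low_db: float = 0.0, high_db: float = 20.0) -> float:
    """SNR-based blending: weight the audio score more heavily as SNR rises.

    The weight grows linearly from 0 at low_db to 1 at high_db (assumed
    bounds). Note that snr_db must be supplied by the caller, which is the
    prior-knowledge assumption discussed above.
    """
    weight = min(1.0, max(0.0, (snr_db - low_db) / (high_db - low_db)))
    return weight * audio_score + (1.0 - weight) * visual_score
```

In this sketch, concatenation and voting ignore signal quality entirely, while the weighted blend depends on an SNR value that must be known in advance, which illustrates the limitation described above.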