The task of categorizing sound is important in a variety of applications, including speech recognition, audio retrieval, and robotics. Consider, for example, the specific task of discriminating speech from non-speech in humanoid robot applications. This is a crucial ability for a robot, as the appropriate responses to speech and other environmental sounds are very different.
The task of discriminating speech from non-speech is difficult for several reasons. For instance, the robot may be situated in a noisy and reverberant environment, the speaker or sound source may be quite far from the robot, and the robot itself may contain many noise generators such as motors and fans. Any combination of these conditions may be present at once.
An important decision in any auditory application is the choice of a mathematical representation of sound. Over the past few years, there has been substantial activity and progress in understanding how sound is represented in the mammalian cortex.
One model, developed by Shamma et al., is based on recent psychoacoustic and neurophysiological knowledge about the early and central stages of the mammalian auditory system. In this model, each short (8 ms) time-slice is represented as a three-dimensional tensor in frequency, rate, and scale space. The representation is very high-dimensional: the model generally uses 128 frequencies, 12 rates, and 5 scales, so each time-slice is 7,680-dimensional (128 × 12 × 5).
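The dimension count above can be checked with a few lines of Python. This is only an arithmetic sketch: the axis sizes come from the text, while the names `AXES`, `slice_dimensionality`, and `unfolding_shape` are illustrative and not part of the cited model. The second function shows the matrix shape obtained when the tensor is flattened along one axis, which is the form a tensor decomposition typically operates on.

```python
from math import prod

# Axis sizes quoted in the text; the dictionary and its keys are illustrative.
AXES = {"frequency": 128, "rate": 12, "scale": 5}

def slice_dimensionality(axes):
    """Total number of entries in one time-slice tensor."""
    return prod(axes.values())

def unfolding_shape(axes, mode):
    """Shape of the matrix obtained by flattening the tensor along `mode`.

    Rows index the chosen axis; columns index every other axis combined.
    """
    rows = axes[mode]
    return rows, slice_dimensionality(axes) // rows

print(slice_dimensionality(AXES))        # 7680
print(unfolding_shape(AXES, "scale"))    # (5, 1536)
```

The three possible unfoldings have shapes 128 × 60, 12 × 640, and 5 × 1536, each containing the same 7,680 values.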
This model was used by Mesgarani et al. to discriminate speech from non-speech. In that particular application, the discrimination was done in two stages. First, dimensionality reduction was performed using a Higher-Order Singular Value Decomposition (HOSVD), an analog of the standard SVD that respects the tensor nature of the cortical representation. In the second stage, a Support Vector Machine classifier with a Gaussian kernel was applied.
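To make the two stages concrete, the following is a minimal NumPy sketch of a truncated HOSVD applied to a single frequency × rate × scale tensor, together with the Gaussian kernel function that a kernel SVM would evaluate on the reduced features. It is an illustration under stated assumptions, not the procedure of the cited system: the function names, the random input, and the ranks `(10, 4, 3)` are hypothetical, and a practical system would presumably learn the per-mode bases from training data rather than from one slice.

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten `tensor` into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along `mode` by `matrix` (shape: new_dim x old_dim)."""
    moved = np.moveaxis(tensor, mode, 0)
    result = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(result, 0, mode)

def truncated_hosvd(tensor, ranks):
    """Return per-mode bases (leading left singular vectors) and the reduced core."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = tensor
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)  # project onto the kept subspace
    return factors, core

def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel value between two feature vectors."""
    diff = np.asarray(a) - np.asarray(b)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * bandwidth ** 2)))

# Hypothetical input: random values standing in for one cortical time-slice.
rng = np.random.default_rng(0)
time_slice = rng.standard_normal((128, 12, 5))
factors, core = truncated_hosvd(time_slice, ranks=(10, 4, 3))
features = core.ravel()  # reduced feature vector for a classifier
print(core.shape, features.size)
```

Each of the three ranks and the kernel bandwidth is a separate value that has to be chosen, which is where the tuning burden of this kind of pipeline comes from.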
Although the combined HOSVD and Gaussian-kernel SVM system achieved state-of-the-art performance, it has a number of drawbacks, including relatively high conceptual and computational complexity. For instance, the system has several tunable parameters: the number of components to keep must be determined separately for the scale, rate, and frequency subspaces, and the bandwidth of the SVM Gaussian kernel must be chosen. In addition, the dimensionality reduction of the first stage generally produces a complex result that can be difficult to interpret.
What is needed, therefore, are sound discrimination techniques that are more accurate, conceptually simpler, and computationally simpler.