Speech recognition attempts to label continuous audio signals with discrete labels (typically phones or words). Many properties of speech audio signals corresponding to discrete labels vary depending on the speaker, the tone of the utterance, the linguistic context of the phone or word, and other factors. Features of the spectrogram of the audio signal, however, are conserved across many of these contextual factors. Spectral information is therefore extracted by both artificial speech recognition systems and the human ear as a pre-processing step in speech perception.
The power spectrum of a short (10-50 ms) segment of an audio signal containing speech typically has two or three identifiable peaks, called formants. There is also power at frequencies near these peaks; in general, this information is redundant and can be considered noise, as the formants are sufficient to differentiate most speech sounds. The power spectrum, therefore, contains both a useful signal and noise correlated with that signal.
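As a rough illustration of peaks in a short-time power spectrum, the following sketch computes the power spectrum of a single 25 ms frame and picks out its local maxima. Everything in it is an assumption made for illustration: the 8 kHz sample rate, the Hann window, the helper names `power_spectrum` and `spectral_peaks`, and above all the input, which is a toy sum of two sinusoids at formant-like frequencies (720 Hz and 1200 Hz) rather than real speech.

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum |X[k]|^2 of a real frame, via a direct O(N^2) DFT."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_peaks(power, sample_rate, n_samples, rel_threshold=0.01):
    """Frequencies (Hz) of strict local maxima above rel_threshold * max power."""
    floor = rel_threshold * max(power)
    return [
        k * sample_rate / n_samples
        for k in range(1, len(power) - 1)
        if power[k] > floor
        and power[k] > power[k - 1]
        and power[k] > power[k + 1]
    ]


SAMPLE_RATE = 8000  # Hz (assumed for this sketch)
N = 200             # 25 ms frame at 8 kHz

# Toy stand-in for a voiced speech frame: two Hann-windowed sinusoids.
frame = [
    (0.5 - 0.5 * math.cos(2 * math.pi * i / N))
    * (math.sin(2 * math.pi * 720 * i / SAMPLE_RATE)
       + 0.5 * math.sin(2 * math.pi * 1200 * i / SAMPLE_RATE))
    for i in range(N)
]

power = power_spectrum(frame)
peaks = spectral_peaks(power, SAMPLE_RATE, N)
print(peaks)
```

In this toy case the window spreads each sinusoid's power into the bins next to its peak, which loosely mirrors the correlated power near formants described above. Real formants are resonances of the vocal tract and appear as broader peaks in the spectral envelope.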
To increase the signal-to-noise ratio, the power spectrum can be decorrelated by projecting it onto a set of basis functions using inverse Fourier techniques. The coefficients on these basis functions are called “cepstral coefficients,” and are the most frequently used feature vector representation in automatic speech recognition systems.
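One common way to realize this projection is the real cepstrum, the inverse discrete Fourier transform of the log power spectrum. The sketch below assumes that formulation; practical frontends often add a mel-scaled filterbank and use a discrete cosine transform instead, and the text does not say which variant is intended. The function name `cepstral_coefficients`, the choice of 13 coefficients, and the synthetic input spectrum are all illustrative assumptions.

```python
import math


def cepstral_coefficients(power, n_coeffs=13, floor=1e-12):
    """First n_coeffs values of the real cepstrum of a power spectrum.

    `power` must hold all N bins of the power spectrum of a real signal.
    Its logarithm is then symmetric, so the inverse DFT reduces to a
    projection onto cosine basis functions.
    """
    n = len(power)
    # Floor the power to avoid log(0) in empty bins.
    log_power = [math.log(max(p, floor)) for p in power]
    return [
        sum(lp * math.cos(2 * math.pi * k * q / n)
            for k, lp in enumerate(log_power)) / n
        for q in range(n_coeffs)
    ]


N = 64
# Hypothetical smooth spectrum: one cosine ripple in the log domain,
# so neighbouring bins carry strongly correlated values.
power = [math.exp(math.cos(2 * math.pi * 3 * k / N)) for k in range(N)]

coeffs = cepstral_coefficients(power)
for q, c in enumerate(coeffs):
    print(q, round(c, 6))
```

In this example the 64 correlated spectral values collapse onto a single non-negligible coefficient (index 3), which is the sense in which the projection decorrelates the spectrum.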
Cepstral coefficients and other feature vectors form the “frontend” of an automatic speech recognition system. The “backend” assigns discrete phone and word labels to sequences of feature vectors using statistical techniques. Currently, artificial neural networks are the primary computational model used in the backend of successful speech recognition systems.
Spiking neural networks are a class of artificial neural networks that have seen recent success in image classification and control problems (Hunsberger and Eliasmith, 2015; DeWolf, 2015). In addition to being well suited to continuous, time-varying inputs, they communicate through asynchronous transmission of information packets (i.e., spikes). Asynchronous communication among a large number of simple neural units operating in parallel has been implemented in a class of hardware devices called neuromorphic systems. Neuromorphic systems simulate spiking neural networks using orders of magnitude less power than traditional computing devices.
Implementing a frontend that represents auditory signals and their features in spiking and non-spiking networks would permit a unified realization of a speech recognition system and allow efficient systems to be built. For example, a spiking frontend can be efficiently realized in neuromorphic hardware.