Audio signal classification methods are designed under different assumptions, such as real-time versus off-line operation and differing memory and complexity requirements.
For a classifier used in audio coding, the decision typically has to be made on a frame-by-frame basis, based entirely on past signal statistics. Many audio coding applications, such as real-time coding, also impose tight constraints on the computational complexity of the classifier.
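The frame-by-frame, past-only constraint can be illustrated with a minimal sketch. It is not the method of any cited reference. The class name `CausalFrameClassifier`, the use of frame energy as the only feature, the parameters `alpha` and `threshold`, and the rule that a strongly fluctuating energy suggests speech are all assumptions made purely for illustration. The point is only that each decision uses the current and earlier frames, with constant memory and a few arithmetic operations per frame.

```python
from dataclasses import dataclass


@dataclass
class CausalFrameClassifier:
    """Toy causal classifier: the decision for frame n uses only frames <= n."""
    alpha: float = 0.1       # smoothing factor for the running statistics (assumed)
    threshold: float = 0.5   # threshold on normalized energy variation (assumed)
    mean: float = 0.0        # running mean of the frame energy
    var: float = 0.0         # running variance of the frame energy
    initialized: bool = False

    def update(self, frame):
        """Consume one frame of samples and return a label for that frame."""
        energy = sum(x * x for x in frame) / len(frame)
        if not self.initialized:
            # The first frame only seeds the statistics.
            self.mean, self.var, self.initialized = energy, 0.0, True
        else:
            # Exponentially weighted mean and variance: O(1) memory, no look-ahead.
            delta = energy - self.mean
            self.mean += self.alpha * delta
            self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        variation = (self.var ** 0.5) / (self.mean + 1e-12)
        # Illustrative rule only: strongly fluctuating energy is labeled "speech".
        return "speech" if variation > self.threshold else "music"


if __name__ == "__main__":
    clf = CausalFrameClassifier()
    # Frames whose energy alternates between 0 and 1.
    frames = [[float(i % 2)] * 4 for i in range(10)]
    print([clf.update(f) for f in frames])
```

Because the state consists of two running statistics, the per-frame cost does not grow with the length of the signal, which is the kind of property a real-time coder would require of its classifier.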
Reference [1] describes a complex speech/music discriminator (classifier) based on a multidimensional Gaussian maximum a posteriori estimator, a Gaussian mixture model classifier, a spatial partitioning scheme based on k-d trees, or a nearest-neighbor classifier. To obtain an acceptable decision error rate, it is also necessary to include audio signal features that require a large latency.
Reference [2] describes a speech/music discriminator partially based on Line Spectral Frequencies (LSFs). However, determining LSFs is a rather complex procedure.
Reference [5] describes voice activity detection based on the Amplitude-Modulated (AM) envelope of a signal segment.