The general problem of automatic audio classification has been actively studied in the literature. For example, Guo et al., in the article “Content-based audio classification and retrieval by support vector machines” (IEEE Transactions on Neural Networks, Vol. 14, pp. 209-215, 2003), have proposed a method for classifying audio signals using a set of trained support vector machines with a binary tree recognition strategy. However, most previous work has been directed toward analyzing recordings of sounds with little background interference or device variance, and do not perform well in the presence of background noise.
Other research, such as the work described by Tzanetakis et al. in the article “Musical genre classification of audio signals” (IEEE Transactions on Speech and Audio Processing, Vol. 10, pp. 293-302, 2002), has been restricted to music genre classification. The approaches developed for classifying music are generally not well-suited or robust for use with more general types of audio signals, particularly with audio signals including a mixture of different sounds in the presence of background noise.
For multimedia surveillance, some methods have been developed to identify individual audio events. For example, the work of Valenzise et al., in the article “Scream and gunshot detection and localization for audio surveillance systems” (IEEE Conference on Advanced Video and Signal Based Surveillance, 2007), uses a microphone array to locate the identified audio scream and gunshot. Atrey et al., in the article “Audio based event detection for multimedia surveillance” (IEEE International Conference on Acoustics, Speech and Signal Processing, 2006), disclose a method for hierarchically classifying audio events. Eronen et al., in the article “Audio-based context recognition” (IEEE Trans. On Audio, Speech and Language Processing, 2006), describe a method for classifying the context or environment of an audio device. Whether these methods use single or multiple microphones, they are adapted to identify individual audio events independently. That is, each audio event is independently detected from the background noise. In the case where there are multiple audio events of interest occurring together, the performance of these methods will suffer.
Chang et al., in the article “Large-scale multimodal semantic concept detection for consumer video” (Proc. International Workshop on Multimedia Information Retrieval, pp. 255-264, 2007), describe a method for detecting semantic concepts in video clips using both audio and visual signals.
There remains a need for an audio-based classification method that is more reliable and more efficient for general types of audio signals where there can be mixed sounds from multiple sound sources with severe background noises.