A structural analysis of records of digital audio data, such as audio streams or digital audio data files, prepares the ground for many audio processing technologies, for example automatic speaker verification, speech-to-text systems, audio content analysis and speech recognition. Audio content analysis extracts information concerning the nature of the audio signal directly from the audio signal itself. The information is derived by identifying the various origins of the audio data with respect to different audio classes, such as speech, music, environmental sound and silence. In many applications, for example speaker recognition, speech processing, or applications that provide a preliminary step in identifying the corresponding audio classes, a coarse classification is preferred that distinguishes only between audio data related to speech events and audio data related to non-speech events.
In automatic audio analysis, spoken content typically alternates with other audio content in an unpredictable manner. Furthermore, many environmental factors usually interfere with the speech signal, making reliable identification of the speech signal extremely difficult. These environmental factors are typically ambient noise, such as environmental sounds or music, but also time-delayed copies of the original speech signal produced by a reflective acoustic surface between the speech source and the recording instrument. To classify audio data, so-called audio features are extracted from the audio data itself and then compared by pattern matching to audio class models, for example a speech model or a music model. A subsection of the record of digital audio data is typically assigned to one of the audio class models based on the degree of similarity between the extracted audio features and the audio features of the model. Typical methods include Dynamic Time Warping (DTW), Hidden Markov Models (HMM), artificial neural networks, and Vector Quantisation (VQ).
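As a rough illustration of the feature extraction and pattern matching described above, the following minimal sketch assigns a record to the audio class model with the lowest average Vector Quantisation distortion. The two features (log energy and zero-crossing rate per frame), the frame length, the function names and the hand-picked codebook values in `MODELS` are all assumptions made for illustration; they are not taken from the text above, and a practical system would train its codebooks on recorded data.

```python
import math


def frame_features(samples, frame_len=160):
    """Return one (log-energy, zero-crossing rate) pair per non-overlapping frame.

    The feature choice and the frame length are illustrative assumptions.
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Mean power of the frame, on a log scale (small offset avoids log(0)).
        energy = sum(x * x for x in frame) / frame_len
        log_energy = math.log10(energy + 1e-12)
        # Fraction of adjacent sample pairs whose signs differ.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        )
        zcr = crossings / (frame_len - 1)
        feats.append((log_energy, zcr))
    return feats


def distortion(feats, codebook):
    """Average squared distance from each feature vector to its nearest codeword."""
    total = 0.0
    for f in feats:
        total += min(
            sum((fi - ci) ** 2 for fi, ci in zip(f, c)) for c in codebook
        )
    return total / len(feats)


def classify(feats, models):
    """Pick the class whose codebook matches the features most closely."""
    return min(models, key=lambda name: distortion(feats, models[name]))


# Hypothetical, hand-picked codebooks (NOT trained on real data).
MODELS = {
    "speech": [(-1.0, 0.05), (-0.8, 0.15)],
    "non_speech": [(-0.6, 0.9), (-2.0, 0.5)],
}
```

Calling `classify(frame_features(samples), MODELS)` on a list of float samples returns the label of the closest model. Because the codebooks here are fixed toy values, the sketch also hints at the weakness of such classifiers: features measured in an acoustic environment unlike the one the models were built for may land far from every codeword.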
The performance of a state-of-the-art speech and sound classification system usually deteriorates significantly when the acoustic environment of the audio data to be examined deviates substantially from the training environment used for setting up the recording database to train the classifier. In practice, however, mismatches between the training environment and the current acoustic environment occur frequently.