1. Field of the Invention
The present invention relates to means for indexing audio streams without any restriction on input media, and more particularly, to a method and system for classifying and indexing audio streams so that desired audio events can subsequently be retrieved, summarized, skimmed and generally searched.
2. Description of the Related Art
Speech is distinguished from music in input data segments that a segmentation unit has segmented on the basis of the homogeneity of their properties. It is expected that all specific sound events, such as sirens, applause, explosions, shots, etc., are selected beforehand, as a rule, by specific demons, if such selection is required.
Most known approaches to distinguishing speech from music are based on speech detection, while the presence of music is defined by exception: if no feature essential to human speech is present, the sound stream is interpreted as music. Given the huge variety of music types, this approach is in principle acceptable for processing pragmatically expedient sound streams, such as radio/TV broadcasts or movie soundtracks. However, robust music/speech discrimination is so important to the correct operation of subsequent systems for speech recognition, speaker identification and music attribution that errors originating from these approaches disturb the normal functioning of those systems.
Among the approaches to speech detection are:
Determination of pitch presence in the audio signal. This method is based on the specific properties of the human vocal tract. Human vocal sound may be represented as a sequence of similar audio segments that follow one another at typical frequencies from 80 to 120 Hz.
Calculation of the percentage of “low-energy” frames. This parameter is higher for speech than for music.
Calculation of spectral “flux” as the vector of moduli of the differences between frame-to-frame amplitudes. This value is higher for music than for speech.
Investigation of 4 Hz peaks for perceptual channels.
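For illustration only, two of the features listed above, the percentage of low-energy frames and the spectral flux, might be computed roughly as in the following sketch. It is not taken from any of the known systems: the function names, the 64-sample frame length, the rule that a frame counts as "low-energy" when its RMS falls below half of the mean RMS, and the summing of the per-bin differences into one flux value per frame pair are all assumptions made for this example. A naive discrete Fourier transform is used so that only the Python standard library is required.

```python
import cmath
import math


def split_frames(samples, size, hop):
    """Cut a sample sequence into fixed-size frames (a trailing partial frame is dropped)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]


def rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def low_energy_ratio(samples, size=64, hop=64, factor=0.5):
    """Fraction of frames whose RMS is below factor * mean RMS (assumed definition)."""
    energies = [rms(f) for f in split_frames(samples, size, hop)]
    mean_energy = sum(energies) / len(energies)
    low = sum(1 for e in energies if e < factor * mean_energy)
    return low / len(energies)


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency bins of a naive DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_flux(samples, size=64, hop=64):
    """One value per consecutive frame pair: the sum of the absolute
    differences between their magnitude spectra (assumed aggregation)."""
    spectra = [magnitude_spectrum(f) for f in split_frames(samples, size, hop)]
    return [
        sum(abs(a - b) for a, b in zip(prev, cur))
        for prev, cur in zip(spectra, spectra[1:])
    ]


# Synthetic test signals (hypothetical, not real speech or music).
FRAME = 64
tone = [math.sin(2 * math.pi * 4 * t / FRAME) for t in range(FRAME)]
steady = tone * 8                     # eight identical tone frames
bursty = (tone + [0.0] * FRAME) * 4   # tone frames alternating with silence

print("low-energy ratio:", low_energy_ratio(steady), low_energy_ratio(bursty))
print("max flux:", max(spectral_flux(steady)), max(spectral_flux(bursty)))
```

The two synthetic signals only demonstrate the arithmetic; they are not meant to stand in for speech or music, and the thresholds a practical classifier would apply to these features are not given here.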
None of these or other approaches gives a reliable criterion for distinguishing speech from music; they take the form of probabilistic recommendations that hold in certain circumstances and are not universal.
The main advantage of the invented method is its high reliability in distinguishing speech from music.