In the past decade, relatively large amounts of multimedia data, such as text, images, video, and audio, have become available. Efficient organization and manipulation of this data is frequently required for many tasks, for example, data classification for storage or navigation purposes, differential processing based on content, and searching for specific information, among others.
A substantial portion of the data is audio originating from sources such as broadcasting channels, databases, Internet streams, commercial CDs, and the like. Responsive to a fast-growing demand for handling of the data, a relatively new field of research known as audio content analysis (ACA), or machine listening, has recently emerged. With ACA, it is possible to analyze the audio data and extract content information directly from the acoustic signal, to the point of creating a “Table of Contents” of the audio data.
Audio data (for example, from broadcasting) often contains alternating portions of different types or classes of audio content, such as speech and music. Generally, one of the fundamental tasks in manipulating such data is speech/music classification and segmentation, which is often a first step in processing the data. Such preprocessing may be desirable for applications requiring accurate demarcation of speech, for example, automatic transcription of broadcast news, speech and speaker recognition, word or phrase spotting, and the like. Similarly, it is useful in applications involving classification of music types, such as genre-based or mood-based classification. Audio content classification may also be of importance in applications that apply differential processing to audio data, such as content-based audio coding and compression, or automatic equalization of speech and music. In a further example, audio content classification can also serve for indexing other data, for example, classification of video content through the accompanying audio.
One of the challenges in speech/music classification is characterization of the music signal. Speech generally consists of a relatively distinctive and well-defined group of sounds and, as such, may be represented by relatively non-complex models. On the other hand, the assortment of sounds in music is much broader and less definite. Music can include sounds produced by a variety of instruments and, frequently, by many sources simultaneously. As such, devising a model that accurately represents and encompasses all kinds of music is relatively complex and may be difficult to achieve. Furthermore, the music may include superimposed speech (or speech may include superimposed music), making the model even more complex. As a result, many of the algorithmic solutions developed for speech/music classification are adapted to the specific application they are intended to serve.
The topic of audio content classification has been studied in the past. While the applications of audio content classification may differ, many studies use similar sets of acoustic features, such as short-time energy, zero-crossing rate, cepstrum coefficients, spectral roll-off, spectrum centroid, and “loudness”, alongside some unique features, such as “dynamism”. However, the exact combination of features used, as well as the size of the feature set, can vary greatly. Different studies propose various classification algorithms, although some popular classifiers (K-nearest neighbor, multivariate Gaussian, neural network) are often used as a basis. Furthermore, in many studies, different databases are used for training and for testing the algorithm, and the training and testing databases are generally relatively small.
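By way of illustration only, two of the acoustic features named above, short-time energy and zero-crossing rate, may be computed per frame roughly as in the following Python sketch. The function names, the frame length, the hop size, and the synthetic test tones are assumptions introduced for this example; they are not taken from any of the studies referred to, and practical systems typically apply windowing and additional features.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames (any ragged tail is dropped)."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Synthetic signals (assumed for illustration): a 100 Hz tone and a quieter 1 kHz tone.
sample_rate = 8000
low_tone = [math.sin(2 * math.pi * 100 * n / sample_rate) for n in range(800)]
high_tone = [0.5 * math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]

# 25 ms frames with 50% overlap (assumed values).
low_frames = frame_signal(low_tone, frame_len=200, hop=100)
high_frames = frame_signal(high_tone, frame_len=200, hop=100)

# One (energy, zero-crossing rate) pair per frame.
low_features = [(short_time_energy(f), zero_crossing_rate(f)) for f in low_frames]
high_features = [(short_time_energy(f), zero_crossing_rate(f)) for f in high_frames]
```

In this sketch the higher-frequency tone yields the larger zero-crossing rate and the louder tone yields the larger energy, which is the kind of per-frame contrast a classifier may then operate on.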
U.S. Pat. No. 6,901,362, “Audio Segmentation and Classification”, describes “A portion of an audio signal is separated into multiple frames from which one or more different features are extracted. These different features are used, in combination with a set of rules, to classify the portion of the audio signal into one of multiple different classifications (for example, speech, non-speech, music, environment sound, silence, etc.). In one embodiment, these different features include one or more of line spectrum pairs (LSPs), a noise frame ratio, periodicity of particular bands, spectrum flux features, and energy distribution in one or more of the bands. The line spectrum pairs are also optionally used to segment the audio signal, identifying audio classification changes as well as speaker changes when the audio signal is speech.”
US Patent Application Publication No. 2009/0006102, “Effective Audio Segmentation and Classification”, describes “A method (400) and system (200) for classifying an audio signal. The method (400) operates by first receiving a sequence of audio frame feature data, each of the frame feature data characterising an audio frame along the audio segment. In response to receipt of each of the audio frame feature data, statistical data characterising the audio segment is updated with the received frame feature data. The received frame feature data is then discarded. A preliminary classification for the audio segment may be determined from the statistical data. Upon receipt of a notification of an end boundary of the audio segment, the audio segment is classified (410) based on the statistical data.”
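The general idea of updating segment statistics as each frame feature arrives and then discarding the frame data may be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of the cited publication: it uses Welford's online algorithm for a running mean and variance of a single scalar feature, and the class name `RunningSegmentStats`, the function `classify_segment`, its threshold, and the decision rule are all hypothetical placeholders.

```python
class RunningSegmentStats:
    """Running mean and variance of a scalar frame feature (Welford's algorithm).

    Individual frame values are not stored, so each one can be discarded
    as soon as update() returns.
    """

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far (0.0 if none)."""
        return self._m2 / self.count if self.count else 0.0


def classify_segment(stats, variance_threshold=0.01):
    """Hypothetical placeholder rule: the threshold and labels are illustrative only."""
    return "speech" if stats.variance > variance_threshold else "music"


# Frame feature values arriving one at a time (made-up numbers for illustration).
stats = RunningSegmentStats()
for feature_value in [0.10, 0.12, 0.50, 0.08, 0.45]:
    stats.update(feature_value)  # the value is not needed after this call

# Classification once the end boundary of the segment has been signalled.
label = classify_segment(stats)
```

Because only the count, the mean, and one accumulator are retained, the memory used in this sketch does not grow with the length of the segment, and a preliminary label can be requested at any point before the segment ends.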