Automatic audio classification is a means of classifying a full audio clip, or a segment of an audio clip, into one of a set of predefined classes without human intervention. The predefined classes depend on the application that requires audio classification. For example, the classes can be speech/non-speech, speech/music/cheers or the like.
There are four steps in a typical automatic audio classification system.
In the first step, pre-processing is performed on the audio bitstream to convert all the audio clips in the audio bitstream to the same stream format.
The second step usually involves feature extraction, where various audio features are extracted in the time domain, spectral domain or cepstral domain. Typical audio features include Mel Frequency Cepstral Coefficients (MFCC), spectrum centroid, zero crossing rate or the like. Most classification systems rely on multiple features whose strengths complement one another. The audio features may be combined to form a feature vector that can represent the content of a segment in an audio clip. The basic features can further undergo statistical calculation, either to refine them or to derive additional features.
In the third step, the feature vectors are automatically classified by a set of trained classifiers.
Lastly, the fourth step, post-processing, further improves the classification results.
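The feature extraction of the second step may be illustrated by the following minimal sketch, which assumes decoded PCM samples held in a NumPy array. The function names, the frame length of 1024 samples, the Hann window and the choice of mean and standard deviation as the statistical calculation are assumptions made for illustration only and do not come from any particular classification system.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(frame)
    return float(np.mean(signs[1:] != signs[:-1]))

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of the frame's FFT spectrum."""
    magnitude = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    total = magnitude.sum()
    # A silent frame has no spectral energy; return 0.0 by convention.
    return float((freqs * magnitude).sum() / total) if total > 0 else 0.0

def segment_feature_vector(segment, sample_rate, frame_len=1024):
    """Combine per-frame features into one vector for a segment.

    Layout (an illustrative choice): [mean ZCR, mean centroid,
    std ZCR, std centroid].
    """
    n_frames = len(segment) // frame_len
    if n_frames == 0:
        raise ValueError("segment is shorter than one frame")
    # Non-overlapping frames, one per row; trailing samples are dropped.
    frames = segment[: n_frames * frame_len].reshape(n_frames, frame_len)
    per_frame = np.array([
        [zero_crossing_rate(f), spectral_centroid(f, sample_rate)]
        for f in frames
    ])
    return np.concatenate([per_frame.mean(axis=0), per_frame.std(axis=0)])
```

For a pure tone, the centroid entry of the resulting vector lies close to the tone frequency, while a noisy segment would typically yield a higher zero crossing rate. A vector of this kind is what the trained classifiers of the third step would receive.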
In conventional feature extraction, if audio content in the audio bitstream is in compressed or encoded format, the audio bitstream will usually be fully decoded into audible time domain Pulse Code Modulation (PCM) data before the extraction of audio features from the audio bitstream. However, the full decoding process usually involves the conversion of frequency domain coefficients into the time domain, which is time consuming and results in unnecessary computations.
For example, for an AAC (Advanced Audio Coding) encoded bitstream, Inverse Modified Discrete Cosine Transform (IMDCT) and windowing have to be performed for each frame of the bitstream to transform the frequency domain coefficients, for instance MDCT spectral coefficients, into the time domain. At the last step of the conversion, neighbouring frames must be overlapped and added to restore the time domain PCM data. However, in some cases, audio features are actually calculated from frequency domain coefficients and not from time domain PCM data. Hence, the time domain PCM data has to be divided into frames again and, for each frame, windowing and Fast Fourier Transform (FFT) have to be applied to transform the frame back to the frequency domain.
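The re-framing, windowing and FFT applied to the decoded PCM data may be sketched as follows. This is an illustrative sketch only: the frame length of 1024 samples, the hop of 512 samples, the Hann window and the function names are assumptions, and the AAC decoding that would precede this stage is not shown.

```python
import numpy as np

def frame_signal(pcm, frame_len=1024, hop=512):
    """Divide decoded PCM samples into overlapping frames, one per row."""
    if len(pcm) < frame_len:
        raise ValueError("PCM data is shorter than one frame")
    n_frames = 1 + (len(pcm) - frame_len) // hop
    # Row i holds samples [i * hop, i * hop + frame_len).
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return pcm[idx]

def fft_magnitudes(pcm, frame_len=1024, hop=512):
    """Window each frame and return its FFT magnitude spectrum."""
    frames = frame_signal(pcm, frame_len, hop)
    window = np.hanning(frame_len)
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames * window, axis=1))
```

Every frame processed here has already passed through IMDCT, windowing and overlap-add in the decoder, so the frequency domain representation is in effect computed a second time.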
In order to avoid the unnecessary computations, it has been suggested to use MDCT spectral coefficients, which are intermediate outputs of the decoding process, instead of FFT spectral coefficients for classification. However, currently useful audio features such as MFCC and spectrum centroid are derived from FFT spectral coefficients and not from MDCT spectral coefficients, which may be because of problems in deriving audio features from MDCT spectral coefficients. Encoded bitstreams, for example an AAC bitstream, typically consist of both long window blocks and short window blocks. A long window may e.g. be represented by 1024 MDCT spectral coefficients while a short window may be represented by 128 MDCT spectral coefficients. Long window blocks achieve high frequency resolution at the expense of time resolution, while short window blocks achieve high time resolution at the expense of frequency resolution. Although the AAC codec can benefit from this long/short window switching strategy to achieve an optimal balance between frequency resolution and time resolution, the varying number of MDCT spectral coefficients per block makes it difficult to interpret all the blocks consistently. Thus, derivation of audio features from MDCT spectral coefficients becomes difficult.
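The difficulty caused by the differing block dimensions may be illustrated numerically. The sketch below assumes a sampling rate of 48000 Hz, which is not stated above, and uses hypothetical helper names. It relies on the standard property that the k-th of N MDCT coefficients is centred at (k + 0.5) * fs / (2N).

```python
def mdct_bin_spacing_hz(n_coeffs, sample_rate):
    """Frequency spacing between adjacent MDCT coefficients of a block."""
    return sample_rate / (2.0 * n_coeffs)

def mdct_bin_center_hz(k, n_coeffs, sample_rate):
    """Centre frequency of the k-th MDCT coefficient (k counted from 0)."""
    return (k + 0.5) * sample_rate / (2.0 * n_coeffs)

SAMPLE_RATE = 48000   # assumed for illustration only
LONG_COEFFS = 1024    # long window block, as described above
SHORT_COEFFS = 128    # short window block, as described above

# The same coefficient index refers to very different frequencies
# depending on the block type.
long_center = mdct_bin_center_hz(100, LONG_COEFFS, SAMPLE_RATE)
short_center = mdct_bin_center_hz(100, SHORT_COEFFS, SAMPLE_RATE)
```

Under these assumptions, adjacent coefficients are 23.4375 Hz apart in a long window block and 187.5 Hz apart in a short window block, so a feature computed by indexing coefficients directly would not be comparable across the two block types.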
A need therefore exists to provide a method and system for extracting audio features from an encoded bitstream for audio classification that addresses at least one of the above-mentioned problems.