1. Field of the Invention
This invention relates generally to audio files that have been processed using compression algorithms, and, more particularly, to a technique for the automatic classification of the compressed audio file contents.
2. Background of the Invention
With advances in auditory masking theory, quantization techniques, and data compression techniques, lossy compression of audio files has become the processing method of choice for the storage and streaming of audio. Compression schemes with various degrees of complexity, compression ratios, and quality have evolved. The availability of these compression schemes has both driven and been driven by the internet and by portable audio devices. Several large databases of compressed audio music files exist on the internet (e.g., from online stores). On a smaller scale, compressed audio music files are present on computers and portable devices around the globe. While classification schemes exist for MIDI music files and speech files, few schemes address the problem of identification and retrieval of audio content from compressed music database files. One attempt at classification of compressed audio files is the MPEG-7 standard. This standard is directed to providing a set of low level and high level descriptors that can facilitate content indexing and retrieval.
Referring to FIG. 1, a generalized block diagram of apparatus 10 for performing audio file compression is shown. The raw audio data file is applied to the time domain/frequency domain transformation unit 11 and to the psycho-acoustic model unit 12. The psycho-acoustic model unit 12 provides the mechanism for processing the raw data that incorporates assumptions regarding how audio input is perceived by human beings. Output signals from the psycho-acoustic model unit 12 are applied to the time domain/frequency domain transformation unit 11 and to a quantization unit 15. Output signals from the time domain/frequency domain transformation unit 11 are also applied to the quantization unit 15. The output signals of the quantization unit 15 are the compressed audio files. The time domain/frequency domain transformation unit 11 transforms the raw data file in the time domain into a data file in the frequency domain. The frequency domain data are quantized in the quantization unit 15 based on masking information provided by the psycho-acoustic model unit 12. The psycho-acoustic model unit 12 also determines the resolution of the time domain/frequency domain transformation unit 11 depending on the characteristics of the input signals. As a result of the apparatus shown in FIG. 1, an audio file receives two levels of compression. The first level of compression results from the selective retention of only the perceptually important audio file components, as determined by the psycho-acoustic model. The second level of compression is a lossless compression of the file resulting from the psycho-acoustic processing, further shrinking the file to reduce the amount of storage space required; this second level typically includes Huffman coding.
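The signal flow of units 11, 12, and 15 can be sketched in simplified form as follows. This is an illustrative sketch only, not the actual MP3 or AAC algorithm: the naive DFT stands in for the real filter bank, and the masking fraction and quantization step are assumed values chosen for clarity.

```python
import math

def dft_magnitudes(frame):
    """Naive DFT magnitude spectrum: a stand-in for the filter bank of
    transformation unit 11 (real encoders use an MDCT/polyphase bank)."""
    N = len(frame)
    mags = []
    for k in range(N // 2):
        re = sum(frame[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
        im = -sum(frame[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
        mags.append(math.hypot(re, im) / N)
    return mags

def masking_threshold(mags):
    """Toy psycho-acoustic model (unit 12): components below a fraction
    of the frame's strongest component are assumed masked.  The 0.05
    fraction is an assumption; real models are far more elaborate."""
    peak = max(mags) if mags else 0.0
    return 0.05 * peak

def quantize(mags, threshold, step=0.1):
    """Quantization unit 15: drop masked components and coarsely quantize
    the rest.  This is the first (lossy) level of compression."""
    return [round(m / step) if m >= threshold else 0 for m in mags]

def compress_frame(frame):
    mags = dft_magnitudes(frame)
    q = quantize(mags, masking_threshold(mags))
    # A real encoder would then Huffman-code these integers (the second,
    # lossless level of compression described above).
    return q
```

The sparse integer output (mostly zeros after masking) is exactly what makes the second, entropy-coding level effective.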
In the past, the centroid and energy levels of the frequency domain data of MPEG (Moving Picture Experts Group) encoded files, together with nearest neighbor classifiers, have been used as descriptors. This system has been further enhanced with a framework for the discrimination of compressed audio files based on semi-automatic methods, including the ability of the user to add further audio features. In addition, a classification of MPEG1 audio and television broadcasts using class-based segmentation (i.e., silence, speech, music, applause) has been proposed. A similar proposal compares GMM (Gaussian Mixture Model) and tree-based VQ (Vector Quantization) descriptors for classifying MPEG encoded data.
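A minimal sketch of the centroid-and-energy descriptor with a nearest neighbor classifier follows. The descriptor definitions and the Euclidean distance are assumptions chosen for illustration; the related art systems referred to above differ in their exact formulations.

```python
import math

def descriptor(subband_mags):
    """Spectral centroid and total energy computed from frequency-domain
    sub-band magnitudes (the two descriptors discussed above)."""
    energy = sum(m * m for m in subband_mags)
    if energy == 0:
        return (0.0, 0.0)
    centroid = sum(i * m * m for i, m in enumerate(subband_mags)) / energy
    return (centroid, energy)

def nearest_neighbor(query_mags, labeled):
    """1-NN classifier: `labeled` is a list of (descriptor, label) pairs;
    the query is assigned the label of the closest stored descriptor."""
    q = descriptor(query_mags)
    def dist(d):
        return math.hypot(q[0] - d[0], q[1] - d[1])
    return min(labeled, key=lambda item: dist(item[0]))[1]
```

For example, a frame whose energy is concentrated in the low sub-bands yields a low centroid and is assigned to whichever stored class has the nearest (centroid, energy) pair.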
The data in the compressed audio files are in the form of frequency magnitudes. The entire range of frequencies audible to the human ear is divided into sub-bands; thus, the data in the compressed file are divided into sub-bands. Specifically, in the MP3 format, the data are divided into 32 sub-bands. (In addition, in this format, each sub-band can be further divided into 18 frequency bands referred to as split sub-bands.) Each sub-band can be treated according to its masking capabilities. (Masking capability is the ability of a particular frame of audio data to mask the audio noise resulting from compression of the data. For example, instead of encoding a signal with 16 bits, 8 bits can be used, albeit with additional noise as a result.) Audio compression algorithms also provide flags for the detection of attacks in a music piece. Because an energy calculation is already performed in the encoder, the flagging of attacks can be used as an indication of rhythm, e.g., drum beats. Drum beats form the musical background in most titles in music databases, and most audiences tend to identify the characteristics of drum beats as rhythm. Because rhythm plays an important role in identifying any piece of music, the behavior of compression algorithms in flagging attacks is important. In present encoders, including MP3 encoders, pre-echo conditions (i.e., conditions resulting from analyzing the audio in fixed blocks rather than as a long stream) are handled by switching to a shorter window than would otherwise be used. In some encoders, such as ATRAC (Adaptive Transform Acoustic Coding), pre-echo is handled by gain control in the time domain. In AAC (Advanced Audio Coding) encoders, both methods are used. Referring to FIG. 2, the attack flags in a piece of music with a periodic drum beat are illustrated. In FIG. 3, the attack flags are illustrated for music pieces with the human voice but no drum beat and for music pieces, such as a violin concerto, without drum beats in the background.
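The energy-based attack flagging described above can be sketched as follows. The ratio threshold of 2.5 is an assumed value for illustration; actual encoders use more elaborate pre-echo detection criteria.

```python
def attack_flags(frames, ratio=2.5):
    """Flag frames whose energy jumps sharply relative to the previous
    frame, analogous to the attack detection that triggers an encoder's
    switch to short windows.  `ratio` is an assumed threshold, not a
    value taken from any standard.  `frames` is a list of sample lists."""
    energies = [sum(s * s for s in f) for f in frames]
    flags = [False]  # the first frame has no predecessor to compare against
    for prev, cur in zip(energies, energies[1:]):
        flags.append(cur > ratio * max(prev, 1e-12))
    return flags
```

A periodic pattern in these flags (as in FIG. 2) suggests a drum beat, whereas largely flag-free pieces correspond to the voice-only and concerto examples of FIG. 3.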
Referring to FIG. 4, an example of sub-band data from the frequency domain is illustrated. This sample is taken from an MP3 file encoded at 44 kHz, 128 kbps.
The techniques implemented and proposed for classifying compressed audio files in the related art have a variety of shortcomings. The computational complexity of most of these schemes is high; therefore, they may be applicable only to music file servers and not to generic internet applications. The schemes typically are not directly applicable to compressed audio files. In addition, most of the schemes decode the compressed data back to the time domain and apply techniques that have been proven in the time domain. Thus, these schemes do not take advantage of the features and parameters already available in the compressed files. In the schemes that do make use of data in the compressed format, the frequency data alone are used, and not the information available as side-information descriptors. The use of side-information descriptors would eliminate a large amount of computation.
A need has therefore been felt for an apparatus and an associated method by which compressed audio files can be identified and classified. It would be a further feature of the apparatus and associated method to provide for the classification and identification of compressed audio files in a relatively short period of time. It would be a still further feature of the apparatus and associated method to provide for the classification and identification of compressed audio files at least partially using parameters generated as a result of compressing the audio file. It would be a still further feature of the apparatus and associated method to generate parameters describing a compressed audio file. It would be a more particular feature of the apparatus and associated method to compare a compressed reference audio file with at least one other compressed audio file. It would be yet another particular feature of the present invention to compare parameters generated from a first compressed audio file with parameters generated from a second compressed audio file.