With the growth of the Internet and digital computing devices, digital audio data such as digital music is now widely used. Thousands of audio files have been recorded and transmitted through the digital world. As a result, a user who wishes to find a particular file among a large number of audio files will have great difficulty doing so simply by listening. There is therefore a great demand for an automatic audio recognition (AAR) system that can recognize audio data automatically. Such a system should be able to recognize an audio file from a short segment of that file recorded in a noisy environment. A typical application of an AAR system is automatic music identification, in which a recorded music segment or an existing digital music segment is recognized for further use.
Some systems already exist in the prior art that can analyze and recognize audio data based on the audio features of the data. An example of such a system is disclosed in U.S. Pat. No. 5,918,223, entitled “Method and article of manufacture for content-based analysis, storage, retrieval and segmentation of audio information”, to Thomas L. Blum et al. This system depends mainly on extracting many audio features of the audio data, such as amplitude, peak, pitch, brightness, bandwidth, and MFCC (mel frequency cepstrum coefficients). These audio features are extracted from the audio data frame by frame. A decision tree is then used to classify and recognize the audio data.
One problem with such a system is that it requires the extraction of many features, such as amplitude, peak, pitch, brightness, bandwidth, MFCC and their first derivatives, from the selected audio data, which is a complex and time-consuming calculation. For example, the main purpose of the MFCC is to mimic the function of the human ear. The process of deriving MFCC can be divided into the six steps shown in FIG. 4(a): 1) Pre-emphasis, in which the audio signal is processed to improve its signal-to-noise ratio. 2) Windowing, in which the continuous audio data is blocked into 25-ms frames, with adjacent frames overlapping by 10 ms, and each individual frame is then processed using a Hamming window so as to minimize the signal discontinuities at the edges of the frame. 3) FFT (Fast Fourier Transform), in which each frame of the audio data is converted from the time domain into the frequency domain. 4) Mel scale filter bank, in which a Mel scale is used to convert the spectrum of the signal to a Mel-warped spectrum. This is done without significant loss of data by passing the Fourier-transformed signal through a set of band-pass filters. The filter bank has a triangular band-pass frequency response, which is non-uniform in the frequency domain but uniformly distributed in the Mel-warped spectrum. 5) Logarithm, in which the logarithm of each of the Mel spectrum coefficients is taken to reduce the coefficients whose frequencies are above 1000 Hz and magnify those with low frequencies. 6) DCT, in which the logarithmic Mel spectrum coefficients are converted back to the time domain by using a discrete cosine transform (DCT) to provide the Mel frequency cepstrum coefficients (MFCC).
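The six steps above can be sketched in a few dozen lines, which also makes the amount of per-frame computation visible. The following is only an illustrative sketch under stated assumptions, not the implementation of the cited system: the 8 kHz sample rate, the pre-emphasis coefficient 0.97, the 20 filters, the 12 output coefficients, the common 2595·log10(1 + f/700) Mel formula, and all function names are assumptions chosen for the example.

```python
import cmath
import math

def pre_emphasis(signal, alpha=0.97):
    # Step 1: first-order high-pass filter y[n] = x[n] - alpha*x[n-1] (alpha is assumed).
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]

def frames_with_hamming(signal, frame_len, hop):
    # Step 2: block the signal into overlapping frames and apply a Hamming window.
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]
    return [[signal[start + n] * window[n] for n in range(frame_len)]
            for start in range(0, len(signal) - frame_len + 1, hop)]

def fft(x):
    # Step 3: recursive radix-2 FFT; len(x) must be a power of two.
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    tw = [cmath.exp(-2j * math.pi * k / n) * odd[k] for k in range(n // 2)]
    return [even[k] + tw[k] for k in range(n // 2)] + [even[k] - tw[k] for k in range(n // 2)]

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    # Step 4: triangular filters spaced uniformly on the Mel scale.
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in edges_hz]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):
            filt[k] = (k - lo) / (mid - lo)   # rising edge
        for k in range(mid, hi):
            filt[k] = (hi - k) / (hi - mid)   # falling edge
        bank.append(filt)
    return bank

def mfcc(signal, sample_rate=8000, n_filters=20, n_coeffs=12):
    frame_len = sample_rate * 25 // 1000      # 25-ms frames
    hop = sample_rate * 15 // 1000            # 10-ms overlap gives a 15-ms hop
    n_fft = 1
    while n_fft < frame_len:                  # zero-pad each frame to a power of two
        n_fft *= 2
    bank = mel_filter_bank(n_filters, n_fft, sample_rate)
    features = []
    for frame in frames_with_hamming(pre_emphasis(signal), frame_len, hop):
        padded = frame + [0.0] * (n_fft - frame_len)
        power = [abs(c) ** 2 for c in fft(padded)[: n_fft // 2 + 1]]
        # Step 5: logarithm of each Mel band energy (small floor avoids log(0)).
        log_mel = [math.log(sum(w * p for w, p in zip(filt, power)) + 1e-10) for filt in bank]
        # Step 6: DCT-II of the log Mel energies gives the cepstral coefficients.
        features.append([sum(log_mel[j] * math.cos(math.pi * i * (j + 0.5) / n_filters)
                             for j in range(n_filters)) for i in range(n_coeffs)])
    return features

# Usage: one second of a 440 Hz tone sampled at the assumed 8 kHz rate.
tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
features = mfcc(tone)
```

Each frame passes through a window, an FFT, a bank of filters, a logarithm and a DCT, and this is repeated for every frame and for every additional feature, which is the computational cost referred to above.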
Another problem associated with such a system is the effect of noise in the audio data. The extracted audio features are very sensitive to noise. In particular, MFCC features are very sensitive to white Gaussian noise, a wide-band signal that has equal energy at all frequencies. Since the Mel scale filters have wide passbands at high frequencies, the MFCC results at high frequencies have a low SNR. This effect is amplified by step 5, the logarithm operation. Then, after step 6 (i.e., the DCT operation), the MFCC features are influenced over the whole of the time domain. White Gaussian noise always exists in the circuits of the AAR system. Also, when microphones record audio data, white Gaussian noise is added to the audio data. Furthermore, in a real situation, there is also a lot of environmental noise.
All of these noise sources make it hard for the AAR system to process the recorded data.
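The relationship between filter width and SNR can be illustrated numerically. The sketch below is a simplified model rather than a measurement: it assumes a flat noise power density, unit-height triangular filters, an equal signal power of 1.0 in every band, a 4 kHz upper frequency, and the common 2595·log10(1 + f/700) Mel formula. All names and values are hypothetical and chosen only for the example.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)

def mel_band_edges(n_filters, f_max):
    # Band edges in Hz, spaced uniformly on the Mel scale.
    top = hz_to_mel(f_max)
    return [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]

def noise_power_per_band(n_filters, f_max, noise_density):
    # White noise has a flat power density, so the noise power collected by a
    # unit-height triangular filter is density * area = density * base_width / 2.
    edges = mel_band_edges(n_filters, f_max)
    return [noise_density * (edges[i + 2] - edges[i]) / 2.0 for i in range(n_filters)]

def band_snr_db(signal_power, noise_powers):
    # SNR per band in dB, assuming the same signal power in every band.
    return [10.0 * math.log10(signal_power / p) for p in noise_powers]

noise = noise_power_per_band(n_filters=20, f_max=4000.0, noise_density=1e-4)
snr = band_snr_db(signal_power=1.0, noise_powers=noise)
for index, (p, s) in enumerate(zip(noise, snr)):
    print(f"band {index:2d}: noise power {p:.5f}, SNR {s:.1f} dB")
```

Under these assumptions the collected noise power grows with the band index, so the per-band SNR falls steadily toward the high-frequency filters.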
A further problem with the known system is that it requires a large part of the audio data file to achieve high recognition accuracy. In real situations, however, it takes a long time to record such a large part of the audio file and to extract the required features from it, which makes real-time recognition difficult to achieve.
The concept of audio recognition is frequently used in the areas of speech recognition and speaker identification. Both are implemented by comparing speech sounds, so research in these areas has focused on the extraction of speech sound features. Because the audio recognition task is quite different when the audio data is not speech, a more general approach that can compare all sorts of sounds is required. The audio features used in a speech recognition system are normally MFCC or linear predictive coding (LPC) features. Also, when a speech recognition system is trained using audio training data, the training data is collected using a microphone and therefore already contains white Gaussian noise. Thus, adaptive learning of the training data overcomes the effect of the white Gaussian noise. However, in the context of an AAR system for recognizing music files, the training data is digital data having a much lower level of white Gaussian noise than the audio data to be recognized, so the effect of the white Gaussian noise cannot be ignored.