Most existing music retrieval methods analyze a spectrogram and may be classified into two types: methods based on extreme points and methods based on texture analysis.
In music retrieval methods based on texture analysis, a music clip is first transformed using a short-time Fourier transform to generate a spectrogram, and the spectrogram is divided into 32 sub-bands. The gradient polarity between adjacent sub-bands is then calculated. In this way, the original signal is compressed into a compact binary encoding, and a hash table is used to accelerate retrieval. However, music retrieval methods based on texture analysis lack robustness to block noise, have relatively high computational complexity, and have long retrieval times. A method is needed that is more robust to block noise and faster in retrieval, such as a method based on extreme points.
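As a rough illustration of the texture-analysis encoding described above, the following minimal sketch turns per-frame sub-band energies into binary codes and indexes them in a hash table. It assumes the spectrogram has already been computed and reduced to sub-band energies, and it assumes that "gradient polarity" means the sign of the energy difference between adjacent sub-bands; the exact definition used by any particular method may differ. The function names `gradient_polarity_codes` and `build_code_index` are hypothetical.

```python
from typing import Dict, List


def gradient_polarity_codes(band_energies: List[List[float]]) -> List[int]:
    """Encode each frame as an integer, one bit per pair of adjacent sub-bands.

    Assumption: a bit is 1 when energy rises from one sub-band to the next,
    and 0 otherwise. N sub-bands therefore give N - 1 bits per frame.
    """
    codes = []
    for frame in band_energies:
        code = 0
        for lower, upper in zip(frame, frame[1:]):
            code = (code << 1) | (1 if upper > lower else 0)
        codes.append(code)
    return codes


def build_code_index(codes: List[int]) -> Dict[int, List[int]]:
    """Map each binary code to the frame positions where it occurs."""
    index: Dict[int, List[int]] = {}
    for position, code in enumerate(codes):
        index.setdefault(code, []).append(position)
    return index


# Toy input: two frames with four sub-bands each (a real system would use 32).
frames = [
    [1.0, 3.0, 2.0, 5.0],  # rise, fall, rise -> bits 1 0 1
    [4.0, 1.0, 1.5, 0.5],  # fall, rise, fall -> bits 0 1 0
]
codes = gradient_polarity_codes(frames)
index = build_code_index(codes)
```

During retrieval, the codes of a query clip would be looked up in such an index to obtain candidate positions without scanning the whole library.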
In methods based on extreme points, a music clip is first transformed using a short-time Fourier transform to generate a spectrogram, and maximum value points (extreme points) in the spectrogram are detected. A hash table is generated according to the frequency and time differences between adjacent extreme point pairs. During retrieval, matching points between a query music clip and the music library are found using the hash table. Next, an offset and a degree of confidence are estimated for each music clip in the library according to the time coordinates of the matching points. The music clip with the highest degree of confidence, or with a degree of confidence beyond a threshold, is retrieved. However, in these methods, detection of an extreme point is relatively sensitive to random noise and “salt-and-pepper” noise, which can easily cause the extreme point to shift in the frequency and time directions. Even a slight shift of an extreme point may completely change the hash value, which may cause a match to be missed and may greatly reduce the accuracy of the audio information retrieval.
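The retrieval steps above can be sketched as follows. This is a simplified illustration rather than any specific published implementation. It assumes the extreme points have already been detected and are given as `(time, frequency)` pairs, that each hash key is formed from the two frequencies and the time difference of an adjacent pair, and that the degree of confidence is simply the number of matches agreeing on one offset. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def peak_pair_hashes(peaks):
    """Return (key, anchor_time) for each pair of time-adjacent peaks.

    Assumed key layout: (first frequency, second frequency, time difference).
    """
    ordered = sorted(peaks)
    return [((f1, f2, t2 - t1), t1)
            for (t1, f1), (t2, f2) in zip(ordered, ordered[1:])]


def build_library_index(library):
    """Hash table: key -> list of (clip_id, anchor_time) in the library."""
    index = defaultdict(list)
    for clip_id, peaks in library.items():
        for key, anchor_time in peak_pair_hashes(peaks):
            index[key].append((clip_id, anchor_time))
    return index


def retrieve(query_peaks, index):
    """Vote on (clip_id, offset); return the best (clip_id, offset, confidence)."""
    votes = Counter()
    for key, query_time in peak_pair_hashes(query_peaks):
        for clip_id, library_time in index.get(key, ()):
            votes[(clip_id, library_time - query_time)] += 1
    if not votes:
        return None
    (clip_id, offset), confidence = votes.most_common(1)[0]
    return clip_id, offset, confidence


# Toy library of two clips, each a list of (time, frequency) extreme points.
library = {
    "a": [(0, 10), (2, 14), (5, 11), (9, 20)],
    "b": [(0, 7), (3, 9), (4, 15), (8, 12)],
}
index = build_library_index(library)

# Query: the part of clip "a" starting at time 2, re-timed to start at 0.
clean_query = [(0, 14), (3, 11), (7, 20)]
result = retrieve(clean_query, index)

# Same query with one peak shifted by a single frequency bin.
noisy_query = [(0, 14), (3, 12), (7, 20)]
noisy_result = retrieve(noisy_query, index)
```

In this toy example, shifting a single peak by one frequency bin changes both hash keys that involve it, so the noisy query finds no match at all. This is the sensitivity to small offsets of extreme points described above.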