1. Field of the Invention
The invention relates generally to the fields of computer science, machine learning, artificial intelligence, and digital signal processing, and more specifically, to a combination of machine learning and digital signal processing that allows a computer to learn from an audio bitstream, a technique also known as “machine listening”. The machine listening technique allows the computer to extract artist and/or genre information from a piece of music that it may not have previously encountered, but whose characteristics it has gleaned by exposure to other pieces by the same artist and/or of the same genre.
2. Description of the Prior Art
Song recognition systems can be a boon to copyright holders who need to check whether their work is being distributed illegally or without authorization. The present downside of such systems is that they must include a representation of each song owned, which for large record labels and song collections can be all but impossible. Queries against so large a search space could take too much processing time. Further, new songs would have to be added manually.
A system that could identify the artist of an unknown piece of music, instead of identifying a particular song, would solve these problems, since the copyright holder need only provide a few representative samples of each artist. The system will ‘infer’ the set of features that makes this artist unique and classifiable. Subsequent classifications operate on completely unknown music and match it very quickly against a previously learned feature space.
There has been a recent spate of interest in music retrieval in both the frequency (e.g., recorded or digital audio) and score (e.g., notes, transcriptions) domains. Most music retrieval research efforts deal with classification of instrument type, and in some cases genre.
Foote gives an overview of audio information retrieval from the starting point of speech recognition and retrieval, referring to the music domain as “a large and extremely variable audio class.” See, “An Overview of Audio Information Retrieval,” ACM-Springer Multimedia Systems, 1998.
In addition, Herrera, et al., give an overview of the various classification techniques used on digital audio. See, “Towards Instrument Segmentation for Music Content Description: a Critical Review of Instrument Classification Techniques,” Proceedings of ISMIR 2000, Plymouth, Mass. Herrera, et al., discuss such methods as K-nearest neighbor (K-NN), naive Bayesian classifiers, discriminant analysis, neural networks and support vector machines (SVM).
Tzanetakis, et al., discuss their audio retrieval system MARSYAS, which operates on various representations of audio to predict genre and to distinguish music from speech. See, “Audio Information Retrieval (AIR) Tools,” Proceedings of ISMIR 2000, Plymouth, Mass. The classification techniques used include Gaussian Mixture Models and K-nearest neighbor algorithms.
Martin, et al., also use K-NN for instrument identification, with an overall identification success rate of 70%. See, “Musical Instrument Identification: A Pattern-Recognition Approach,” Presented at the 136th Meeting of the Acoustical Society of America, 1998.
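The K-nearest neighbor technique named in several of the references above can be illustrated with a minimal sketch. The function name knn_classify, the two-dimensional feature vectors, and the instrument labels are hypothetical and chosen only for illustration; they are not taken from Martin, et al., or from any other cited system.

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Label a query vector by majority vote of its k nearest training examples.

    train is a list of (feature_vector, label) pairs; query is a feature vector.
    """
    # Rank training examples by Euclidean distance to the query and keep k.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the labels of the k closest examples.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors for two instrument classes (illustrative only).
train = [
    ((0.10, 0.20), "flute"),
    ((0.20, 0.10), "flute"),
    ((0.15, 0.25), "flute"),
    ((0.90, 0.80), "trumpet"),
    ((0.80, 0.90), "trumpet"),
    ((0.85, 0.75), "trumpet"),
]

print(knn_classify(train, (0.12, 0.18)))
```

K-NN requires no training step beyond storing the labeled examples, but every query is compared against the whole stored set, so query cost grows with the size of that set.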
Foote discusses using dynamic programming to retrieve orchestral music by similarity, starting with an energy representation composed of the peak RMS value of each one-second slice of music. See, “ARTHUR: Retrieving Orchestral Music by Long-Term Structure,” Proceedings of ISMIR 2000, Plymouth, Mass. He then moves on to a purely spectral feature space. The dynamic programming methods used to retrieve similar music (different performances of the same piece, for example) proved adequate for the small corpus used. The treatment of music data as lexemes for search and classification is also discussed, at the score level, in Pickens. See, “A Comparison of Language Modeling and Probabilistic Text Information Retrieval Approaches to Monophonic Music Retrieval,” Proceedings of ISMIR 2000, Plymouth, Mass.
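An energy representation of the kind attributed to Foote above can be sketched as follows. The reference is summarized here only as "the peak RMS value of each one-second slice", so this sketch assumes one possible reading: RMS is computed over short sub-windows inside each one-second slice and the largest value is kept. The function name, the sub-window length, and the toy signal are all assumptions for illustration.

```python
import math

def peak_rms_per_second(samples, sample_rate, window=None):
    """Return one peak-RMS value per one-second slice of a mono signal.

    Assumption: "peak RMS" means the maximum RMS over short sub-windows
    (default one tenth of a second) within each one-second slice.
    """
    window = window or max(1, sample_rate // 10)
    energies = []
    for start in range(0, len(samples), sample_rate):
        one_second = samples[start:start + sample_rate]
        peak = 0.0
        for offset in range(0, len(one_second), window):
            chunk = one_second[offset:offset + window]
            # Root-mean-square amplitude of this sub-window.
            rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
            peak = max(peak, rms)
        energies.append(peak)
    return energies

# Toy signal at a hypothetical 100 Hz sample rate: one second of silence,
# then half a second at full amplitude followed by half a second of silence.
signal = [0.0] * 100 + [1.0] * 50 + [0.0] * 50
print(peak_rms_per_second(signal, sample_rate=100))
```

The result is a short sequence with one number per second of audio, which is the sort of compact long-term profile that a dynamic programming alignment could then compare between recordings.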
Logan describes the use of Mel Frequency Cepstral Coefficients (MFCCs) for the task of music modeling. See, “Mel Frequency Cepstral Coefficients for Music Modeling,” Proceedings of ISMIR 2000, Plymouth, Mass. The MFCC is presented as a smarter Fast Fourier Transform (FFT), in that it is scaled to a more psycho-acoustically sound frequency scale, and also has a built-in discrete cosine transform (DCT) step to approximate principal component analysis for de-correlation. Her experiments in music/not music separation show promise for the MFCC in high-level music retrieval.
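The two MFCC ingredients mentioned above, a perceptually motivated frequency scale and a DCT step, can be sketched in isolation. This is a simplified illustration, not Logan's implementation: it assumes the widely used mel formula 2595·log10(1 + f/700), takes per-band energies as given rather than computing them from an FFT through triangular filters, and uses an unnormalized DCT-II. All function names are hypothetical.

```python
import math

def hz_to_mel(hz):
    """Convert frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_centers(n_bands, f_min, f_max):
    """Band center frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_bands)]

def dct_ii(values):
    """Unnormalized type-II discrete cosine transform."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]

def cepstrum_from_band_energies(band_energies, n_coeffs):
    """Log-compress the band energies, then keep the first DCT coefficients."""
    log_energies = [math.log(e + 1e-10) for e in band_energies]
    return dct_ii(log_energies)[:n_coeffs]

# Centers spread out toward high frequencies, mirroring coarser pitch resolution there.
print([round(c) for c in mel_band_centers(5, 0.0, 8000.0)])
# A flat spectrum carries no shape information, so only coefficient 0 is non-zero.
print(cepstrum_from_band_energies([2.0] * 8, 4))
```

Equal steps on the mel scale map to progressively wider steps in Hz, which is the "frequency growth" the text refers to, and the DCT concentrates the overall spectral shape into the first few coefficients.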
There are a few commercially oriented music recommendation and copyright protection systems that operate on the spectral features of music. Examples include Moodlogic (at www.Moodlogic.com), whose information is limited to select features of music, and the song-recognition components used by Relatable Technologies (www.relatable.com), which cannot infer artist information from their feature space.