With the advent of MP3 and other efficient compression algorithms, the way people store and access music is changing. It is now possible to carry hundreds of hours of music in a small portable device, raising new user interface (UI) issues of how to most effectively present these songs to the user. On a home appliance or through the web, the problem is compounded since users could potentially have access to thousands or millions of hours of music.
The efficiency of these compression algorithms means that it is now feasible for radio stations to broadcast tailored content to small groups of users. Yet tailoring this content by hand as is done today for traditional radio stations is clearly infeasible. Moreover, web-based music distribution could benefit enormously by being able to automatically recommend new songs which are similar (or dissimilar) to a user's choice. Currently, this is done by hand or based on “collaborative filtering” techniques which require a large amount of data collection.
The traditional and most reliable technique of determining music similarity is by hand. Collaborative filtering techniques extend this approach: they attempt to produce personal recommendations by computing the similarity between one person's preferences and those of (usually many) other people. A number of companies have systems that rely on collaborative filtering; for example, www.launch.com allows users to set up personal radio stations where they rate the songs that they hear.
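As an illustration only, the collaborative filtering idea described above can be sketched in a few lines of Python. The cosine similarity measure, the weighting scheme, and the names `cosine_similarity` and `recommend` are assumptions chosen for this sketch; they do not describe how any particular company's system works.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two users, computed over songs both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[s] * b[s] for s in shared)
    norm_a = sqrt(sum(a[s] ** 2 for s in shared))
    norm_b = sqrt(sum(b[s] ** 2 for s in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(target, ratings, top_n=3):
    """Rank songs the target user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for user, their_ratings in ratings.items():
        if user == target:
            continue
        sim = cosine_similarity(ratings[target], their_ratings)
        if sim <= 0:
            continue
        for song, rating in their_ratings.items():
            if song in ratings[target]:
                continue  # only recommend songs the target has not rated
            scores[song] = scores.get(song, 0.0) + sim * rating
            weights[song] = weights.get(song, 0.0) + sim
    predicted = {song: scores[song] / weights[song] for song in scores}
    return sorted(predicted, key=predicted.get, reverse=True)[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "ann": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 5, "d": 1},
    "cat": {"a": 1, "b": 5, "c": 1, "d": 5},
}
print(recommend("ann", ratings))
```

The sketch also shows why such systems need a large amount of collected data: a song can only be recommended once other users with overlapping tastes have rated it.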
Many researchers have looked at the music indexing problem by analyzing Musical Instrument Digital Interface (MIDI) music data or musical scores, or by using crude pitch-tracking to find a “melody contour” for each piece of music. Similar songs are expected to have similar melody contours and can be found using string matching techniques, although the problem is still not trivial (e.g., Blackburn, S. and De Roure, D., “A Tool for Content Based Navigation of Music,” ACM Multimedia 1998; McNab, R., Smith, L., Witten, I., Henderson, C. and Cunningham, S., “Towards the Digital Music Library: Tune Retrieval From Acoustic Input,” in Proceedings Digital Libraries '96, pp. 11–18, 1996; Ghias, A., Logan, J., Chamberlin, D., and Smith, B., “Query by Humming—Musical Information Retrieval in an Audio Database,” in Proceedings ACM Multimedia 95, San Francisco, 1995). MIDI is a protocol describing how a piece of music is to be played on a synthesizer. It can be thought of as a set of instructions detailing each sound to be played. Conceptually, it is equivalent to having the musical score available.
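A minimal sketch of the melody contour idea follows. It assumes one common encoding, in which each note is labeled U (up), D (down), or S (same) relative to the previous note, and it uses plain edit distance as the string matching step. The function names and the choice of edit distance are assumptions for illustration, not the exact methods of the cited papers.

```python
def melody_contour(pitches):
    """Encode a sequence of MIDI note numbers as a U/D/S contour string."""
    contour = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            contour.append("U")
        elif cur < prev:
            contour.append("D")
        else:
            contour.append("S")
    return "".join(contour)

def edit_distance(a, b):
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]

# Hypothetical query and a transposed copy with one altered note.
query = melody_contour([60, 62, 64, 64, 62])
candidate = melody_contour([65, 67, 69, 69, 70])
print(query, candidate, edit_distance(query, candidate))
```

Because the contour records only the direction of each pitch change, a transposed copy of a tune yields the same string, which is one reason the representation suits hummed or sung queries.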
Other researchers have focused on analyzing the music content directly. Blum et al. present an indexing system based on matching audio features such as pitch, loudness or Mel-frequency cepstral coefficients (MFCCs) (Blum, T., Keislar, D., Wheaton, J., Wold, E., “Method and Article of Manufacture for Content-Based Analysis, Storage, Retrieval, and Segmentation of Audio Information,” U.S. Pat. No. 5,918,223, issued on Jun. 29, 1999). Foote has designed a music indexing system based on histograms of MFCC features derived from a discriminatively trained vector quantizer (Foote, J., “Content-Based Retrieval of Music and Audio,” Proceedings of SPIE, volume 3229, pp. 138–147, 1997).
A more recent publication analyzes audio using content analysis alone (Z. Liu and Q. Huang, “Content-Based Indexing and Retrieval by Example in Audio,” presented at ICME 2000, July 2000). The authors investigate the problem of finding speech by a particular speaker in a one-hour program. Because the program is not already divided into segments, they first segment the data into audio with similar characteristics using the Kullback-Leibler distance. They then produce a Gaussian mixture model for the MFCC features of each segment.
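The segmentation step can be illustrated with a simplified sketch. It assumes a single scalar feature per frame and a single Gaussian per window, and it scores each candidate boundary by the symmetric Kullback-Leibler distance between the windows on either side. The window scheme and the function names are assumptions for illustration; the cited publication's exact procedure is not reproduced here.

```python
from math import log
from statistics import fmean, pvariance

def fit_gaussian(samples, floor=1e-6):
    """Return (mean, variance) of the samples, with a floor on the variance."""
    mean = fmean(samples)
    return mean, max(pvariance(samples, mean), floor)

def kl_gaussian(p, q):
    """KL divergence KL(p || q) between two univariate Gaussians (mean, var)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    return 0.5 * (log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)

def symmetric_kl(p, q):
    """Symmetric KL distance: KL(p || q) + KL(q || p)."""
    return kl_gaussian(p, q) + kl_gaussian(q, p)

def boundary_scores(values, window):
    """Score each frame index t by the distance between the windows around it."""
    scores = []
    for t in range(window, len(values) - window + 1):
        left = fit_gaussian(values[t - window:t])
        right = fit_gaussian(values[t:t + window])
        scores.append((t, symmetric_kl(left, right)))
    return scores

# Hypothetical feature track whose statistics change at index 4.
track = [0.0, 0.1, -0.1, 0.0, 5.0, 5.1, 4.9, 5.0]
best_index, best_score = max(boundary_scores(track, window=2), key=lambda s: s[1])
print(best_index)
```

A peak in the score suggests a segment boundary. A real system would work on multi-dimensional MFCC vectors rather than a scalar track.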
They then use their own distance measure to compare the resulting “signatures” and retrieve audio similar to the desired query (Liu, Z. and Huang, Q., “A New Distance Measure for Probability Distribution Function of Mixture Types,” ICASSP 2000, May 2000). Their distance measure has been known in the vision research community for several years (Y. Rubner, C. Tomasi, and L. Guibas, “The Earth Mover's Distance as a Metric for Image Retrieval,” Technical Report STAN-CS-TN-98-86, Computer Science Department, Stanford University, September 1998).
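To give a feel for the Earth Mover's Distance, the sketch below covers only a special case: two histograms over the same ordered, evenly spaced bins, each normalized to unit mass. In that case the distance reduces to the sum of absolute differences between the running totals. Comparing general signatures, such as mixture models with different components, requires solving a transportation problem, which this sketch does not attempt. The function name `emd_1d` is an assumption.

```python
def emd_1d(p, q):
    """Earth Mover's Distance between two 1-D histograms on the same bins.

    Both histograms are normalized to unit mass, and neighboring bins are
    assumed to be one unit of distance apart.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    carried = 0.0  # mass that still has to move past the current bin
    work = 0.0     # total mass moved, weighted by distance
    for a, b in zip(p, q):
        carried += a / total_p - b / total_q
        work += abs(carried)
    return work

print(emd_1d([1, 0, 0], [0, 0, 1]))  # all mass moves two bins
print(emd_1d([1, 0, 0], [0, 1, 0]))  # all mass moves one bin
```

Unlike a bin-by-bin comparison, this measure grows with how far the mass has to move, so two distributions that are slightly shifted versions of each other are rated as close.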
Finally, several startups are working in the music similarity business and claim to at least partly use content-based analysis techniques. According to their website, CantaMetrix's (http://cantametrix.com) technology “analyzes the digital waveform of a piece of music, coding songs based on characteristics such as melody, rhythm and timbre to produce a digital ‘fingerprint.’ This information is then run through a ‘psycho-acoustic model’ based on responses from about 500 people who have rated a selection of songs based on psychological factors such as ‘upbeatness’ and ‘energy.’” (See http://www.cnn.com/2000/TECH/computing/09/08/mood.music.idg/index.html). There is no demo available for this technology.
Another company called MongoMusic (http://www.mongomusic.com) has a working demo on the web that allows users to find songs which are “similar” to those requested. This company was acquired by Microsoft in September 2000 (see http://www.microsoft.com/presspass/press/2000/Sept00/MongoPR.asp). The technology was incorporated into a beta version of Microsoft MSN in April 2001 (see http://www.microsoft.com/PressPass/features/2001/apr01/04-03msnmusic.asp).
The original demo at http://www.mongomusic.com seemed to work quite well. It could return songs similar to a chosen song from a database of unknown size. (Possibly the database held 160,000 songs, if it is the same one referred to in http://www.forbes.com/2000/09/09/feat2_print.html.) The beta version of Microsoft MSN (http://music.msn.com) appears to use MongoMusic's “sounds like” technology at the album rather than the song level.
There was some information on MongoMusic's original website about the workings of their technology. It appears to involve some human “quality assurance” after the original list of matches is returned. Here are some quotes from their press releases.
“[O]ne of the reasons the service works so well is that there is little human involvement in its Intuitive Music Search System [IMSS],” according to a spokesperson. “The differentiating factor between this and anything else that's out there at this time is that this is fundamentally based on the music itself, as opposed to being based on collaborative filtering or user preferences,” he explains. He describes the patent-pending technology as a “semi-automated, semi-human-based system.” Basically, IMSS matches songs based on musical characteristics such as tonality, rather than using pre-matched song lists. The company declines to elaborate further on its proprietary information (from http://www.thestandard.com/newletters/display/0,2098,112-160,00.html).
“The key to MongoMusic's future is a search technology that analyses music for certain attributes, such as tempo, mood, and beats-per-minute, so it can recommend similar songs that people might like,” according to a press release. “The customization is based on the analysis of massive music libraries, of which Sony is the first recording company to sign on with MongoMusic.” (From http://www.mongomusic.com/s/press_macnn—050900; also available at http://www.macnn.com/features/mongo.shtml.)
“A team of 35 full-time musicologists, or ‘groovers,’ looks at the computer's decisions and tweak them based on their own expertise, but they rarely reject its recommendations. The team includes Jeoff Stanfield, who plays bass in an alternative band called Black Lab, and Colt Tipton, the world's fiddling champion.”
“They may change the rankings of some tunes, or make some suggestions that are surprisingly right on—like a Beastie Boys song in the jazz category. But the computer analysis is really effective,” says Colleen Anderson, vice president of marketing at MongoMusic (from http://www.forbes.com/2000/09/09/feat2_print.html).