In the field of music information retrieval, it is often desirable to be able to determine the degree of similarity between music files. For example, a user may have thousands of music files stored on a hard drive, and may wish to locate songs that “sound like” certain favorite songs. As another example, a Web service may wish to provide song recommendations for purchase based on the content of the music that is already stored on the user's hard drive. These examples illustrate a need to classify individual musical compositions in a quantitative manner based on highly subjective features, in order to facilitate rapid search and retrieval.
Classifying information that has subjectively perceived attributes or characteristics is difficult. When the information is one or more musical compositions, classification is complicated by the widely varying subjective perceptions of the musical compositions by different listeners. Different listeners may perceive a particular musical composition quite differently.
In the classical music context, musicologists have developed names for various attributes of musical compositions. Terms such as adagio, fortissimo, or allegro broadly describe the strength with which instruments in an orchestra should be played to properly render a musical composition from sheet music. In the popular music context, there is less agreement upon proper terminology. Composers indicate how to render their musical compositions with annotations such as brightly, softly, etc., but there is no consistent, concise, agreed-upon system for such annotations.
Musical compositions and other information are now widely available for sampling and purchase over global computer networks through online merchants such as AMAZON.COM®, BARNESANDNOBLE.COM®, CDNOW.COM®, etc. A prospective consumer can use a computer system equipped with a standard Web browser to contact an online merchant, browse an online catalog of pre-recorded music, select a song or collection of songs (“album”), and purchase the song or album for shipment direct to the consumer. In this context, online merchants and others desire to assist the consumer in making a purchase selection and desire to suggest possible selections for purchase.
A variety of classification and search approaches are now used. In one approach, a consumer selects a musical composition for listening or for purchase based on past positive experience with the same artist. This approach has a significant disadvantage in that artists often have music of widely varying types.
In another approach, a merchant classifies musical compositions into broad categories or genres. A disadvantage of this approach is that typically the genres are too broad. For example, a wide variety of qualitatively different albums and songs may be classified in the genre of “Popular Music” or “Rock and Roll.”
In still another approach, an online merchant presents a search page to a client associated with the consumer. The merchant receives selection criteria from the client for use in searching the merchant's catalog or database of available music. Normally the selection criteria are limited to song name, album title, or artist name. The merchant searches the database based on the selection criteria and returns a list of matching results to the client. The client selects one item in the list and receives further, detailed information about that item. The merchant also creates and returns one or more critics' reviews, customer reviews, or past purchase information associated with the item.
For example, the merchant may present a review by a music critic of a magazine that critiques the album selected by the client. The merchant may also present informal reviews of the album that have been previously entered into the system by other consumers. Further, the merchant may present suggestions of related music based on prior purchases of others. For example, in the approach of AMAZON.COM®, when a client requests detailed information about a particular album or song, the system displays information stating, “People who bought this album also bought . . . ” followed by a list of other albums or songs. The list of other albums or songs is derived from actual purchase experience of the system. This is called “collaborative filtering.”
However, the use of this approach by itself has a significant disadvantage, namely that the suggested albums or songs are based on extrinsic similarity as indicated by purchase decisions of others, rather than based upon objective similarity of intrinsic attributes of a requested album or song and the suggested albums or songs. A decision by another consumer to purchase two albums at the same time does not indicate that the two albums are objectively similar or even that the consumer liked both. For example, the consumer might have bought one for the consumer and the second for a third party having greatly differing subjective taste than the consumer.
Another disadvantage of this type of collaborative filtering is that output data is normally available only for complete albums and not for individual songs. Thus, a first album that the consumer likes may be broadly similar to a second album, but the second album may contain individual songs that are strikingly dissimilar from the first album, and the consumer has no way to detect or act on such dissimilarity.
Still another disadvantage of collaborative filtering is that it requires a large mass of historical data in order to provide useful search results. The search results indicating what others bought are only useful after a large number of transactions, so that meaningful patterns and meaningful similarity emerge. Moreover, early transactions tend to over-influence later buyers, and popular titles tend to self-perpetuate.
In yet another approach, digital signal processing (DSP) analysis can be used to try to match characteristics from song to song. U.S. Pat. No. 5,918,223, assigned to Muscle Fish, a corporation of Berkeley, Calif. (hereinafter the Muscle Fish Patent), describes a DSP analysis technique. The Muscle Fish Patent describes a system having two basic components, typically implemented as software running on a digital computer. The two components are the analysis of sounds (digital audio data), and the retrieval of these sounds based upon statistical or frame-by-frame comparisons of the analysis results. In that system, the process first measures a variety of acoustical features of each sound file and the choice of which acoustical features to measure is critical to the success of the process. Loudness, bass, pitch, brightness, bandwidth, and Mel frequency cepstral coefficients (MFCCs) at periodic intervals (referred to as “frames”) over the length of the sound file are measured. The per-frame values are optionally stored, for applications that require that level of detail. Next, the per-frame first derivative of each of these features is computed. Specific statistical measurements of each of these features are computed to describe their variation over time. The specific statistical measurements that are computed are the mean and standard deviation. The first derivatives are also included. This set of statistical measurements is represented as an N-vector (a vector with N elements), referred to as the rhythm feature vector for music.
Once the feature vector of the sound file has been stored in a database with a corresponding link to the original data file, the user can query the database in order to access the corresponding sound files. The database system must be able to measure the distance in N-space between two N-vectors.
The sound file database can be searched by four specific methods, enumerated below. The result of these searches is a list of sound files rank-ordered by distance from the specified N-vector, which corresponds to sound files that are most similar to the specified N-vector or average N-vector of a user grouping of songs.                1. Simile: The search is for sounds that are similar to an example sound file, or a list of example sound files.        2. Acoustical/perceptual features: The search is for sounds in terms of commonly understood physical characteristics, such as brightness, pitch and loudness.        3. Subjective features: The search is for sounds using individually defined classes. One example would be to be searching for a sound that is both “shimmering” and “rough,” where the classes “shimmering” and “rough” have been previously defined by a grouping. Classes of sounds can be created (e.g., “bird sounds,” “rock music,” etc.) by specifying a set of sound files that belong to this class. The average N-vector of these sound files will represent this sound class in N-space for purposes of searching. However, this requires ex post facto grouping of songs that are thought to be similar.        4. Onomatopoeia: Involves producing a sound similar in some quality to the sound that is being searched for. One example is to produce a buzzing sound into a microphone in order to find sounds like bees or electrical hum.        
While DSP analysis may be effective for some groups or classes of songs, it is ineffective for others, and there has so far been no technique for determining what makes the technique effective for some music and not others. Specifically, such acoustical analysis as has been implemented thus far suffers defects because 1) the effectiveness of the analysis is being questioned regarding the accuracy of the results, thus diminishing the perceived quality by the user and 2) recommendations are only generally made by current systems if the user manually types in a desired artist or song title, or group of songs from that specific Web site. Accordingly, DSP analysis, by itself, is unreliable and thus insufficient for widespread commercial or other use. Another problem with the DSP analysis is that it ignores the observed fact that oftentimes, sounds with similar attributes as calculated by a digital signal processing algorithm will be perceived as sounding very different. This is because, at present, no previously available digital signal processing approach can match the ability of the human brain for extracting salient information from a stream of data. As a result, all previous attempts at signal classification using digital signal processing techniques miss important aspects of a signal that the brain uses for determining similarity.
In addition, previous attempts at classification based on connectionist approaches, such as artificial neural networks (ANN), and Self-organizing Feature Maps (SOFM), have had only limited success classifying sounds based on similarity. This has to do with the difficulties in training ANN's and SOFM's. The amount of computing resources required to train ANN's and SOFM of the required complexity tend to be cost and resource prohibitive.
The present invention is directed to providing a system that overcomes the foregoing and other disadvantages. More specifically, the present invention is directed to a system and method for determining the similarity of musical recordings based on a perceptual metric.