1. Field of the Invention
This invention relates generally to multimedia applications, databases and search engines, and more specifically, to a computer-based system and method for automatically generating a summary of a song.
2. Background Information
Governmental, commercial and educational enterprises often utilize database systems to store information. Much of this information is often in text format, and can thus be easily searched for key phrases or for specified search strings. Due to recent advances in both storage capacity and processing power, many database systems now store audio files in addition to the more conventional text-based files. For example, digital juke boxes that can store hundreds, if not thousands, of songs have been developed. A user can select any of these songs for downloading and/or playback. Several commercial entities have also begun selling music, such as CD-ROMs, over the Internet. These entities allow users to search for, select and purchase the CD-ROMs using a browser application, such as Internet Explorer from Microsoft Corp. of Redmond, Wash. or Netscape Navigator from America Online, Inc. of Dulles, Va.
Since there is currently no mechanism for efficiently searching the content of audio files, system administrators typically append conventional, text-based database fields, such as title, author, date, format, keywords, etc. to the audio files. These conventional database fields can then be searched textually by a user to locate specific audio files. For Internet or on-line music systems, the generation of such database fields for each available CD-ROM and/or song can be time-consuming and expensive. It is also subject to data entry and other errors.
For users who do not know the precise title or the artist of the song they are interested in, such text-based search techniques are of limited value. Additionally, a search of database fields for a given search string may identify a number of corresponding songs, even though the user may only be looking for one specific song. In this case, the user may have to listen to substantial portions of the songs to identify the specific song he or she is interested in. Users may also wish to identify the CD-ROM on which a particular song is located. Again, the user may have to listen to significant portions of each song on each CD-ROM in order to locate the particular song that he or she wants. Rather than force users to listen to the first few minutes of each song, a short segment of each song or CD-ROM could be manually extracted and made available to the user for review. The selection of such song segments, however, would be highly subjective and again would be time-consuming and expensive to produce.
Systems have been proposed that allow a user to search audio files by humming or whistling a portion of the song he or she is interested in. These systems process this user input and return the matching song(s). Viable commercial systems employing such melodic query techniques, however, have yet to be demonstrated.
One aspect of the present invention is the recognition that many songs, especially songs in the rock and popular (xe2x80x9cpopxe2x80x9d) genres, have specific structures, including repeating phrases or structural elements, such as the chorus or refrain, that are relatively short in duration. These repeating phrases, moreover, are often well known, and can be used to quickly identify specific songs. That is, a user can identify a song just by hearing this repeating phrase or element. Nonetheless, these repeating phrases often do not occur at the beginning of a song. Instead, the first occurrence of such a repeating phrase may not take place for some time, and the most memorable example of the phrase may be its third or fourth occurrence within the song. The present invention relates to a system for analyzing songs and identifying a relatively short, identifiable xe2x80x9ckey phrasexe2x80x9d, such as a repeating phrase that may be used as a summary for the song. This key phrase or summary may then be used as an index to the song.
According to the invention, the song, or a portion thereof, is digitized and converted into a sequence of feature vectors. In the illustrative embodiment, the feature vectors correspond to mel-frequency cepstral coefficients (MFCCs). The feature vectors are then processed in a novel manner in order decipher the song""s structure. Those sections that correspond to different structural elements can then be marked with identifying labels. Once the song is labeled, various rules or heuristics are applied to select a key phrase for the song. For example, the system may determine which label appears most frequently within the song, and then select at least some portion of the longest occurrence of that label as the summary.
The deciphering of a song""s structure may be accomplished by dividing the song into fixed-length segments, analyzing the feature vectors of the corresponding segments and combining like segments into clusters by applying a distortion algorithm. Alternatively, the system may employ a Hidden Markov Model (HMM) approach in which a specific number of HMM states are selected so as to correspond to the song""s labels. After training the HMM, the song is analyzed and an optimization technique is used to determine the most likely HMM state for each frame of the song.