Individuals and organizations are rapidly accumulating large collections of audio content. As these collections grow, their owners will increasingly require systems and methods for organizing and summarizing the audio content so that desired audio content may be found quickly and easily. To meet this need, a variety of different systems and methods for summarizing and browsing audio content have been proposed. For example, several audio summarization approaches have focused on generating and browsing audio thumbnails, which are short, representative portions of original audio pieces.
In one approach for generating audio thumbnails, an audio piece is divided into uniformly spaced segments. Mel frequency cepstral coefficients (MFCCs) are computed for each segment. The segments then are clustered by thresholding a symmetric KL (Kullback-Leibler) divergence measure. The longest component of the most frequent cluster is returned as an audio thumbnail.
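The clustering and selection steps of this approach can be illustrated with a minimal sketch. It makes several assumptions that the description above does not state: the MFCC frames for each segment are already computed and passed in as lists of numbers, each segment is modeled as a diagonal Gaussian, clusters are formed greedily against the first member of each cluster, and the "longest component" is taken to be the longest run of consecutive segments with the same cluster label. The function names are hypothetical.

```python
from collections import Counter
from statistics import fmean, pvariance


def gaussian_stats(frames):
    """Per-dimension mean and variance of one segment's feature frames."""
    dims = list(zip(*frames))
    means = [fmean(d) for d in dims]
    # A small floor keeps the variance strictly positive (an assumption).
    variances = [pvariance(d) + 1e-6 for d in dims]
    return means, variances


def symmetric_kl(a, b):
    """KL(a||b) + KL(b||a) for two diagonal Gaussians (mean, variance)."""
    (mean_a, var_a), (mean_b, var_b) = a, b
    total = 0.0
    for m1, v1, m2, v2 in zip(mean_a, var_a, mean_b, var_b):
        d2 = (m1 - m2) ** 2
        total += 0.5 * (v1 / v2 + v2 / v1 - 2.0)
        total += 0.5 * d2 * (1.0 / v1 + 1.0 / v2)
    return total


def pick_thumbnail(segments, threshold):
    """Return (start_segment, run_length, labels) for the thumbnail."""
    stats = [gaussian_stats(s) for s in segments]
    representatives = []  # index of the first segment in each cluster
    labels = []
    for i, s in enumerate(stats):
        for cluster, rep in enumerate(representatives):
            if symmetric_kl(s, stats[rep]) < threshold:
                labels.append(cluster)
                break
        else:
            representatives.append(i)
            labels.append(len(representatives) - 1)

    top = Counter(labels).most_common(1)[0][0]  # most frequent cluster

    # Longest run of consecutive segments carrying the top label.
    best_start, best_len, run_start = 0, 0, None
    for i, label in enumerate(labels + [None]):
        if label == top:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_len, labels
```

The returned start index and run length identify which uniformly spaced segments would be concatenated to form the thumbnail; the threshold value is a tuning parameter that this sketch leaves to the caller.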
Another audio thumbnail based approach analyzes the structure of digital music based on a similarity matrix, which contains the results of all possible pairwise similarity comparisons between time windows in a digital audio piece. The similarity matrix is used to visualize and characterize the structure of the digital audio piece. The digital audio piece is segmented by correlating a kernel along a diagonal of the similarity matrix. Once the piece is segmented, spectral characteristics of each segment are computed. Segments then are clustered based on the self-similarity of their statistics. The digital audio piece is summarized by selecting clusters whose segments are repeated throughout the piece.
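The segmentation step of this approach can be sketched as follows. The sketch assumes cosine similarity between per-window feature vectors and a checkerboard-shaped kernel slid along the main diagonal; neither choice is specified in the description above, and the function names and the `half_width` parameter are hypothetical. Peaks in the resulting score suggest segment boundaries.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_matrix(features):
    """All pairwise similarities between time windows."""
    return [[cosine(u, v) for v in features] for u in features]


def novelty_curve(sim, half_width):
    """Correlate a checkerboard kernel along the main diagonal of sim.

    The kernel is +1 where both offsets fall on the same side of the
    center (past/past or future/future) and -1 where they fall on
    opposite sides, so the score is high where two internally similar
    regions meet and differ from each other.
    """
    n = len(sim)
    scores = [0.0] * n
    for t in range(half_width, n - half_width + 1):
        total = 0.0
        for i in range(-half_width, half_width):
            for j in range(-half_width, half_width):
                sign = 1.0 if (i < 0) == (j < 0) else -1.0
                total += sign * sim[t + i][t + j]
        scores[t] = total
    return scores
```

A boundary detector would then pick local maxima of the score above some threshold; how peaks are selected is left open here.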
In another audio thumbnail based approach, computer-readable data representing a musical piece is received and an audio summary that includes the main melody of the musical piece is generated. A component builder generates a plurality of composite and primitive components representing the structural elements of the musical piece and creates a hierarchical representation of the components. The most primitive components, representing notes within the musical piece, are examined to determine repetitive patterns within the composite components. A melody detector examines the hierarchical representation of the components and uses algorithms to detect which of the repetitive patterns is the main melody of the musical piece. Once the main melody is detected, the segment of the musical data containing the main melody is provided in one or more formats. Musical knowledge rules representing specific genres of musical styles may be used to assist the component builder and melody detector in determining which primitive component patterns are the most likely candidates for the main melody.
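The idea of examining note-level components for repetitive patterns can be illustrated with a deliberately simple stand-in. The description above does not specify the detection algorithms, the hierarchical representation, or the musical knowledge rules, so this sketch assumes only a flat sequence of note values (for example MIDI pitch numbers) and reports the fixed-length pattern that occurs most often. The function name and the `length` parameter are hypothetical.

```python
from collections import Counter


def most_repeated_pattern(notes, length):
    """Return (pattern, start_positions) for the most frequent note pattern.

    Counts every window of `length` consecutive notes, including
    overlapping ones, and returns None when no window occurs at least
    twice.
    """
    windows = [
        tuple(notes[i:i + length])
        for i in range(len(notes) - length + 1)
    ]
    if not windows:
        return None
    pattern, count = Counter(windows).most_common(1)[0]
    if count < 2:
        return None
    positions = [i for i, w in enumerate(windows) if w == pattern]
    return pattern, positions
```

A fuller system would compare candidate patterns of several lengths and weigh them with genre-specific rules before naming one the main melody; this sketch covers only the repetition count.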
In one known method for skimming digital audio/video (A/V) data, the video data is partitioned into video segments and the audio data is transcribed. Representative frames from each of the video segments are selected. The representative frames are combined to form an assembled video sequence. Keywords contained in the corresponding transcribed audio data are identified and extracted. The extracted keywords are assembled into an audio track. The assembled video sequence and audio track are output together.