Media data may comprise representative segments that are capable of making lasting impressions on listeners or viewers. For example, most popular songs follow a specific structure that alternates between a verse section and a chorus section. Usually, the chorus section is the most frequently repeated section in a song and also the “catchy” part of the song. The position of chorus sections typically relates to the underlying song structure, and may be used to help an end-user browse a song collection.
Thus, on the encoding side, the position of a representative segment such as a chorus section may be identified in media data such as a song, and may be associated with the encoded bitstream of the song as metadata. On the decoding side, the metadata enables the end-user to start the playback at the position of the chorus section. When a collection of media data such as a song collection at a store is being browsed, chorus playback facilitates instant recognition and identification of known songs, and fast assessment of whether unknown songs in the collection are liked or disliked.
In a “clustering approach” (or a state approach), a song may be segmented into different sections using clustering techniques. The underlying assumption is that the different sections (such as verse, chorus, etc.) of a song share certain properties that discriminate one section from the other sections or other parts of the song.
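As a rough illustration of the clustering approach described above, the following minimal sketch groups per-frame feature vectors into k clusters and then merges consecutive frames with the same cluster label into sections. It is not taken from any particular system: the function names (kmeans_labels, sections_from_labels), the use of plain k-means, the deterministic initialization, and the one-dimensional toy features are all assumptions made for illustration only.

```python
import math

def kmeans_labels(features, k, iterations=20):
    """Assign each frame's feature vector to one of k clusters (plain k-means).

    Hypothetical helper: the initial centroids are simply k feature vectors
    spaced evenly through the clip, so the result is deterministic.
    """
    step = len(features) // k
    centroids = [features[i * step] for i in range(k)]
    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(f, centroids[c]))
                  for f in features]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels

def sections_from_labels(labels):
    """Merge runs of identical labels into (start, end, label) sections,
    where end is exclusive."""
    sections = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            sections.append((start, i, labels[start]))
            start = i
    return sections

# Toy one-dimensional features: low values stand in for "verse-like" frames,
# high values for "chorus-like" frames (assumed data, not real audio).
toy_features = [(0.1,), (0.0,), (0.2,), (1.0,), (0.9,),
                (1.1,), (0.1,), (0.0,), (1.0,), (0.9,)]
toy_labels = kmeans_labels(toy_features, k=2)
toy_sections = sections_from_labels(toy_labels)
```

In this sketch, frames that share similar properties receive the same label, and section boundaries fall wherever the label changes. Which label corresponds to a chorus is not decided here.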
In a “pattern matching approach” (or a sequence approach), it is assumed that a chorus is a repetitive section in a song. Repetitive sections may be identified by matching different sections of the song with one another.
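The matching of sections against one another may be sketched as follows. This is a minimal brute-force illustration under stated assumptions, not a definitive method: the function names (segment_distance, most_repeated_segment), the fixed segment length, the distance threshold, and the toy feature values are all hypothetical.

```python
import math

def segment_distance(features, i, j, seg_len):
    """Mean Euclidean distance between the segment starting at frame i
    and the segment starting at frame j, compared frame by frame."""
    total = sum(math.dist(features[i + k], features[j + k])
                for k in range(seg_len))
    return total / seg_len

def most_repeated_segment(features, seg_len, threshold):
    """Return (start_frame, repeat_count) for the segment that matches the
    largest number of other, non-overlapping segments.

    Two segments "match" when their mean distance is at most `threshold`
    (an assumed tuning parameter). Ties keep the earliest start frame.
    """
    n_starts = len(features) - seg_len + 1
    best_start, best_count = None, 0
    for i in range(n_starts):
        count = 0
        for j in range(n_starts):
            # Skip overlapping segments (including the segment itself).
            if abs(i - j) >= seg_len and \
                    segment_distance(features, i, j, seg_len) <= threshold:
                count += 1
        if count > best_count:
            best_start, best_count = i, count
    return best_start, best_count

# Toy one-dimensional features: the pattern (5, 6) recurs three times,
# separated by passages that differ from one another (assumed data).
toy_features = [(v,) for v in
                [0.0, 0.0, 5.0, 6.0, 1.0, 1.0, 5.0, 6.0, 2.0, 2.0, 5.0, 6.0]]
start, repeats = most_repeated_segment(toy_features, seg_len=2, threshold=0.1)
```

Here the recurring pattern is selected as the candidate repetitive section because it matches more other segments than any other segment does. Note that every segment is compared with every other segment, so the cost grows quadratically with the number of frames.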
Both the “clustering approach” and the “pattern matching approach” require computing a distance matrix from an input audio clip. To do so, the input audio clip is divided into N frames, and features are extracted from each of the frames. Then, a distance is computed between every pair of frames that can be formed from the N frames of the input audio clip. The derivation of this matrix is computationally expensive and requires high memory usage, because a distance needs to be computed for every one of these combinations (that is, on the order of N×N distance computations, where N is the number of frames in a song or an input audio clip therein).
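The framing, feature extraction, and pairwise distance steps described above may be sketched as follows. This is a minimal illustration under stated assumptions: the function names (frame_signal, frame_features, distance_matrix), the two toy features (RMS energy and zero-crossing rate), and the Euclidean distance are chosen here for illustration and are not specified by the text.

```python
import math

def frame_signal(samples, frame_len):
    """Split a list of samples into N non-overlapping frames of frame_len
    samples each; any trailing remainder is dropped."""
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

def frame_features(frame):
    """Toy feature vector for one frame: (RMS energy, zero-crossing rate).
    Assumes the frame holds at least two samples."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a < 0) != (b < 0))
    return (rms, crossings / (len(frame) - 1))

def distance_matrix(features):
    """Build the full N x N matrix of Euclidean distances between frames.

    Returns the matrix and the number of distance evaluations performed.
    Symmetry is exploited, so N*(N-1)/2 distances are evaluated, but the
    matrix itself still holds N*N entries.
    """
    n = len(features)
    dist = [[0.0] * n for _ in range(n)]
    evaluations = 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(features[i], features[j])
            dist[i][j] = dist[j][i] = d
            evaluations += 1
    return dist, evaluations

# Toy clip (assumed data): two alternating frame types, repeated twice.
toy_samples = ([0.5, -0.5] * 4 + [0.1, 0.1] * 4) * 2
toy_frames = frame_signal(toy_samples, frame_len=8)
toy_features = [frame_features(f) for f in toy_frames]
toy_matrix, toy_evaluations = distance_matrix(toy_features)
```

Even when symmetry halves the number of distance evaluations, both the computation and the storage still grow quadratically with N, which is the cost noted above.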
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.