1. Field of the Invention
The present invention relates to audio segmentation and in particular to the analysis of pieces of music with regard to the individual main parts contained in them, which may occur repeatedly in the piece of music.
2. Description of the Related Art
Music from the rock and pop area mostly consists of more or less unique segments, such as intro, stanza, refrain, bridge, outro, etc. It is the aim of the audio segmentation to detect the starting and end time instants of such segments and to group the segments according to their membership in the most important classes (stanza and refrain). Correct segmentation and also characterization of the calculated segments may be sensibly employed in various areas. For example, pieces of music from online providers, such as Amazon, Musicline, etc., may be intelligently “intro scanned”.
Most providers on the Internet limit themselves to a short excerpt from the pieces of music offered in their listening examples. In this case it would of course also make sense to offer the person interested not only the first 30 seconds or any 30 seconds but a most representative excerpt from the song. This could for example be the refrain or a summary of the song, consisting of segments belonging to the various main classes (stanza, refrain, . . . ).
A further application of audio segmentation is the integration of the segmentation/grouping/marking algorithm into a music player. The information on segment beginnings and segment ends enables targeted navigation through a piece of music. Owing to the class membership of the segments, i.e. whether a segment is a stanza, a refrain, etc., it also becomes possible, for example, to jump directly to the next refrain or to the next stanza. Such an application is of interest for large music stores offering their customers the possibility to listen to complete albums. The customer is thereby spared the troublesome search by fast-forwarding for characteristic parts of the song, which might in fact lead him to buy the piece of music in the end.
In the field of audio segmentation, various approaches exist. In the following, the approach of Jonathan Foote and Matthew Cooper is illustrated by way of example. This method is described in FOOTE, J. T./COOPER, M. L.: Summarizing Popular Music via Structural Similarity Analysis. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2003, and in FOOTE, J. T./COOPER, M. L.: Media Segmentation using Self-Similar Decomposition. Proceedings of SPIE Storage and Retrieval for Multimedia Databases, Vol. 5021, pp. 167-75, Jan. 2003.
The known method of Foote is explained by way of example on the basis of the block diagram of FIG. 5. At first, a WAV file 500 is provided. In a downstream extraction block 502, feature extraction takes place, wherein the spectral coefficients as such or, alternatively, the mel frequency cepstral coefficients (MFCCs) are extracted as features. Before this extraction, a short-time Fourier transform (STFT) with non-overlapping windows 0.05 seconds wide is applied to the WAV file. The MFCC features are then extracted in the spectral domain. It is to be pointed out here that the parameterization is not optimized for compression, transmission, or reconstruction, but for audio analysis. The requirement is that similar audio pieces generate similar features.
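The windowing and spectral feature extraction described above may be sketched as follows. This is a minimal illustration only, not the implementation of block 502: the function names, the toy sine signal and the sample rate are assumptions, a naive DFT stands in for an optimized STFT, and the MFCC step is omitted.

```python
import cmath
import math

def frame_signal(samples, sample_rate, window_seconds=0.05):
    """Split the samples into non-overlapping windows of window_seconds each."""
    size = int(round(sample_rate * window_seconds))
    count = len(samples) // size  # an incomplete trailing window is dropped
    return [samples[k * size:(k + 1) * size] for k in range(count)]

def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 of one window (naive O(N^2) DFT)."""
    n = len(frame)
    spectrum = []
    for b in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * b * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum

def feature_matrix(samples, sample_rate, window_seconds=0.05):
    """One row per analysis window, one column per spectral coefficient."""
    frames = frame_signal(samples, sample_rate, window_seconds)
    return [magnitude_spectrum(f) for f in frames]

# Toy input (assumption): 0.2 s of a 100 Hz sine sampled at 800 Hz,
# which yields 4 windows of 40 samples each.
sample_rate = 800
samples = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(160)]
features = feature_matrix(samples, sample_rate)
```

With 40 samples per window the bin spacing is 20 Hz, so the energy of the 100 Hz tone falls into bin 5 of every row. Identical windows thus produce identical feature vectors, which is the property required of the features.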
The extracted features are then filed in a memory 504.
After the feature extraction, a segmentation algorithm follows, which first leads to a similarity matrix, as illustrated in block 506. At first, the feature matrix is read (508), then feature vectors are grouped (510), and then a similarity matrix is constructed from the grouped feature vectors, which consists of a distance measure between all pairs of features. In detail, all pairwise combinations of audio windows are compared using a quantitative similarity measure, i.e. the distance.
The construction of the similarity matrix is illustrated in FIG. 8. In FIG. 8 the piece of music is illustrated as stream 800 of audio samples. The audio piece is, as has been detailed, windowed, wherein a first window is designated with i and a second window with j. Altogether, the audio piece has K windows, for example. This means that the similarity matrix has K rows and K columns. Then for each window i and for each window j a similarity measure to each other is calculated, wherein the calculated similarity measure or distance measure D(i,j) is input at the row or column designated by i and j, respectively, in the similarity matrix. A column thus shows the similarity of the window designated by j to all other audio windows in the piece of music. The similarity of the window j to the very first window of the piece of music would then be in the column j and in the row 1. The similarity of the window j to the second window of the piece of music would then be in the column j, but now in row 2. On the other hand, the similarity of the second window to the first window would be in the second column of the matrix and in the first row of the matrix.
It can be seen that the matrix is redundant in that it is symmetrical to the diagonal and that on the diagonal there is the similarity of the window to itself, which illustrates the trivial case of 100% similarity.
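The construction of the K-by-K matrix and its redundancy can be illustrated with a short sketch. The helper names and the toy measure are hypothetical; the sketch assumes a symmetric measure, so that only the upper triangle has to be computed and the rest is mirrored.

```python
def similarity_matrix(features, measure):
    """Fill a K x K matrix; entry d[i][j] compares window i with window j."""
    k = len(features)
    d = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):       # upper triangle including the diagonal
            value = measure(features[i], features[j])
            d[i][j] = value
            d[j][i] = value         # mirrored entry: the matrix is symmetric
    return d

def absolute_difference(a, b):
    """Toy measure (assumption): summed absolute difference, 0 means identical."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Three windows with two feature coefficients each; windows 0 and 2 are identical.
windows = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
D = similarity_matrix(windows, absolute_difference)
```

For this toy input the diagonal holds the value for identity (0 for this distance-type measure), and the entry in row 0, column 2 shows that the first and third windows are identical.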
An example of a similarity matrix of a piece can be seen in FIG. 6. Here again, the completely symmetrical structure of the matrix with reference to the main diagonal can be recognized, wherein the main diagonal can be seen as a bright strip. Furthermore, it is pointed out that, since the window lengths are small in comparison with the relatively coarse time resolution of FIG. 6, the main diagonal is not seen as a bright continuous line there, but is only approximately recognizable.
In a next step, using the similarity matrix as illustrated, for example, in FIG. 6, a kernel correlation 512 with a kernel matrix 514 is performed to obtain a novelty measure, also known as “novelty score”, which may be averaged and which is illustrated in smoothed form in FIG. 9. The smoothing of this novelty score is schematically illustrated in FIG. 5 by a block 516.
Then, in a block 518, the segment boundaries are read out using the smoothed novelty curve. For this, local maxima of the smoothed novelty curve have to be determined and, if required, shifted by the constant number of samples of delay caused by the smoothing, in order to obtain the correct segment boundaries of the audio piece as absolute or relative time indications.
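The read-out of segment boundaries from local maxima, including the compensation of the constant shift, may be sketched as follows. The function name, the `delay_windows` parameter and the optional `threshold` are assumptions made for illustration and are not taken from the known method.

```python
def segment_boundaries(smoothed, window_seconds, delay_windows=0, threshold=0.0):
    """Return boundary times in seconds at local maxima of the smoothed novelty."""
    times = []
    for i in range(1, len(smoothed) - 1):
        is_peak = smoothed[i - 1] < smoothed[i] >= smoothed[i + 1]
        if is_peak and smoothed[i] > threshold:
            index = i - delay_windows   # undo the constant shift of the smoothing
            if index >= 0:
                times.append(index * window_seconds)
    return times

# Toy curve with two maxima at window indices 2 and 6; the smoothing filter is
# assumed to have delayed the curve by one window of 0.05 seconds.
curve = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 5.0, 2.0, 0.0]
boundaries = segment_boundaries(curve, window_seconds=0.05, delay_windows=1)
```

Here the two maxima are mapped back to window indices 1 and 5, i.e. to boundary times of about 0.05 and 0.25 seconds relative to the start of the piece.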
Then, as can be seen from FIG. 5 in the block designated “clustering”, a so-called segment similarity representation or segment similarity matrix is established, as shown in block 520. An example of a segment similarity matrix is illustrated in FIG. 7. The similarity matrix of FIG. 7 is in principle similar to the feature similarity matrix of FIG. 6, except that features from a whole segment are now used instead of features from windows as in FIG. 6. The segment similarity matrix has a meaning similar to that of the feature similarity matrix, but with a substantially coarser resolution, which is, of course, desired considering that window lengths lie in the range of 0.05 seconds, whereas reasonably long segments lie in the range of perhaps 10 seconds of a piece.
Then, in a block 522, clustering is performed, i.e. a classification of the segments into segment classes (a classification of similar segments into the same segment class), in order to then mark the segment classes found in a block 524, which is also designated as “labeling”. In the labeling, it is determined which segment class contains segments that are stanzas, refrains, intros, outros, bridges, etc.
Finally, in a block designated with 526 in FIG. 5, a music summary is established, which may for example be provided to a user in order to hear only e.g. a stanza, a refrain and the intro of a piece without redundancy.
Subsequently, the individual blocks are discussed in greater detail.
As has already been explained, the actual segmentation of the piece of music takes place only after the feature matrices have been generated and stored (block 504).
Depending on which feature is to be used for examining the structure of the piece of music, the corresponding feature matrix is read out and loaded into a working memory for further processing. The feature matrix has the dimension of the number of analysis windows by the number of feature coefficients.
By means of the similarity matrix, the feature course of a piece is brought into a two-dimensional representation. For each pairwise combination of feature vectors, the distance measure is calculated and stored in the similarity matrix. For the calculation of the distance measure between two vectors, there are various possibilities, for example the Euclidean distance measure and the cosine distance measure. A result D(i,j) between the two feature vectors is stored in the (i,j)-th element of the window similarity matrix (block 506). The main diagonal of the similarity matrix represents the course of the entire piece. Accordingly, the elements of the main diagonal result from the respective comparison of a window with itself and always have the value of the greatest similarity. For the cosine distance measure, this is the value 1; for the simple scalar difference and the Euclidean distance, this value equals 0.
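The two distance measures named above and their values on the main diagonal can be checked with a small sketch. The function names and the example vectors are chosen for illustration only.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance: 0 for identical vectors, larger means less similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_measure(a, b):
    """Cosine of the angle between two vectors: 1 for identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0.0 else 0.0  # all-zero vector: 0 by convention

v = [3.0, 4.0]
w = [4.0, -3.0]   # orthogonal to v

self_euclidean = euclidean_distance(v, v)   # comparison of a window with itself
self_cosine = cosine_measure(v, v)
cross_cosine = cosine_measure(v, w)
```

Comparing a vector with itself gives 0 for the Euclidean distance and 1 for the cosine measure, so "greatest similarity" means the smallest value for the former and the largest value for the latter. Any later step that looks for similar regions has to take this difference in orientation into account.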
For the visualization of a similarity matrix as illustrated in FIG. 6, each element i, j is assigned a gray level. The gray levels are graded proportionally to the similarity values, so that the maximum similarity (the main diagonal) corresponds to the maximum brightness. With this illustration, the structure of a song can already be recognized visually from the matrix. Regions of similar feature expression correspond to square blocks of similar brightness along the main diagonal. It is the task of the actual segmentation to find the boundaries between these regions.
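The assignment of gray levels may be sketched as a linear mapping onto the range 0 to 255. The function name, the 8-bit range and the `higher_is_similar` flag are assumptions; the flag inverts the scale for distance-type measures, for which the most similar pair has the smallest value.

```python
def to_gray(matrix, higher_is_similar=True):
    """Map matrix values linearly to 0..255 so that the most similar pair is brightest."""
    flat = [v for row in matrix for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    gray = []
    for row in matrix:
        out = []
        for v in row:
            if span == 0:
                level = 1.0                 # constant matrix: everything is brightest
            else:
                level = (v - lo) / span
                if not higher_is_similar:
                    level = 1.0 - level     # small distance means high brightness
            out.append(int(round(255 * level)))
        gray.append(out)
    return gray

# Toy distance matrix (assumption): zeros on the main diagonal, larger values
# for less similar windows.
distances = [[0.0, 1.0, 4.0],
             [1.0, 0.0, 2.0],
             [4.0, 2.0, 0.0]]
gray = to_gray(distances, higher_is_similar=False)
```

The main diagonal is mapped to the brightest level, and the least similar pair of windows is mapped to black.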
The structure of the similarity matrix is important for the novelty measure calculated in the kernel correlation 512. The novelty measure results from the correlation of a special kernel along the main diagonal of the similarity matrix. An exemplary kernel K is illustrated in FIG. 5. If this kernel matrix is correlated along the main diagonal of the similarity matrix S, and all products of the overlying matrix elements are summed for each time instant i of the piece, the novelty measure is obtained, which is illustrated by way of example in smoothed form in FIG. 9. Preferably, not the kernel K of FIG. 5 is used, but an enlarged kernel that is additionally weighted with a Gaussian distribution, so that the values at the edges of the kernel matrix tend toward 0.
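The kernel correlation along the main diagonal may be sketched as follows. This is an illustration under stated assumptions, not the kernel of FIG. 5: a checkerboard kernel with positive diagonal quadrants is assumed, the similarity matrix is assumed to hold large values for similar windows, positions outside the matrix are treated as zero, and the names `checkerboard_kernel`, `novelty_score`, `half` and `sigma` are hypothetical.

```python
import math

def checkerboard_kernel(half, sigma=None):
    """Kernel of size 2*half x 2*half: +1 on the two diagonal quadrants, -1 elsewhere.
    If sigma is given, a radial Gaussian weighting pulls the edges toward 0."""
    size = 2 * half
    kernel = []
    for r in range(size):
        row = []
        for c in range(size):
            sign = 1.0 if (r < half) == (c < half) else -1.0
            weight = 1.0
            if sigma is not None:
                y = r - half + 0.5          # offsets from the kernel centre
                x = c - half + 0.5
                weight = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row.append(sign * weight)
        kernel.append(row)
    return kernel

def novelty_score(sim, kernel):
    """Slide the kernel along the main diagonal and sum the element-wise products."""
    k = len(sim)
    half = len(kernel) // 2
    score = []
    for i in range(k):
        total = 0.0
        for r, kernel_row in enumerate(kernel):
            for c, weight in enumerate(kernel_row):
                m, n = i + r - half, i + c - half
                if 0 <= m < k and 0 <= n < k:   # outside the matrix counts as zero
                    total += weight * sim[m][n]
        score.append(total)
    return score

# Toy similarity matrix: windows 0-3 form one segment, windows 4-7 another.
sim = [[1.0 if (m < 4) == (n < 4) else 0.0 for n in range(8)] for m in range(8)]
score = novelty_score(sim, checkerboard_kernel(2))
tapered = novelty_score(sim, checkerboard_kernel(2, sigma=1.0))
```

In this toy case both the plain and the Gaussian-weighted kernel yield the largest novelty value at index 4, the first window of the second segment. The non-zero value at the very beginning of the piece is an artifact of the zero treatment at the matrix border.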
The selection of the prominent maxima in the novelty curve is important for the segmentation. Selecting all maxima of the unsmoothed novelty curve would lead to a strong over-segmentation of the audio signal.
Therefore, the novelty measure should be smoothed, for which various filters, such as IIR filters or FIR filters, may be used.
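Two simple smoothing filters may serve as an illustration. Both are assumptions chosen for brevity, namely a causal moving average as the FIR filter and a first-order low-pass as the IIR filter; the source does not specify which filters are used.

```python
def fir_moving_average(values, taps):
    """Causal FIR filter: mean over the current and the previous taps-1 values,
    with values before the start of the curve taken as zero."""
    out = []
    for i in range(len(values)):
        start = max(0, i - taps + 1)
        out.append(sum(values[start:i + 1]) / taps)
    return out

def iir_one_pole(values, alpha):
    """First-order IIR low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n-1]."""
    out = []
    previous = 0.0
    for x in values:
        previous = alpha * x + (1.0 - alpha) * previous
        out.append(previous)
    return out

# Toy novelty curve (assumption): one prominent maximum at index 4 and two
# small spurious maxima at indices 1 and 7.
raw = [0.0, 1.0, 0.0, 2.0, 8.0, 2.0, 0.0, 1.0, 0.0]
smoothed = fir_moving_average(raw, taps=3)
```

After smoothing, the prominent maximum appears at index 5 instead of index 4. This delay of (taps - 1) / 2 windows is a constant shift that has to be compensated when the segment boundaries are read out.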
Once the segment boundaries of a piece of music have been extracted, similar segments have to be identified as such and grouped into classes.
Foote and Cooper describe the calculation of a segment-based similarity matrix by means of a Kullback-Leibler distance. For this, on the basis of the segment boundaries acquired from the novelty curve, individual segment feature matrices are extracted from the entire feature matrix, i.e. each of these matrices is a sub-matrix of the entire feature matrix. The segment similarity matrix 520 thus obtained is then subjected to a singular value decomposition (SVD), which yields singular values in decreasing order.
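The extraction of segment sub-matrices and a Kullback-Leibler-based comparison may be sketched as follows. The sketch assumes that each segment is modelled by a Gaussian with diagonal covariance and that the symmetrized divergence is used as the distance; these modelling choices, the variance floor and all function names are assumptions. The subsequent singular value decomposition is not shown.

```python
import math

def split_segments(features, boundaries):
    """Cut the feature matrix (rows = windows) into sub-matrices at the boundaries."""
    edges = [0] + list(boundaries) + [len(features)]
    return [features[a:b] for a, b in zip(edges[:-1], edges[1:])]

def gaussian_stats(segment, floor=1e-6):
    """Per-coefficient mean and variance of one segment (variance floored)."""
    n = len(segment)
    dims = len(segment[0])
    means = [sum(row[d] for row in segment) / n for d in range(dims)]
    variances = [max(sum((row[d] - means[d]) ** 2 for row in segment) / n, floor)
                 for d in range(dims)]
    return means, variances

def kl_divergence(p, q):
    """KL divergence between two diagonal-covariance Gaussians given as (means, variances)."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    return 0.5 * sum(math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
                     for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q))

def segment_distance_matrix(features, boundaries):
    """Symmetrized KL distance between every pair of segments."""
    stats = [gaussian_stats(s) for s in split_segments(features, boundaries)]
    return [[kl_divergence(p, q) + kl_divergence(q, p) for q in stats] for p in stats]

# Toy feature matrix (assumption): segments 0 and 2 repeat, segment 1 differs.
features = [[1.0, 0.0], [1.2, 0.1],
            [5.0, 3.0], [5.2, 3.1],
            [1.0, 0.0], [1.2, 0.1]]
segment_distances = segment_distance_matrix(features, boundaries=[2, 4])
```

In the toy example the first and third segments have a distance of zero to each other and a large distance to the second segment, which is the kind of structure the clustering step is meant to detect.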
In block 526, then an automatic summary of a piece is performed on the basis of the segments and the clusters of a piece of music. For this, at first the two clusters with the greatest singular values are selected. Then the segment with the maximum value of the corresponding cluster indicator is added to this summary. This means that the summary includes a stanza and a refrain. Alternatively, also all repeated segments may be removed to ensure that all information of the piece is provided, but always exactly once.
With regard to further techniques for segmentation/music analysis, reference is made to CHU, S./LOGAN, B.: Music Summary using Key Phrases. Technical Report, Cambridge Research Laboratory 2000, and to BARTSCH, M. A./WAKEFIELD, G. H.: To Catch a Chorus: Using Chroma-Based Representation for Audio Thumbnailing. Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics 2001. http://musen.engin.umich.edu/papers/bartsch wakefield waspaa01final.pdf.
It is disadvantageous in the known method that the singular value decomposition (SVD) for segment class formation, i.e. for assigning segments to clusters, is on the one hand very computationally intensive and on the other hand problematic with regard to the assessment of the results. When the singular values are about equally large, a wrong decision may be made, because the two similar singular values may actually represent the same segment class and not two different segment classes.
Furthermore, it has been found that the results obtained by the singular value decomposition become more and more problematic when there are strong differences in the similarity values, i.e. when a piece contains very similar portions, like stanza and refrain, but also relatively dissimilar portions, like intro, outro or bridge.
It is further problematic in the known method that it is always assumed that the cluster among the two clusters with the highest singular values, which has the first segment in the song, is the cluster “stanza” and that the other cluster is the cluster “refrain”. This procedure is based on assuming, in the known method, that a song always begins with a stanza. Experience has shown that significant labeling errors are obtained with this. This is problematic in so far as the labeling is, as it were, the “harvest” of the entire method, i.e. what the user gets to know immediately. Even if the preceding steps have been precise and intensive, everything becomes relative when at the end it is labeled wrongly, since then the trust of the user in the entire concept could suffer altogether.
At this point it is to be pointed out that there is a particular need for automatic music analysis methods whose results do not have to be examined and, if necessary, corrected in each case. Rather, a method is only employable in the market when it can run automatically without any human post-correction.
It is further disadvantageous in the known concept that the subsequent steps build upon the segmentation calculated by the singular value decomposition. In other words, both the clustering and the final labeling build upon the segmentation determined by singular value decomposition. In this way, however, the clustering and labeling, and thus also the music summary that is the actual product of the entire method for the listener, can never become better than the underlying segmentation.
If over-segmentation takes place, as often happens in particular with kernel-correlation-based concepts, far too many segment classes are to be expected in the end, which then have to be post-processed, if necessary, in order to completely remove spurious segment classes that do not actually correspond to any main part. This “post-repair” is unfavorable in that audio information is eliminated by it. When navigating through the audio piece on the basis of the segment classes already designated, a listener will then not be able to hear the entire audio information, since insignificant segments that do not actually correspond to any main part have been completely eliminated in this method.
Even more important, however, is the fact that an over-segmentation, which may also occur by other segmentation methods, points to the fact that the original primary segmentation was not correct. The segments, for example, of the segment class designated with “refrain” are then of different quality. A segment in which the segmentation was correct has a longer refrain, whereas another segment in which the segmentation was not correct has a shorter refrain. If the segmented representation of the audio piece is then worked with, this leads to synchronization problems and also to irritation of the user, which may even go so far that the user loses trust in the segmentation concept.