1. Field of the Invention
The present invention relates to audio data processing and, in particular, to metadata suitable for identifying an audio piece using a description of the audio piece in the form of rhythmic pattern.
2. Description of Prior Art
Stimulated by the ever-growing availability of musical material to the user via new media and content distribution methods, an increasing need to automatically categorize audio data has emerged. Descriptive information about audio data which is delivered together with the actual content represents one way to facilitate this task immensely. The purpose of so-called metadata (“data about data”) is to for example detect the genre of a song, to specify music similarity, to perform music segmentation on the song or to simply recognize a song by scanning a data base for similar metadata. Stated in general, metadata are used to determine a relation between test pieces of music having associated test metadata and one or more reference pieces of music having corresponding reference metadata.
One way to achieve these aims using features that belong to a lower semantic hierarchy order is described in “Content-based-identification of audio material using MPEG-7 low level description”, Allamanche, E., Herre, J., Helmuth, O., Proceedings of the second annual symposium on music information retrieval, Bloomington, USA, 2001.
The MPEG-7 standard is an example for a metadata standard which has been published in recent years in order to fulfill requirements raised by the increasing availability of multimedia content and the resulting issue of sorting and retrieving this content. The ISO/IEC MPEG-7 standard takes a very broad approach towards the definition of metadata. Herein, not only hand-annotated textual information can be transported and stored but also more signal specific data that can in most cases be automatically retrieved from the multimedia content itself.
While some people are interested in an algorithm for the automated transcription of rhythmic (percussive) accompaniment in modern day popular music, others try to capture the “rhythmic gist” of a piece of music rather than a precise transcription, in order to allow a more abstract comparison of musical pieces by their dominant rhythmic patterns. Nevertheless, one is not only interested in rhythmic patterns of percussive instruments, which do not have their main focus on playing certain notes but generating a certain rhythm, but also the rhythmic information provided by so-called harmonic sustained instruments such as a piano, a flute, a clarinet, etc. can be of significant importance for the rhythmic gist of a piece of music.
Contrary to low-level tools, which can be extracted directly from the signal itself in a computationally efficient manner, but which carry little meaning for the human listener, the usage of high-level semantic information relates to the human perception of music and is, therefore, more intuitive and more appropriate for the task to model what happens when a human listener recognizes a piece of music or not.
It has been found out that the rhythmic elements of music, determined by the drum and percussive instruments, play an important role especially in contemporary popular music. Therefore, the performance of advanced music retrieval applications will benefit from using mechanisms that allow the search for rhythmic styles, particular rhythmic features or generally rhythmic patterns when finding out a relation between a test rhythmic pattern and one or more reference rhythmic patterns which are, for example, stored in a rhythmic pattern data base.
The first version of MPEG-7 audio (ISO-IEC 15938-4) does not, however, cover high-level features in a significant way. Therefore, the standardization committee agreed to extend this part of the standard. The work contributing high-level tools is currently being assembled in MPEG-7 audio amendment 2 (ISO-IEC 15938-4 AMD2). One of its features is “rhythmicpatternsDS”. The internal structure of its representation depends on the underlying rhythmic structure of the considered pattern.
There are several possibilities to obtain a state of the art rhythmic pattern. One way is to start from the time-domain PCM representation of a piece of music such as a file, which is stored on a compact disk, or which is generated by an audio decoder working in accordance with the well-known MP3 algorithm (MPEG 1 layer 3) or advanced audio algorithms such as MPEG 4 AAC. In accordance with this method described in “Further steps towards drum transcription of polyphonic music”, Dittmar, C., Uhle, C., Proceedings of the AES 116th Convention, Berlin, Germany, 2004, a classification between un-pitched classic instruments and harmonic-sustained instruments is performed. The detection and classification of percussive events is carried out using a spectrogram-representation of the audio signal. Differentiation and half-way rectification of this spectrogram-representation result in a non-negative difference spectrogram, from which the times of occurrence and the spectral slices related to percussive events are deduced.
Then, the well-known Principle Component Analysis (PCA) is applied. When one obtains principle components, which are subjected to a Non-Negative Independent Component Analysis (NNICA), as described in “Algorithms for non-negative independent component analysis”, Plumbley, M., Proceedings of the IEEE Transactions on Neuronal Networks, 14 (3), pages 534–543, 2003, which attempts to optimize a cost function describing the non-negativity of the components.
The spectral characteristics of un-pitched percussive instruments, especially the invariance of a spectrum of different notes compared to pitched instruments allows separation using an un-mixing matrix to obtain spectral profiles, which can be used to extract the spectrogram's amplitude basis, which is also termed as the “amplitude envelopes”. This procedure is closely related to the principle of Prior Sub-space Analysis (PSA), as described in “Prior sub-space analysis for drum transcription”, Fitzgerald, D., Lawlor, B. and Coyle, E. Proceedings of the 114th AES Convention, Amsterdam, Netherlands, 2003.
Then, the extracted components are classified using a set of spectral-based and time-based features. The classification provides two sources of information. Firstly, components should be excluded from the rest of the processing, which are clearly harmonically sustained. Secondly, the remaining dissonant percussive components should be assigned to pre-defined instrument classes. A suitable measure for the distinction of the amplitude envelopes is represented by the percussiveness, which is introduced in “Extraction of drum tracks from polyphone music using independent sub-space analysis”, Uhle, C., Dittmar, C., and Sporer, T., Proceedings of the Fourth International Symposium on Independent Component Analysis, Nara, Japan, 2003.
The assignment of spectral profiles to a priori trained classes of percussive instruments is provided by a k-nearest neighbor classifier with spectral profiles of single instruments from a training database. To verify the classification in cases of low reliability or several occurrences of the same instrument, additional features describing the shape of the spectral profile, e.g. centroid, spread and tunes are extracted. Other features are the center frequencies of the most prominent local peaks, their intensities, spreads and skewnesses.
Onsets are detected in the amplitude envelopes using conventional peak picking methods. The intensity of the on-set candidate is estimated from the magnitude of the envelope signal. Onsets with intensities exceeding a predetermined dynamic threshold are accepted. This procedure reduces cross-talk influences of harmonic sustained instruments as well as concurrent percussive instruments.
For extracting drum patterns, the audio signal is segmented into similar and characteristic regions using a self-similarity method initially proposed by Foote, J., “Automatic Audio Segmentation using a Measure of Audio Novelty”, Proceeding of the IEEE International Conference on Multimedia and Expo, vol. 1, pages 452–455, 2000. The segmentation is motivated by the assumption that within each region not more than one representative drum pattern occurs, and that the rhythmic features are nearly invariant.
Subsequently, the temporal positions of the events are quantized on a tatum grid. The tatum grid describes a pulse series on the lowest metric level. Tatum period and tatum phase are computed by means of a two-way mismatch error procedure, as described in “Pulse-dependent analysis of percussive music”, Gouyon, F., Herrera, P., Cano, P., Proceedings of the AES 22nd International Conference on Virtual, Synthetic and Entertainment Audio, 2002.
Then, the pattern length or bar length is estimated by searching for the prominent periodicity in the quantized score with periods equaling an integer multiple of the bar length. A periodicity function is obtained by calculating a similarity measure between the signal and its time-shifted version. The similarity between the two score representations is calculated as a weighted sum of the number of simultaneously occurring notes and rests in the score. An estimate of the bar length is obtained by comparing the derived periodicity function to a number of so-called metric models, each of them corresponding to a bar length. A metric model is defined here as a vector describing the degree of periodicity per integer multiple of the tatum period, and is illustrated as a number of pulses, where the height of the pulse corresponds to the degree of periodicity. The best match between the periodicity function derived from the input data and predefined metric models is computed by means of their correlation coefficient.
The term tatum period is also related to the term “microtime”. The tatum period is the period of a grid, i.e., the tatum grid, which is dimensioned such that each stroke in a bar can be positioned on a grid position. When, for example, one considers a bar having a 4/4 meter, this means that the bar has 4 main strokes. When the bar only has main strokes, this means that the tatum period is the time period between two main strokes. In this case, the microtime, i.e., the metric division of this bar is 1, since one only has main strokes in the bar. When, however, the bar has exactly one additional stroke between two main strokes, the microtime is two and the tatum period is the half of the period between two main strokes. In the 4/4 example, the bar, therefore, has 8 grid positions, while in the first example, the bar only has 4 grid positions.
When there are two strokes between two main strokes, the microtime is 3, and the tatum period is ⅓ of the time period between two main strokes. In this case, the grid describing one bar has 12 grid positions.
The above-described automatic rhythmic pattern extraction method results in a rhythmic pattern as shown in FIG. 2a. FIG. 2a shows one bar having a meter of 4/4, a microtime equal to 2 and a resulting size or pattern length of 4 by 2 equals 8.
A machine-readable description of this bar would result in line 20a showing a grid position from one to eight, and line 20b showing velocity values for each grid position. For the purpose of better understanding, FIG. 2a also includes a line 20c showing the main strokes 1, 2, 3, and 4 corresponding to the 4/4 meter and showing additional strokes 1+, 2+, 3+, and 4+ at grid positions 2, 4, 6, and 8.
As it is known in the art, the term “velocity” indicates an intensity value of an instrument at a certain grid position or main stroke or additional stroke (part of the bar), wherein, in the present example, a high velocity value indicates a high sound level, while a low velocity value indicates a low sound level. It is clear that the term velocity can therefore, be attributed to harmonic-sustained instruments as well as un-pitched instruments (drum or percussion). In the case of a drum, the term “velocity” would describe a measure of the velocity, a drumstick has, when hitting the drum.
In the FIG. 2a example, it becomes clear that the drum is hit at grid positions 1, 3, 5, 6, 7 with different velocities while the drummer does not hit or kick the drum at grid positions 2, 4, 8.
It should be noted here that the FIG. 2a prior art rhythmic pattern cannot only be obtained by an automatic drum description algorithm as described above, but is also similar to the well-known MIDI description of instruments as used for music synthesizers such as electronic keyboards, etc.
The FIG. 2a rhythmic pattern uniquely describes the rhythmic situation of an instrument. This rhythmic pattern is in-line with the way of playing the rhythmic pattern, i.e., a note to be played later in time is positioned after a note to be played earlier in time. This also becomes clear from the grid position index starting at a low value (1) and, after having monotonically increased, ending at a high value (8).
Unfortunately, this rhythmic pattern, although uniquely and optimally giving information to a player of the bar is not suited for an efficient data base retrieval. This is due to the fact that the pattern is quite long, and, therefore memory-consuming. Additionally and importantly, the important and not so important information in the rhythmic pattern of FIG. 2a is very well distributed over the pattern.
In the case of a 4/4 meter, it is clear that the highest importance in the beats, which are parts 1 and 3 of the bar at grid positions 1 and 5, while the second-order importance is in the so-called “off-beats” which occur at parts of the bar 2 and 4 at grid positions 3 and 7. The third-class importance is in the additional strokes between the beats/off-beats, i.e., at grid positions 2, 4, 6, and 8.
A search engine in a database using test and reference rhythmic patterns as shown in FIG. 2a, therefore has to compare the complete test rhythmic pattern to the complete reference rhythmic patterns to finally find out a relation between the test rhythmic pattern and the reference rhythmic patterns. When one bears in mind that the rhythmic pattern in FIG. 2a only describes a single bar of a piece of music, which can, for example, have 2000 bars, and when one bears in mind that the number of pieces of music in a reference data base is to be as large as possible to cover as many as possible pieces of music, one can see that the size of the data base storage can be explode to a value of the number of pieces of music multiplied by the number of bars per piece of music multiplied by the number of bits for representing a single bar rhythmic pattern.
While the storage might not be a large issue for personal computers, it can raise the size and costs of portable music processors such as music players. Additionally, the size of the rhythmic pattern in FIG. 2a becomes even more important when one tries to have a reasonable time frame for the search engine correlating a test rhythmic pattern to the reference rhythmic patterns. In case of high-end work stations having nearly unlimited computational resources, the FIG. 2a rhythmic pattern might not be too problematic. The situation, however, becomes critical, when one has limited computational resources such as in personal computers or once again, portable players, whose price has to be significantly lower than the price of a personal computer, when such an audio retrieval system is to survive on the highly competitive marketplace.