The present invention relates to a method for extracting, from a given signal, e.g. a musical signal, a representation of its rhythmic structure. The invention concerns in particular a method of synthesizing sounds while performing signal analysis. In the present invention, the representation is designed so as to yield a similarity relation between item titles, e.g. music titles. Different music signals with “similar” rhythms will thus have “similar” representations. The invention finds application in the field of “Electronic Music Distribution” (EMD), in which similarity-based searching is typically effected on music catalogues. The latter are accessible via a search code, for instance, “find titles with similar rhythm”.
Musical feature extraction has traditionally been considered for short musical signals (e.g. extraction of pitch, fundamental frequency, spectral characteristics). For long musical signals, such as the one considered in the present invention (typically excerpts of popular music titles), some attempts have been made to extract beats or tempo.
Reference can be made to an article on “beat and tempo induction” obtainable through the internet at: http://steplianus2.socsci.kun.nl/mmm/papers/foot-tapping-bib.html
There further exists an article concerning a working tempo induction system having the reference: Scheirer, Eric D., “Tempo and Beat Analysis of Acoustic Musical Signals”, J. Acoust. Soc. Am., 103(1), pp 588-601, January 1998.
Finally, there exists a PCT patent application entitled “Multifeature Speech/Music Discrimination System”, having the filing number WO 9827543A2 with Scheirer, Eric D. and Slaney, Malcolm as cited inventors. Further information on this topic can be found through the internet at: (Extract of web page: http://sound.media.mit.edu/~eds/papers.html).
According to the system disclosed in the aforementioned PCT patent application, a speech/music discriminator employs data from multiple features of an audio signal as input to a classifier. Some of the feature data is determined from individual frames of the audio signal, and other input data is based upon variations of a feature over several frames, to distinguish the changes in voiced and unvoiced components of speech from the more constant characteristics of music. Several different types of classifiers for labelling test points on the basis of the feature data are disclosed. A preferred set of classifiers is based upon variations of a nearest-neighbour approach, including a K-d tree spatial partitioning technique.
However, higher level musical features have not yet been extracted using fully automatic approaches. Furthermore, the rhythmic structure of a title is difficult to define precisely independently of other musical dimensions such as timbre.
A technical area relating to the above field includes the Mpeg 7 audio community, which is currently drafting a report on “audio descriptors” to be included in the future Mpeg 7 standard. However, this draft is not accessible to the public at the filing date of the application. Mpeg 7 concentrates on “low level descriptors”, some of which may be considered in the context of the present invention (e.g. spectral centroid).
There exists an article on Mpeg 7 audio available through the internet at: http://www.iua.upf.es/~xserra/articles/cbmi99/cbmi99.html.
From the foregoing, it appears that there is a need for a method for automatically extracting an indication of the rhythmic structure, e.g. of a musical composition, reliably and efficiently.
To this end, the present invention proposes a method of extracting a rhythmic structure from a database including sounds, comprising at least the steps of
a) processing an input signal through an analysis technique, so as to select a rhythmic information contained in said input signal; and
b) synthesizing said sound while performing said analysis technique.
The above database may include percussive sounds.
Further, the processing step may comprise processing the input signal through a spectral analysis technique.
Typically, the step of sound synthesis comprises the steps of:
a) synthesizing a new percussive sound from time series of onset peaks and the input signal, and defining the new percussive sound, thereby enabling repeated iterative treatments;
b) performing the iterative treatments until the peak series cycle computed becomes the same as the preceding cycle; and
c) selecting two different time series after the input signal has been compared to all percussive sounds for peak extraction.
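The iterative treatment of steps a) and b) can be illustrated by the following minimal sketch. It assumes, purely for illustration, that onsets are detected as thresholded local maxima of a sliding correlation, and that the new percussive sound is synthesized by averaging the segments of the input signal that start at each onset. The names `find_onsets`, `synthesize` and `refine`, the threshold and the cycle limit are hypothetical and are not taken from the present description.

```python
def find_onsets(signal, template, threshold):
    """Positions where the sliding correlation with the template peaks."""
    m = len(template)
    corr = [sum(signal[i + j] * template[j] for j in range(m))
            for i in range(len(signal) - m + 1)]
    return [i for i, v in enumerate(corr)
            if v >= threshold
            and (i == 0 or v > corr[i - 1])
            and (i + 1 == len(corr) or v >= corr[i + 1])]


def synthesize(signal, onsets, length):
    """Assumed synthesis rule: average the signal segments at each onset."""
    return [sum(signal[o + j] for o in onsets) / len(onsets)
            for j in range(length)]


def refine(signal, template, threshold, max_cycles=20):
    """Repeat detection and synthesis until the onset series stops changing."""
    previous = None
    onsets = []
    for _ in range(max_cycles):
        onsets = find_onsets(signal, template, threshold)
        if not onsets or onsets == previous:
            break  # same peak series as the preceding cycle: converged
        template = synthesize(signal, onsets, len(template))
        previous = onsets
    return onsets, template


signal = [0, 0, 1, 3, 1, 0, 0, 0, 1, 3, 1, 0, 0]
onsets, sound = refine(signal, [1, 2, 1], threshold=5)
print(onsets, sound)
```

In this toy input the synthesized sound settles after one update, since the second cycle yields the same onset positions as the first.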
The method of the invention may also comprise the step of defining said rhythmic structure as time series, each of the time series representing a temporal contribution for one of the percussive sounds. Suitably, this defining step is performed prior to the processing step described above.
The above method may further comprise the steps of:
a) constructing the rhythmic structure of the input signal by combining a plurality of onset time series; and
b) reducing the rhythmic information contained in the plurality of time series, thereby extracting a reduced rhythmic information for an item. Suitably, the above rhythmic-structure constructing and rhythmic-information reducing steps are carried out subsequently to the sound-synthesizing step described above.
In the above method, the rhythmic structure may be given by a numeric representation for a given item of audio signal, and the percussive sounds in said database are given in an audio signal.
Preferably, the above defining step comprises defining the rhythmic structure as a superposition of time series, each of the time series representing a temporal contribution for one of the percussive sounds in an audio signal.
Suitably, the above constructing step comprises constructing the numeric representation of a rhythmic structure of the input signal by combining a plurality of onset time series.
Suitably yet, the above reducing step comprises reducing the rhythmic information contained in the plurality of time series by analyzing correlations products thereof, thereby extracting a reduced rhythmic information for an item of audio signal.
There is also provided a method of determining a similarity relation between items of audio signals by comparing their rhythmic structures, one of the items serving as a reference for comparison, comprising the steps of determining a rhythmic structure for each item of audio signal to be compared by carrying out the above-mentioned steps, and effecting a distance measure between the items of audio signal on the basis of a reduced rhythmic information, whereby an item of audio signal within a specified distance of a reference item in terms of a specified criterion is considered to have a similar rhythm.
The above method may further comprise the step of selecting an item of audio signal on the basis of its similarity to the reference audio signal.
Further, the defining step may comprise defining each of the time series as representing a temporal peak of a given percussive sound.
Further yet, the processing step may comprise the step of peak extraction effected on the input signal.
The step of peak extraction may comprise extracting the peaks by analyzing a signal as a harmonic sound and a noise.
The above-mentioned processing step may comprise the step of peak filtering.
Preferably, the step of peak filtering comprises extracting the onset time series representing occurrences of the percussive sounds in the audio signal, repeatedly until a given threshold is reached.
The step of peak filtering may further comprise comparing the audio signals to each of the percussive sounds contained in the database via a correlations analysis technique which computes correlation function values for an audio signal and a percussive sound.
Furthermore, the step of peak filtering may comprise assessing the quality of the peaks of the resulting time series, by filtering out the correlation function values under a given amplitude threshold, filtering out the peaks having an occurrence time under a given time threshold, and filtering out the peaks missing a given quality threshold, thereby producing onset time series having a peak position vector and a peak value vector.
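By way of illustration only, the correlation and filtering stages described above can be sketched as follows. The sketch assumes that the correlation function is a plain sliding dot product and that the time threshold acts as a minimum spacing between retained peaks. The quality threshold is not modelled, since its definition is not given here. The names `cross_correlation` and `filter_peaks` and the parameters `amp_threshold` and `min_gap` are hypothetical.

```python
def cross_correlation(signal, template):
    """Sliding dot product of the percussive sound against the signal."""
    m = len(template)
    return [sum(signal[i + j] * template[j] for j in range(m))
            for i in range(len(signal) - m + 1)]


def filter_peaks(corr, amp_threshold, min_gap):
    """Return (peak position vector, peak value vector).

    Keeps local maxima whose value reaches amp_threshold; when two peaks
    are closer than min_gap samples, only the stronger one is kept.
    """
    positions, values = [], []
    for i, v in enumerate(corr):
        if v < amp_threshold:
            continue  # amplitude filter
        left = corr[i - 1] if i > 0 else float("-inf")
        right = corr[i + 1] if i + 1 < len(corr) else float("-inf")
        if v < left or v < right:
            continue  # not a local maximum
        if positions and i - positions[-1] < min_gap:
            if v > values[-1]:
                positions[-1], values[-1] = i, v  # replace weaker neighbour
            continue
        positions.append(i)
        values.append(v)
    return positions, values


signal = [0, 0, 1, 2, 1, 0, 0, 0, 1, 2, 1, 0, 0]
corr = cross_correlation(signal, [1, 2, 1])
print(filter_peaks(corr, amp_threshold=5, min_gap=3))
```

The two returned lists correspond to the peak position vector and the peak value vector of one onset time series.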
In the inventive method, the above-mentioned processing step may comprise the step of correlations analysis.
Further, the step of correlations analysis may comprise the steps of formulating correlations products of time series, selecting a tempo value from the correlations products and scaling the tempo value.
In this method, the formulating step may comprise the steps of:
a) specifying, as input, two time series representing onset time series of two main percussive sounds in the signal;
b) providing, as an output, a set of numbers representing a reduction of the rhythmic information contained in the input series; and
c) computing the correlations products of the two time series.
Typically, the selecting step comprises selecting the tempo value representing a prominent period in the signal.
Further, the selecting step may comprise extracting a tempo value from the correlations products, whereby the prominent period is selected within a given range.
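A minimal sketch of the formulating and selecting steps is given below, under stated assumptions: the two onset time series are sampled on a common grid, a correlations product is taken as the sum of lagged products, and the prominent period is the lag, within a given range, at which the summed products are largest. The function names, the lag range and the example series are hypothetical.

```python
def correlation_product(x, y, max_lag):
    """C[lag] = sum over t of x[t] * y[t + lag], for lag = 0..max_lag."""
    n = min(len(x), len(y))
    return [sum(x[t] * y[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def select_tempo(products, min_lag, max_lag):
    """Lag in [min_lag, max_lag] with the largest summed correlation."""
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: sum(p[lag] for p in products))


# Toy onset series: one hit every 4 samples, snare offset by 2 samples.
bass = [1, 0, 0, 0] * 4
snare = [0, 0, 1, 0] * 4

bass_auto = correlation_product(bass, bass, max_lag=8)
snare_auto = correlation_product(snare, snare, max_lag=8)
cross = correlation_product(bass, snare, max_lag=8)

tempo_lag = select_tempo([bass_auto, snare_auto], min_lag=2, max_lag=8)
print(tempo_lag)
```

Restricting the search to a lag range excludes the trivial maximum at lag zero and, in this sketch, favours the shortest of several equally strong periods.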
In the above inventive method, the scaling step may comprise the steps of:
a) scaling the time series according to the tempo value and the value in amplitude, thereby yielding a new set of normalized time series; and
b) trimming and/or reducing the correlations products, thereby retaining the values for each of the normalized correlation products contained in a given range.
Likewise, the scaling step may comprise scaling the time series through the correlations products.
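One possible reading of the scaling and trimming steps is sketched below. It assumes that scaling by the tempo value means resampling the lag axis so that one tempo period always spans the same number of points, that scaling in amplitude means dividing by the largest absolute value, and that trimming means keeping a fixed number of periods. The function `normalize_product` and its parameters `points_per_beat` and `beats` are hypothetical.

```python
def normalize_product(product, tempo_lag, points_per_beat=4, beats=2):
    """Resample a correlations product onto a tempo-relative lag axis,
    scale it to unit peak amplitude, and trim it to `beats` periods."""
    peak = max(abs(v) for v in product) or 1.0  # avoid division by zero
    out = []
    for i in range(points_per_beat * beats + 1):
        lag = round(i * tempo_lag / points_per_beat)
        if lag >= len(product):
            break  # requested range exceeds the available lags
        out.append(product[lag] / peak)
    return out


# The same pattern at two tempi: period of 4 samples and period of 8 samples.
slow = [4, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 2]
fast = [4, 0, 0, 0, 3, 0, 0, 0, 2]

print(normalize_product(fast, tempo_lag=4))
print(normalize_product(slow, tempo_lag=8))
```

Under these assumptions the two normalized products coincide, so that rhythms differing only in tempo or loudness receive the same representation.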
Preferably, the step of effecting a distance measure comprises computing the two items of audio signal on the basis of an internal representation of the rhythm for each item of audio signal, thereby reducing the data computed from the correlations products to simple numbers.
The above step of effecting a distance measure may also comprise constructing the internal representation of the rhythm as follows:
a) computing a representation of the morphology for each of the time series as a set of coefficients respectively representing the contribution in the time series of a filter; and
b) applying each filter to a time series, thereby yielding given numbers for representing the rhythm.
Furthermore, the step of effecting a distance measure may comprise representing each signal by the given numbers representing the rhythm, and performing said distance measure between two signals.
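The internal representation and the distance measure can be sketched as follows. The sketch assumes that each filter is a weight vector applied to a normalized series by a dot product, and that the distance is Euclidean. Neither choice is stated in the present description, and the filters, the example values and the similarity threshold are hypothetical.

```python
import math


def rhythm_coefficients(series_list, filters):
    """Apply every filter to every series; each dot product is one number."""
    return [sum(w * v for w, v in zip(f, s))
            for s in series_list
            for f in filters]


def rhythm_distance(a, b):
    """Euclidean distance between two rhythm representations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical filters: energy on the beat, and energy on the half beat.
filters = [[1, 0, 0, 0], [0, 0, 1, 0]]

reference = rhythm_coefficients([[1.0, 0.0, 0.5, 0.0]], filters)
candidate = rhythm_coefficients([[1.0, 0.0, 0.0, 0.0]], filters)

d = rhythm_distance(reference, candidate)
print(d, d <= 0.6)  # the candidate counts as similar if within the threshold
```

Each item of audio signal is thus reduced to a short list of numbers, and the similarity search amounts to retaining the items whose distance to the reference falls within the specified threshold.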
In the method described above, the item of audio signal may comprise a music title, and the audio signal may comprise a musical audio signal.
Further, the percussive sounds contained in the database may comprise audio signals produced by percussive instruments.
Further yet, the two input series may respectively represent a bass drum sound and a snare sound.
According to the present invention, there is also provided a system programmed to implement the method described above, comprising a general-purpose computer and peripheral apparatuses thereof.
There is further provided a computer program product loadable into the internal memory unit of a general-purpose computer, comprising a software code unit for carrying out the steps of the inventive method described above, when said computer program product is run on a computer.