Music can include many different audio characteristics, such as beats, downbeats, chords, melodies and timbre. There are a number of practical applications for which it is desirable to identify these audio characteristics from a musical audio signal. Such applications include music recommendation applications, in which music similar to a reference track is searched for; Disc Jockey (DJ) applications, in which, for example, seamless beat-mixed transitions between songs in a playlist are required; and automatic looping techniques.
A particularly useful application has been identified in the use of downbeats to help synchronize automatic video scene cuts to musically meaningful points. For example, where multiple video clips (with audio) relating to the same musical performance are acquired from different sources, it would be desirable to automatically join clips from the different sources and provide switches between the video clips in an aesthetically pleasing manner, resembling the way professional music videos are created. In this case it is advantageous to synchronize switches between video shots to musical downbeats.
The following terms may be useful for understanding certain concepts to be described later.
    Pitch: the perceptual correlate of the fundamental frequency (f0) of a note.
    Chroma: musical pitches separated by an integer number of octaves belong to a common chroma (also known as pitch class). In Western music, twelve pitch classes are used.
    Beat: the basic unit of time in music; it can be considered the rate at which most people would tap their foot on the floor when listening to a piece of music. The word is also used to denote the part of the music belonging to a single beat. A beat is sometimes also referred to as a tactus.
    Tempo: the rate of the beat or tactus pulse, represented in units of beats per minute (BPM). The inverse of tempo is sometimes referred to as the beat period.
    Bar: a segment of time defined as a given number of beats of given duration. For example, in music with a 4/4 time signature, each bar (or measure) comprises four beats.
    Downbeat: the first beat of a bar or measure.
    Reverberation: the persistence of sound in a particular space after the original sound is produced.
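The relationships between some of these terms can be illustrated with a minimal sketch. The function names are hypothetical, and the use of MIDI note numbers (one integer per semitone) to represent pitches is an assumption made purely for illustration; it does not appear in the description above.

```python
def beat_period_seconds(tempo_bpm):
    """Beat period as the inverse of tempo, converted from minutes to seconds."""
    return 60.0 / tempo_bpm


def pitch_class(midi_note):
    """Chroma (pitch class) index in 0..11.

    Assumes MIDI-style numbering, where an octave spans 12 semitones, so
    notes an integer number of octaves apart share the same remainder.
    """
    return midi_note % 12


def downbeat_indices(n_beats, beats_per_bar=4, first_downbeat=0):
    """Indices of the beats that start a bar, given a fixed bar length."""
    return list(range(first_downbeat, n_beats, beats_per_bar))
```

For instance, a tempo of 120 BPM corresponds to a beat period of half a second, and in a 4/4 time signature every fourth beat is a downbeat.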
Human perception of musical meter involves inferring a regular pattern of pulses from moments of musical stress, also known as accents. Accents are caused by various events in the music, including the beginnings of all discrete sound events, especially the onsets of long pitched sounds, sudden changes in loudness or timbre, and harmonic changes. Automatic tempo, beat, or downbeat estimators may try to imitate the human perception of musical meter to some extent, by measuring musical accentuation, estimating the periods and phases of the underlying pulses, and choosing the level corresponding to the tempo or some other metrical level of interest. Since accents relate to events in music, accent-based audio analysis refers to the detection of events and/or changes in music. Such changes may relate to changes in the loudness, spectrum, and/or pitch content of the signal. As an example, accent-based analysis may relate to detecting spectral change from the signal, calculating a novelty or an onset detection function from the signal, detecting discrete onsets from the signal, or detecting changes in the pitch and/or harmonic content of the signal, for example, using chroma features. When performing the spectral change detection, various transforms or filter bank decompositions may be used, such as the Fast Fourier Transform or multi-rate filter banks, or even fundamental frequency (f0) or pitch salience estimators.
As a simple example, accent detection might be performed by calculating the short-time energy of the signal in a set of frequency bands over short frames, and then calculating a difference, such as the Euclidean distance, between each pair of adjacent frames. To increase robustness across various music types, many different accent signal analysis methods have been developed.
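The simple accent detector outlined above might be sketched as follows. This is an illustrative sketch only, not any particular published method: the frame length, hop size, number of bands, the use of equal-width bands, and the naive discrete Fourier transform are all assumptions chosen to keep the example short and dependency-free, and the function names are hypothetical.

```python
import cmath
import math


def band_energies(frame, n_bands):
    """Energy of one frame in n_bands equal-width frequency bands.

    Uses a naive DFT over the non-negative frequency bins; a practical
    implementation would use an FFT or a filter bank instead.
    """
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    width = half // n_bands
    return [sum(power[b * width:(b + 1) * width]) for b in range(n_bands)]


def accent_signal(samples, frame_len=128, hop=64, n_bands=4):
    """Euclidean distance between band-energy vectors of adjacent frames."""
    starts = range(0, len(samples) - frame_len + 1, hop)
    energies = [band_energies(samples[s:s + frame_len], n_bands)
                for s in starts]
    return [
        math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, cur)))
        for prev, cur in zip(energies, energies[1:])
    ]


# Usage: silence followed by a steady tone. The accent signal is zero
# during the silence, rises around the onset, and falls back to nearly
# zero once the tone is steady.
signal = [0.0] * 512 + [math.sin(2 * math.pi * n / 16) for n in range(512)]
accents = accent_signal(signal)
```

Because the distance is computed on energies rather than on the waveform, a steady tone produces almost no accent once it has started; only the change at its onset registers.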
Reverberation is a natural phenomenon that occurs when a sound is produced in an enclosed space, for example, when a band is playing in a large room with hard walls. A large number of echoes build up and then slowly decay as the walls and air absorb the sound. Rooms intended for music playback are usually designed to have desired reverberation characteristics. A certain amount and type of reverberation makes music listening pleasing and is desirable in a concert hall, for example. However, if the reverberation is very heavy, for example, in a room which is not designed for acoustic behaviour or where the acoustic design has not been successful, music may sound smeared and unpleasant. Even the intelligibility of speech may be decreased in this kind of situation. Furthermore, reverberation decreases the accuracy of automatic music analysis algorithms such as onset detection. To improve the situation, dereverberation methods have been developed. These methods process the audio signal containing reverberation and try to cancel the reverberation effect to recover the quality of the audio signal.
The system and method to be described hereafter draw on background knowledge described in the following publications, which are incorporated herein by reference.
    [1] Furuya, K. and Kataoka, A., “Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction”, IEEE Trans. on Audio, Speech, and Language Processing, Vol. 15, No. 5, July 2007.
    [2] Virtanen, T., “Audio signal modeling with sinusoids plus noise”, MSc Thesis, Tampere University of Technology, 2001. (http://www.cs.tut.fi/sgn/arg/music/tuomasv/MScThesis.pdf)
    [3] Tsilfidis, A. and Mourjopoulos, J., “Blind single-channel suppression of late reverberation based on perceptual reverberation modeling”, Journal of the Acoustical Society of America, Vol. 129, No. 3, 2011.
    [4] Ellis, D. P. W., “Beat Tracking by Dynamic Programming”, Journal of New Music Research, Vol. 36, No. 1, pp. 51-60, 2007. (http://www.ee.columbia.edu/~dpwe/pubs/Ellis07-beattrack.pdf)
    [5] Seppänen, J., Eronen, A. and Hiipakka, J. (Nokia Corporation), U.S. Pat. No. 7,612,275, “Method, apparatus and computer program product for providing rhythm information from an audio signal” (11 Nov. 2009).
    [6] Eronen, A. J. and Klapuri, A. P., “Music Tempo Estimation with k-NN regression”, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, No. 1, pp. 50-57, 2010.
    [7] U.S. Pat. No. 8,265,290 (Honda Motor Co Ltd), “Dereverberation System and Dereverberation Method”.
    [8] Yasuraoka, Yoshioka, Nakatani, Nakamura and Okuno, “Music dereverberation using harmonic structure source model and Wiener filter”, Proceedings of ICASSP 2010.
    [9] Klapuri, A., “Multiple fundamental frequency estimation by summing harmonic amplitudes”, in Proc. 7th Int. Conf. Music Inf. Retrieval (ISMIR-06), Victoria, Canada, 2006.
    [10] Scheirer, E. and Slaney, M., “Construction and evaluation of a robust multifeature speech/music discriminator”, Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, ICASSP-97, Vol. 2, pp. 1331-1334, 1997.