In music terminology, a downbeat is the first beat or impulse of a bar (also known as a measure). It frequently, although not always, carries the strongest accent of the rhythmic cycle. The downbeat is important for musicians as they play along to the music and to dancers when they follow the music with their movement.
There are a number of practical applications in which it is desirable to identify from a musical audio signal the temporal position of downbeats. Such applications include music recommendation applications in which music similar to a reference track is searched for, in Disk Jockey (DJ) applications where, for example, seamless beat-mixed transitions between songs in a playlist is required, and in automatic looping techniques.
A particularly useful application has been identified in the use of downbeats to help synchronise automatic video scene cuts to musically meaningful points. For example, where multiple video (with audio) clips are acquired from different sources relating to the same musical performance, it would be desirable to automatically join clips from the different sources and provide switches between the video clips in an aesthetically pleasing manner, resembling the way professional music videos are created. In this case it is advantageous to synchronize switches between video shots to musical downbeats.
The following terms are useful for understanding certain concepts to be described later.
Pitch: the physiological correlate of the fundamental frequency (f0) of a note.
Chroma, also known as pitch class: musical pitches separated by an integer number of octaves belong to a common pitch class. In Western music, twelve pitch classes are used.
Beat or tactus: the basic unit of time in music, it can be considered the rate at which most people would tap their foot on the floor when listening to a piece of music. The word is also used to denote part of the music belonging to a single beat.
Tempo: the rate of the beat or tactus pulse represented in units of beats per minute (BPM).
Bar or measure: a segment of time defined as a given number of beats of given duration. For example, in a music with a 4/4 time signature, each measure comprises four beats.
Downbeat: the first beat of a bar or measure.
Accent or Accent-based audio analysis: analysis of an audio signal to detect events and/or changes in music, including but not limited to the beginning of all discrete sound events, especially the onset of long pitched sounds, sudden changes in loudness of timbre, and harmonic changes. Further detail is given below.
Human perception of musical meter involves inferring a regular pattern of pulses from moments of musical stress, a.k.a. accents. Accents are caused by various events in the music, including the beginnings of all discrete sound events, especially the onsets of long pitched sounds, sudden changes in loudness or timbre, and harmonic changes. Automatic tempo, beat, or downbeat estimators may try to imitate the human perception of music meter to some extent, by measuring musical accentuation, estimating the periods and phases of the underlying pulses, and choosing the level corresponding to the tempo or some other metrical level of interest. Since accents relate to events in music, accent based audio analysis refers to the detection of events and/or changes in music. Such changes may relate to changes in the loudness, spectrum, and/or pitch content of the signal. As an example, accent based analysis may relate to detecting spectral change from the signal, calculating a novelty or an onset detection function from the signal, detecting discrete onsets from the signal, or detecting changes in pitch and/or harmonic content of the signal, for example, using chroma features. When performing the spectral change detection, various transforms or filterbank decompositions may be used, such as the Fast Fourier Transform or multirate filterbanks, or even fundamental frequency fo or pitch salience estimators. As a simple example, accent detection might be performed by calculating the short-time energy of the signal over a set of frequency bands in short frames over the signal, and then calculating difference, such as the Euclidean distance, between every two adjacent frames. To increase the robustness for various music types, many different accent signal analysis methods have been developed.
The system and method to be described hereafter draws on background knowledge described in the following publications which are incorporated herein by reference.    [1] Peeters and Papadopoulos, “Simultaneous Beat and Downbeat-Tracking Using a Probabilistic Framework: Theory and Large-Scale Evaluation”,“IEEE Trans. Audio, Speech and Language Processing, Vol. 19, No. 6, August 2011.    [2] Eronen, A. and Klapuri, A., “Music Tempo Estimation with k-NN regression,” IEEE Trans. Audio, Speech and Language Processing, Vol. 18, No. 1, January 2010.    [3] Seppänen, Eronen, Hiipakka. “Joint Beat & Tatum Tracking from Music Signals”, International Conference on Music Information Retrieval, ISMIR 2006 and Jarno Seppinen, Antti Eronen, Jarmo Hiipakka: Method, apparatus and computer program product for providing rhythm information from an audio signal. Nokia November 2009: U.S. Pat. No. 7,612,275.    [4] Antti Eronen and Timo Kosonen, “Creating and sharing variations of a music file”—United States Patent Application 20070261537.    [5] Klapuri, A., Eronen, A., Astola, J., “Analysis of the meter of acoustic musical signals,” IEEE Trans. Audio, Speech, and Language Processing, Vol. 14, No. 1, 2006.    [6] Jehan, Creating Music by Listening, PhD Thesis, MIT, 2005. http://web.media.mit.edu/˜tristan/phd/pdf/Tristan PhD MIT.pdf    [7] D. Ellis, “Beat Tracking by Dynamic Programming”, J. New Music Research, Special Issue on Beat and Tempo Extraction, vol. 36 no. 1, March 2007, pp. 51-60. (10pp) DOI: 10.1080/09298210701653344