The proliferation of electronic media, particularly audio data available to an end user, has created a need to classify the audio data across multiple characteristics. In order for the end user to properly categorize, access, and use the audio data, there is a need to classify the audio data by tempo as it would be perceived by most listeners.
In musical terminology, tempo is a descriptive audio parameter measuring the speed or pace of an audio recording. One way of measuring the speed of an audio recording is to calculate the number of beats per unit of time (e.g. beats per minute or BPM).
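As a minimal illustration of the beats-per-minute measure described above, the following sketch converts a list of beat timestamps into a BPM value by averaging the time between consecutive beats. The function name `beats_per_minute` and the sample timestamps are hypothetical and chosen only for illustration; the sketch assumes the beat times are already known and sorted.

```python
def beats_per_minute(beat_times_s):
    """Return the tempo in BPM implied by beat timestamps given in seconds.

    Assumes the timestamps are sorted in ascending order (an assumption of
    this sketch, not a requirement stated in the text).
    """
    if len(beat_times_s) < 2:
        raise ValueError("need at least two beat times")
    # Time between each pair of consecutive beats.
    intervals = [b - a for a, b in zip(beat_times_s, beat_times_s[1:])]
    mean_interval = sum(intervals) / len(intervals)
    if mean_interval <= 0:
        raise ValueError("beat times must be increasing")
    # 60 seconds per minute divided by seconds per beat gives beats per minute.
    return 60.0 / mean_interval


# Hypothetical example: one beat every half second.
example_beats = [0.0, 0.5, 1.0, 1.5, 2.0]
print(beats_per_minute(example_beats))  # 120.0
```

Averaging the intervals is only one possible choice; a median would be less sensitive to an occasional missed or spurious beat.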
Most people are able to distinguish between a slow and a fast song. Many people may also possess the ability to perceive a beat within an audio recording without any formal training or study. Those who are able to perceive a beat may display this ability by tapping a foot, clapping hands, or dancing in synchronization with the beat. Most audio recordings contain more than one detectable beat rate. These rhythmic beats may be polyphonically created, meaning the beats are produced by more than one instrument or source. A person may have the ability to decipher more than one beat rate from the same audio recording and may be able to parse one musical instrument's beat from another's, and even possibly hear a back, down, or off beat. For example, a person may snap fingers to the beat of a snare drum, tap a foot to a bass drum, and slap a knee to a hi-hat of an audio recording, and all of these beats may be properly detected in the manner in which a person would perceive them.
Although an audio recording may have multiple beats and the pace of these beats may dynamically change throughout the recording, there generally exists one prominent, perceivable thematic tempo of an audio recording. Once determined, tempo can be a useful classification characteristic of an audio recording with a variety of applications.
Automatically determining the tempo of an audio recording can prove to be a challenging endeavour given the plurality of beats produced by a variety of sources in the recording, the dynamic nature of one specific beat at any given moment in the recording, and the requirement to efficiently deliver the tempo to an application for the intended purpose.
Conventional tempo estimation algorithms generally work by detecting significant audio events and finding periodicities of repetitive patterns in an audio signal by analyzing low-level features to estimate tempo. These estimation algorithms may estimate tempo through some combination of, or variation on: onset/event detection in the time domain; sub-band signal filtering, such as the onset of notes; or a change in either frequency or the rate of change of the spectral energy. For example, repetitive patterns in intensity levels of a bass or snare drum from a piece of audio may be detected by use of these algorithms to provide a tempo estimation. However, many of these algorithms suffer from “octave errors,” in which certain instrumentation causes a false detection of either double or half the tempo as it would be perceived by most listeners. Therefore, at times, many of these algorithms may not accurately detect the most prominent perceived tempo of an audio recording.
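The periodicity-finding step and the octave error described above can be sketched with a simple autocorrelation over an onset-strength envelope. This is an illustrative sketch only, not any particular published algorithm: the function names, the 100 frames-per-second envelope rate, the 40 to 240 BPM search range, and the synthetic envelopes are all assumptions made for the example. The second envelope assumes, hypothetically, a pulse that listeners would tap at 120 BPM but in which every other onset is much stronger, as might happen with an accented bass drum.

```python
def autocorrelation_tempo(envelope, frames_per_second, min_bpm=40, max_bpm=240):
    """Pick the tempo whose beat period gives the largest autocorrelation.

    `envelope` is a list of onset strengths, one value per analysis frame.
    """
    # Convert the BPM search range into a range of lags measured in frames.
    min_lag = int(round(60.0 * frames_per_second / max_bpm))
    max_lag = int(round(60.0 * frames_per_second / min_bpm))
    best_lag, best_score = None, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the envelope at this lag.
        score = sum(envelope[i] * envelope[i + lag]
                    for i in range(len(envelope) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return 60.0 * frames_per_second / best_lag


def synthetic_envelope(beat_strengths, frames_per_beat, n_beats):
    """Build an impulse train whose onset strengths cycle through beat_strengths."""
    env = [0.0] * (frames_per_beat * n_beats)
    for k in range(n_beats):
        env[k * frames_per_beat] = beat_strengths[k % len(beat_strengths)]
    return env


FPS = 100  # assumed envelope frame rate
# Onsets every 0.5 s (120 BPM), all equally strong.
even = synthetic_envelope([1.0], 50, 20)
# Same onset times, but every second onset is much weaker.
accented = synthetic_envelope([1.0, 0.3], 50, 20)

print(autocorrelation_tempo(even, FPS))      # 120.0
print(autocorrelation_tempo(accented, FPS))  # 60.0 (half-tempo octave error)
```

In the accented case the strong onsets line up with themselves at a lag of two beats, so the two-beat period scores higher than the one-beat period and the estimate falls to half of the assumed tapped tempo.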