Portable handheld devices, e.g. PDAs, smart phones, mobile phones, and portable media players, typically comprise audio and/or video rendering capabilities and have become important entertainment platforms. This development is driven by the growing penetration of wireless or wireline transmission capabilities into such devices. Due to the support of media transmission and/or storage formats, such as the HE-AAC format, media content can be continuously downloaded to and stored on the portable handheld devices, thereby providing a virtually unlimited amount of media content.
However, low complexity algorithms are crucial for mobile/handheld devices, since limited computational power and limited energy budgets are critical constraints. These constraints are even more critical for low-end portable devices in emerging markets. In view of the large number of media files available on typical portable electronic devices, MIR (Music Information Retrieval) applications are desirable tools in order to cluster or classify the media files and thereby allow a user of the portable electronic device to identify an appropriate media file, e.g. an audio, music and/or video file. Low complexity calculation schemes for such MIR applications are desirable, as otherwise their usability on portable electronic devices having limited computational and power resources would be compromised.
Musical tempo is an important musical feature for various MIR applications, such as genre and mood classification, music summarization, audio thumbnailing, automatic playlist generation and music recommendation systems using music similarity. Thus, a procedure for tempo determination having low computational complexity would contribute to the development of decentralized implementations of the mentioned MIR applications for mobile devices.
Furthermore, while it is common to characterize music tempo by a notated tempo on sheet music or a musical score in BPM (Beats Per Minute), this value often does not correspond to the perceptual tempo. For instance, if a group of listeners (including skilled musicians) is asked to annotate the tempo of music excerpts, they typically give different answers, i.e. they typically tap at different metrical levels. For some excerpts of music the perceived tempo is less ambiguous and all the listeners typically tap at the same metrical level, but for other excerpts of music the tempo can be ambiguous and different listeners identify different tempos. In other words, perceptual experiments have shown that the perceived tempo may differ from the notated tempo. A piece of music can feel faster or slower than its notated tempo in that the dominant perceived pulse can be a metrical level higher or lower than the notated tempo. In view of the fact that MIR applications should preferably take into account the tempo most likely to be perceived by a user, an automatic tempo extractor should predict the most perceptually salient tempo of an audio signal.
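By way of illustration only, the relation between a notated tempo and the tempos at neighbouring metrical levels may be sketched as follows. The function name `metrical_level_candidates` and the assumption that adjacent metrical levels differ by a factor of 2 or 3 are introduced here for illustration and are not taken from the present document.

```python
def metrical_level_candidates(notated_bpm, ratios=(2, 3)):
    """Return the notated tempo together with tempos one metrical level
    below and above it.

    Assumption (illustrative only): adjacent metrical levels are related
    by the integer ratios given in `ratios`.
    """
    candidates = {float(notated_bpm)}
    for ratio in ratios:
        candidates.add(notated_bpm / ratio)   # one level lower (slower pulse)
        candidates.add(notated_bpm * ratio)   # one level higher (faster pulse)
    return sorted(candidates)


# Candidate tapping rates for an excerpt notated at 120 BPM.
print(metrical_level_candidates(120.0))
```

Under these assumptions, listeners tapping at any of the listed rates would all be following the same excerpt, merely at different metrical levels.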
Known tempo estimation methods and systems have various drawbacks. In many cases they are limited to particular audio codecs, e.g. MP3, and cannot be applied to audio tracks which are encoded with other codecs. Furthermore, such tempo estimation methods typically only work properly when applied to Western popular music having simple and clear rhythmical structures. In addition, the known tempo estimation methods do not take into account perceptual aspects, i.e. they are not directed at estimating the tempo which is most likely perceived by a listener. Finally, known tempo estimation schemes typically work in only one of an uncompressed PCM domain, a transform domain or a compressed domain.
It is desirable to provide tempo estimation methods and systems which overcome the above-mentioned shortcomings of known tempo estimation schemes. In particular, it is desirable to provide tempo estimation which is codec-agnostic and/or applicable to any kind of musical genre. In addition, it is desirable to provide a tempo estimation scheme which estimates the perceptually most salient tempo of an audio signal. Furthermore, a tempo estimation scheme is desirable which is applicable to audio signals in any of the above-mentioned domains, i.e. in the uncompressed PCM domain, the transform domain and the compressed domain. It is also desirable to provide tempo estimation schemes with low computational complexity.
The tempo estimation schemes may be used in various applications. Since tempo is fundamental semantic information in music, a reliable estimate of such tempo will enhance the performance of other MIR applications, such as automatic content-based genre classification, mood classification, music similarity, audio thumbnailing and music summarization. Furthermore, a reliable estimate for perceptual tempo is a useful statistic for music selection, comparison, mixing, and playlisting. Notably, for an automatic playlist generator or a music navigator or a DJ apparatus, the perceptual tempo or feel is typically more relevant than the notated or physical tempo. In addition, a reliable estimate for perceptual tempo may be useful for gaming applications. By way of example, the soundtrack tempo could be used to control relevant game parameters, such as the speed of the game, or vice versa. This can be used for personalizing the game content using audio and for providing users with an enhanced experience. A further application field could be content-based audio/video synchronization, where the musical beat or tempo is a primary information source used as the anchor for timing events.
It should be noted that in the present document the term “tempo” is understood to be the rate of the tactus pulse. The rate of the tactus is also referred to as the foot-tapping rate, i.e. the rate at which listeners tap their feet when listening to the audio signal, e.g. the music signal. This is different from the musical meter, which defines the hierarchical structure of a music signal.
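By way of illustration only, the foot-tapping rate of a single listener may be expressed in BPM as sketched below. The function name `tapping_rate_bpm`, the representation of taps as timestamps in seconds, and the use of the median inter-tap interval are assumptions made for this example and are not part of the tempo estimation schemes described in the present document.

```python
from statistics import median


def tapping_rate_bpm(tap_times_s):
    """Convert a sequence of tap timestamps (in seconds) into a rate in BPM.

    Assumption (illustrative only): the median interval between successive
    taps is taken as the tactus period, which limits the influence of
    occasional missed or doubled taps.
    """
    if len(tap_times_s) < 2:
        raise ValueError("at least two taps are needed to measure a rate")
    intervals = [later - earlier
                 for earlier, later in zip(tap_times_s, tap_times_s[1:])]
    return 60.0 / median(intervals)


# A listener tapping every half second.
print(tapping_rate_bpm([0.0, 0.5, 1.0, 1.5, 2.0]))
```

A listener tapping once every half second thus corresponds to a tempo of 120 BPM in the sense used in the present document.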