Repetition can be a core principle in audio recordings such as music. This is especially true for popular songs, which are generally marked by an underlying repeating musical structure over which the singer performs varying lyrics, with distinguishable patterns periodically repeating at different levels and with possible variations.
An important part of music understanding can be the identification of those patterns. To visualize repeating patterns, a two-dimensional representation of the musical structure can be calculated by measuring the similarity and/or dissimilarity between any two instants of the audio, such as in a similarity matrix. Such a similarity matrix can be built from the Mel-Frequency Cepstral Coefficients (MFCCs) (e.g., as described in Jonathan Foote, Visualizing music and audio using self-similarity, ACM Multimedia, volume 1, pages 77-80, Orlando, Fla., USA, 30 Oct.-5 Nov. 1999, which is referred to herein as “Visualizing music”), the spectrogram (e.g., as described in Jonathan Foote, Automatic audio segmentation using a measure of audio novelty, International Conference on Multimedia and Expo, volume 1, pages 452-455, New York, N.Y., USA, 30 Jul.-2 Aug. 2000, which is referred to herein as “Automatic audio segmentation”), the chromagram (e.g., as described in Mark A. Bartsch, To catch a chorus: Using chroma-based representations for audio thumbnailing, Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y., USA, 21-24 Oct. 2001, which is referred to herein as Bartsch), or other features, such as the pitch contour (melody) (e.g., as described in Roger B. Dannenberg, Listening to “Naima”: An automated structural analysis of music from recorded audio, International Computer Music Conference, pages 28-34, Gothenburg, Sweden, 17-21 Sep. 2002, which is referred to herein as Dannenberg), depending on the application, as long as, in one embodiment, similar sounds yield similarity in the feature space.
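The construction of such a similarity matrix can be sketched as follows. This is a minimal illustration, not an implementation from any of the cited works: it assumes a feature matrix (one feature vector per frame, e.g., MFCCs) has already been computed, and it uses cosine similarity as the pairwise measure; the function name and the toy feature values are hypothetical.

```python
import numpy as np

def similarity_matrix(features):
    """Cosine similarity between every pair of feature frames.

    features: array of shape (n_frames, n_dims), e.g. one MFCC
    vector per time frame.  Returns an (n_frames, n_frames) matrix S
    where S[i, j] measures how alike instants i and j sound.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero frames
    return unit @ unit.T  # symmetric, with ones on the diagonal

# Toy example: three frames, the first and third identical.
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
S = similarity_matrix(F)
```

Repeating sections of the audio then appear as high-valued stripes parallel to the main diagonal of S.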
The similarity matrix can then be used, for example, to compute a measure of novelty to locate relatively significant changes in the audio (e.g., as described in Automatic audio segmentation) or to compute a beat spectrum to characterize the rhythm of the audio (e.g., as described in Jonathan Foote and Shingo Uchihashi, The beat spectrum: A new approach to rhythm analysis, International Conference on Multimedia and Expo, pages 881-884, Tokyo, Japan, 22-25 Aug. 2001, which is referred to herein as “The beat spectrum”). This ability to detect relevant boundaries within the audio can be of great utility for audio segmentation and audio summarization, such as described in Automatic audio segmentation, Bartsch, and Dannenberg.
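The novelty computation can be illustrated, for example, with the checkerboard-kernel approach of Automatic audio segmentation: a kernel that contrasts within-section similarity against cross-section similarity is slid along the main diagonal of the similarity matrix. This sketch uses a plain (unweighted) kernel for simplicity; the function name and the toy matrix are illustrative only.

```python
import numpy as np

def novelty_curve(S, half):
    """Slide a checkerboard kernel along the main diagonal of a
    self-similarity matrix S.  `half` is the half-width of the kernel;
    peaks in the returned curve mark instants where the audio changes
    significantly (section boundaries)."""
    sign = np.ones(2 * half)
    sign[half:] = -1.0
    kernel = np.outer(sign, sign)  # quadrants +, -, -, +
    n = S.shape[0]
    novelty = np.zeros(n)
    for i in range(half, n - half):
        patch = S[i - half:i + half, i - half:i + half]
        novelty[i] = np.sum(kernel * patch)
    return novelty

# Toy self-similarity matrix: two homogeneous sections of 4 frames each.
S = np.kron(np.eye(2), np.ones((4, 4)))
nov = novelty_curve(S, half=2)
```

The curve peaks at frame 4, the boundary between the two sections, since there the kernel sees high similarity within each quadrant of a section and low similarity across sections.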
Some known music/voice separation systems first detect vocal segments using features such as MFCCs, and then apply separation techniques such as Non-negative Matrix Factorization (e.g., as described in Shankar Vembu and Stephan Baumann, Separation of vocals from polyphonic audio recordings, International Conference on Music Information Retrieval, pages 337-344, London, UK, 11-15 Sep. 2005, which is referred to herein as Vembu), pitch-based inference (e.g., as described in Yipeng Li and DeLiang Wang, Separation of singing voice from music accompaniment for monaural recordings, IEEE Transactions on Audio, Speech, and Language Processing, 15(4):1475-1487, May 2007, which is referred to herein as “Li and Wang,” and/or in Chao-Ling Hsu and Jyh-Shing Roger Jang, On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset, IEEE Transactions on Audio, Speech, and Language Processing, 18(2):310-319, February 2010, which is referred to herein as “Hsu and Jang”), and/or adaptive Bayesian modeling (e.g., as described in Alexey Ozerov, Pierrick Philippe, Frédéric Bimbot, and Rémi Gribonval, Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs, IEEE Transactions on Audio, Speech, and Language Processing, 15(5):1561-1578, July 2007, which is referred to herein as “Ozerov”).
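As a generic sketch of the Non-negative Matrix Factorization technique named above (the standard Lee-Seung multiplicative updates, not the specific algorithm of Vembu): a non-negative magnitude spectrogram V is factored as W @ H, after which subsets of the rank components can be grouped into accompaniment and vocal estimates. The function name, iteration count, and toy data are assumptions for illustration.

```python
import numpy as np

def nmf(V, rank, n_iter=500, seed=0):
    """Factor a non-negative matrix V (freq x time) as W @ H using
    multiplicative updates that minimize the Frobenius reconstruction
    error.  W holds spectral templates; H holds their activations."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = V.shape
    W = rng.random((n_freq, rank)) + 1e-3
    H = rng.random((rank, n_time)) + 1e-3
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)  # activation update
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)  # template update
    return W, H

# Toy data: an exactly rank-2, non-negative "spectrogram".
rng = np.random.default_rng(1)
V = rng.random((10, 2)) @ rng.random((2, 20))
W, H = nmf(V, rank=2)
```

Because the updates are multiplicative, W and H stay non-negative throughout, which is what makes the learned components interpretable as additive spectral parts.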
Some of the known separation systems and methods, however, are not without drawbacks. First, some systems and methods may rely on particular or predesignated features being present in audio recordings in order to separate the components (e.g., music and vocals) from the recordings. If the features are not present, and/or the features must first be computed before the components can be separated, then the components may not be separated accurately. Second, some systems and methods rely on relatively complex frameworks having significant computational costs. Third, some systems and methods must first be trained to separate components of an audio recording, such as by learning statistical models of sound sources (e.g., a model of a person's voice) from a training database.
A need exists for a system and method that can separate components with repeating patterns from an audio recording, such as a musical accompaniment from a singing voice or a periodic interference from a corrupted signal, while avoiding or reducing the impact of one or more of the above shortcomings of some known systems and methods.