1. Field of the Invention
The present invention relates to a method for identifying changes in an audio signal which may include music, speech, or a combination of music and speech. More particularly, the present invention relates to the identification of changes in the audio for the purpose of indexing, summarizing, beat tracking, or retrieving.
2. Description of the Related Art
With video signals, frame-to-frame differences provide a useful measure of overall changes or novelty in the video signal content. Frame-to-frame differences can be used for automatic segmentation and key frame extraction, as well as for other purposes.
A similar measure for determining significant changes or novelty points in audio might have a number of useful applications. Computing audio changes or boundaries, however, is significantly more difficult than computing video changes. Straightforward approaches such as measuring spectral differences are typically not useful because they produce too many false alarms: the spectra of typical speech and music are in constant flux.
A typical approach to audio segmentation is to detect silences. Such a system is disclosed by Arons, B. in "SpeechSkimmer: A system for interactively skimming recorded speech." ACM Trans. On Computer Human Interaction, 4(1):3-38, March 1997. A procedure for detecting silence works best for speech, even though silences in the speech signal may have little or no semantic significance. Much audio, such as popular music or reverberant sources, may contain no silences at all, and silence-based segmentation methods will fail on it.
Another approach, termed "Auditory Scene Analysis", tries to detect harmonically and temporally related components of sound. Such an approach is described by A. Bregman in "Auditory Scene Analysis: Perceptual Organization of Sound", Bradford Books, 1994. Typically the Auditory Scene Analysis procedure works only in a limited domain, such as a small number of sustained and harmonically pure musical notes. For example, the Bregman approach looks for components in the frequency domain that are harmonically or temporally related. Typically rules or assumptions are used to define what "related" means, and the rules typically work well only in a limited domain.
Another approach uses speaker identification to segment audio by characteristics of an individual. Such a system is disclosed by Siu et al., "An Unsupervised Sequential Learning Algorithm For The Segmentation Of Speech Waveforms With Multiple Speakers", Proc. ICASSP, vol. 2, pp. 189-192, March 1992. Though a speaker identification method could be used to segment music, it relies on statistical models that must be trained from a corpus of labeled data, or estimated by clustering audio segments.
Another approach to audio segmentation operates using musical beat-tracking. In one approach to beat tracking, correlated energy peaks across sub-bands are used. See Scheirer, Eric D., "Tempo and Beat Analysis of Acoustic Musical Signals", J. Acoust. Soc. Am. 103(10), pp. 588-601. Another approach depends on restrictive assumptions, such as that the music must be in 4/4 time and have a bass drum on the downbeat. See Goto, M. and Y. Muraoka, "A Beat Tracking System for Acoustic Signals of Music," in Proc. ACM Multimedia 1994, San Francisco, ACM.
In accordance with the present invention a method is provided to automatically find points of change in music or audio, by looking at local self-similarity. The method can identify individual note boundaries or natural segment boundaries such as verse/chorus or speech/music transitions, even in the absence of cues such as silence.
The present invention works for any audio source regardless of complexity, does not rely on particular acoustic features such as silence or pitch, and needs no clustering or training.
The method of the present invention can be used in a wide variety of applications, including indexing, beat tracking, and retrieving and summarizing music or audio. The method works with a wide variety of audio sources.
The method in accordance with the present invention finds points of maximum audio change by considering self-similarity of the audio signal. For each time window in the audio signal, a parameterization, such as a Fast Fourier Transform (FFT), is applied to obtain a vector of parameter values. The self-similarity of the past windows and of the future windows, as well as the cross-similarity between past and future windows, is then determined from these vectors. A significant point of novelty or change will have a high self-similarity in the past and future, and a low cross-similarity. The extent of the time difference between "past" and "future" can be varied to change the scale of the system so that, for example, individual notes can be found using a short time extent while longer events, such as musical themes, can be identified by considering windows further into the past or future. The result is a measure of how novel the source audio is at any time.
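As an illustration only, the computation described above might be sketched in Python with NumPy as follows. The sketch assumes an FFT magnitude spectrum as the parameterization and cosine similarity as the similarity measure; the function names, the frame length, the hop size, and the `extent` parameter (the number of windows treated as past and as future) are hypothetical choices made for the example and are not specified by the method itself.

```python
import numpy as np

def frame_features(signal, frame_len=256, hop=128):
    """Magnitude spectrum of each Hann-windowed frame (one row per frame)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def novelty_score(features, extent=8):
    """Mean past/future self-similarity minus mean past-to-future cross-similarity."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against silent frames
    sim = unit @ unit.T  # cosine similarity between every pair of frames
    n = len(features)
    score = np.zeros(n)  # frames too close to either end keep a score of zero
    for t in range(extent, n - extent):
        past = slice(t - extent, t)
        future = slice(t, t + extent)
        self_sim = (sim[past, past].mean() + sim[future, future].mean()) / 2
        cross_sim = sim[past, future].mean()
        score[t] = self_sim - cross_sim
    return score

# Assumed test signal: a 440 Hz tone followed by a 1760 Hz tone, 0.5 s each.
sr = 8000
t_axis = np.arange(sr // 2) / sr
signal = np.concatenate([np.sin(2 * np.pi * 440 * t_axis),
                         np.sin(2 * np.pi * 1760 * t_axis)])
feats = frame_features(signal)
score = novelty_score(feats)
print("frame with the highest novelty:", int(np.argmax(score)))
```

In this sketch a larger `extent` compares longer stretches of past and future audio, which corresponds to varying the time extent to move from note-level changes toward longer events.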
Instants at which the difference between the self-similarity and cross-similarity measures is large correspond to significant audio changes, and provide good points for use in segmenting or indexing the audio. Periodic peaks in the difference measurement can correspond to periodicity in the music, such as rhythm, so the method in accordance with the present invention can be used for beat-tracking, that is, finding the tempo and location of downbeats in music. Applications of this method include:
Automatic segmentation for audio classification and retrieval.
Audio indexing/browsing: jump to segment points.
Audio summarization: play only start of significantly new segments.
Audio "gisting": play only segment that best characterizes entire work.
Align music audio waveforms with MIDI notes for segmentation
Indexing/browsing audio: link/jump to next novel segment
Automatically find endpoints for audio "smart cut-and-paste"
Aligning audio for non-linear time scale modification ("audio morphing").
Tempo extraction, beat tracking, and alignment
"Auto DJ" for concatenating music with similar tempos.
Finding time indexes in speech audio for automatic animation of mouth movements
Analysis for structured audio coding such as MPEG-4
The method in accordance with the present invention thus produces a time series that is proportional to the novelty of an acoustic source at any instant. High values and peaks correspond to large audio changes, so the novelty score can be thresholded to find instants that can be used as segment boundaries.
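By way of example only, thresholding the novelty score to obtain segment boundaries might look like the following Python sketch. The function name `segment_boundaries`, the rule of keeping only local maxima above the threshold, and the sample scores are assumptions made for illustration; the method itself does not prescribe a particular peak-picking rule or threshold value.

```python
import numpy as np

def segment_boundaries(score, threshold):
    """Indices of local maxima of the novelty score that exceed the threshold."""
    score = np.asarray(score, dtype=float)
    boundaries = []
    for t in range(1, len(score) - 1):
        # A flat-topped peak is reported once, at its last sample.
        is_peak = score[t] >= score[t - 1] and score[t] > score[t + 1]
        if is_peak and score[t] > threshold:
            boundaries.append(t)
    return boundaries

# Hypothetical novelty scores, one value per analysis window.
example = [0.0, 0.1, 0.9, 0.2, 0.1, 0.3, 0.7, 0.6, 0.0]
print(segment_boundaries(example, threshold=0.5))  # prints [2, 6]
```

Raising the threshold keeps only the largest changes, which suits coarse segmentation, while lowering it admits smaller changes such as individual note onsets.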