Performed music typically consists of notes played from a scale, such as an equal-tempered 12-tone scale. Different notes, with their overtones, appear with different intensities and durations during the course of a performance. These tones generally span several octaves. In harmonic and polyphonic music, a number of tones may be dominant in intensity (loudness) at one time. The sound is usually digitized as a time series at a fixed sample rate, such as the CD standard of 44.1 kHz. It is desirable to observe music data in the frequency domain quantitatively and accurately through spectral analysis.
Spectral analysis of sound, including music, is typically done with a Discrete Fourier Transform (DFT) on the digitized signal. The aperture for DFT analysis is a frame of time-series data of fixed sample size. For real-valued input, the DFT spectral output is half that sample size in complex numbers, representing the spectral content of the time-series data. For computational efficiency, a Fast Fourier Transform (FFT), an efficient method for some DFT computations, is usually employed. This is a well-known procedure.
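The relationship between frame size and spectral output described above can be illustrated with a minimal sketch. It uses a naive O(N^2) DFT rather than an FFT, and the sample rate, frame size, and test tone are illustrative assumptions, not values from the text. Following the count used in the text, only bins 0 through N/2 - 1 are returned.

```python
import cmath
import math


def dft_half(samples):
    """Naive DFT of a real-valued frame; returns complex bins 0 .. N/2 - 1."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n // 2)
    ]


# Illustrative values only (assumptions, not taken from the text).
sample_rate = 8000.0  # Hz
frame_size = 64       # samples in the analysis aperture
tone_hz = 1000.0      # chosen to fall exactly on a bin (8000 / 64 = 125 Hz per bin)

frame = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(frame_size)]
spectrum = dft_half(frame)

# Locate the strongest bin and convert its index back to a frequency.
magnitudes = [abs(c) for c in spectrum]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * sample_rate / frame_size

print(len(spectrum), "complex bins; peak at", peak_hz, "Hz")
```

The test tone is deliberately placed on a bin center; a tone between bins would spread its energy across neighboring bins.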
The DFT/FFT approach to analyzing music for its spectral content has some disadvantages:
In a DFT, the resulting spectral components are linearly distributed into frequency bins determined by the sampling rate and sample size. To illustrate, a frame of 2,048 time-series samples taken at a sampling rate of 44.1 kHz is Fourier transformed into 1,024 spectral bins equally spaced 21.53 Hz apart. They are fixed at 0.00, 21.53, 43.07, 64.60, . . . , 22,028.47 Hz. In music, fundamentals and overtones are spaced not linearly but logarithmically. For example, in an A440 equal-tempered scale, from low E to two octaves above middle C, the tones are 82.41, 87.3, 92.5, . . . , 987.8, 1046.5 Hz. (See FIG. 1.) The Fourier spectral bins cannot be aligned with these tones, and therefore any DFT is necessarily an inexact spectral analysis for music.
Also, the frequency resolution of a DFT is too coarse to distinguish low tones. In the example, the two lowest music tones are separated by less than 5 Hz, but an FFT has a constant resolution of 21.53 Hz, which is more than four times the low-tone spacing.
To improve frequency resolution using DFTs, the frame size must be lengthened proportionately, widening the data-gathering aperture and slowing the analysis process. A frame size of 2,048 corresponds to an aperture time of 46.44 ms, and the analysis result is reported 21.5 times every second. Longer frames, with their correspondingly wider apertures, smear the music structure being analyzed and slow the reporting rate, both of which are detrimental to analyzing rapid music. For FFTs, frame sizes are confined to powers-of-two samples, placing additional constraints on the process.
Another undesirable aspect of Fourier analysis is called the Gibbs phenomenon, which causes obvious distortion at the edges of the output frame due to inappropriate boundary conditions. To minimize distortion, DFT users resort to modifying, in effect falsifying, input data in a process called “windowing” just to make the end result “look” natural.
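The figures quoted above can be reproduced with a short calculation. The sketch below assumes the standard equal-temperament formula with A4 = 440 Hz and uses MIDI note numbers (40 for low E, 84 for the C two octaves above middle C) purely as a convenient indexing scheme; those labels and the helper names are assumptions, not part of the original text.

```python
SAMPLE_RATE = 44_100.0  # Hz, CD standard cited in the text
FRAME_SIZE = 2_048      # samples per DFT frame, as in the text's example

bin_spacing = SAMPLE_RATE / FRAME_SIZE           # Hz between adjacent DFT bins
aperture_ms = 1000.0 * FRAME_SIZE / SAMPLE_RATE  # duration of one frame
reports_per_second = SAMPLE_RATE / FRAME_SIZE    # assumes non-overlapping frames


def tone_hz(midi_note, a4=440.0):
    """Equal-tempered frequency for a MIDI note number (69 = A4)."""
    return a4 * 2.0 ** ((midi_note - 69) / 12.0)


LOW_E, HIGH_C = 40, 84  # assumed MIDI labels for low E and C two octaves above middle C
tones = [tone_hz(m) for m in range(LOW_E, HIGH_C + 1)]


def nearest_bin_error(freq):
    """Signed distance in Hz from a tone to the closest fixed DFT bin center."""
    nearest = round(freq / bin_spacing) * bin_spacing
    return nearest - freq


lowest_gap = tones[1] - tones[0]  # spacing between the two lowest tones

print(f"bin spacing      : {bin_spacing:.2f} Hz")
print(f"aperture         : {aperture_ms:.2f} ms")
print(f"reports per sec  : {reports_per_second:.1f}")
print(f"lowest tone gap  : {lowest_gap:.2f} Hz")
for f in tones[:3]:
    print(f"tone {f:8.2f} Hz -> nearest bin off by {nearest_bin_error(f):+.2f} Hz")
```

Because the bin centers are multiples of a fixed spacing while the tones follow a geometric progression, the per-tone error varies from note to note and cannot be removed by any single choice of frame size.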
Yet another undesirable aspect of Fourier analysis is its susceptibility to burst errors, or “glitches”. Even a single “wild” erroneous point creates a large perturbation in the spectrum, because the Fourier Transform treats it as a sharp impulse function, which is rich in spectral content.
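The effect of a single erroneous sample can be sketched numerically. The example below uses a naive DFT on a short frame; the frame size, tone, glitch position, and glitch amplitude are all hypothetical values chosen for illustration.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of bins 0 .. N/2 - 1 from a naive DFT of a real-valued frame."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n // 2)
    ]


FRAME_SIZE = 64       # hypothetical frame size
TONE_BIN = 4          # tone placed exactly on bin 4 so the clean spectrum is a single line
GLITCH_INDEX = 20     # hypothetical position of the erroneous sample
GLITCH_AMPLITUDE = 10.0  # hypothetical size of the error

clean = [math.sin(2 * math.pi * TONE_BIN * t / FRAME_SIZE) for t in range(FRAME_SIZE)]
glitched = list(clean)
glitched[GLITCH_INDEX] += GLITCH_AMPLITUDE  # one "wild" point

clean_mag = dft_magnitudes(clean)
glitch_mag = dft_magnitudes(glitched)

# Compare the bins that should be empty (everything except the tone's bin).
clean_off_tone = [m for k, m in enumerate(clean_mag) if k != TONE_BIN]
glitch_off_tone = [m for k, m in enumerate(glitch_mag) if k != TONE_BIN]

print("largest off-tone magnitude, clean   :", max(clean_off_tone))
print("smallest off-tone magnitude, glitch :", min(glitch_off_tone))
```

In this sketch the single added impulse contributes the same magnitude to every bin, so bins that were essentially empty in the clean frame all rise to the glitch amplitude.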
In summary, analyzing music with FFTs suffers from poor frequency resolution for low tones. Spectral components cannot be aligned with music tones, making spectral analysis necessarily imprecise. Restricting frame size to powers-of-two samples in FFTs places further constraints. FFTs are also susceptible to sizeable distortion due to glitches and the Gibbs phenomenon.