The variety of musical presentations and the range of musical tastes among audiences have both grown in the last few years. In particular, interest in music is growing in the population due to the rapid advances in storing and distributing pieces of music. Digital storage has made it possible to copy pieces of music as often as one likes without loss in quality. The most prominent example of this is the CD, which has almost completely superseded records. Recently, DVDs have also become increasingly popular, since they enable the presentation not only of stereo music but also of multi-channel music, for example in the known 5.1 surround format.
Previously, the main focus was on improving the sound quality and the distribution methods. But the increasing expansion of the Internet and of digital broadcasting has been accompanied by new demands for pre-filtering the large amounts of music data available to individual people. In this connection, the metadata concept, i.e. providing data about music data, reaches a new dimension. While descriptive data have previously been generated manually and added to the corresponding piece of music, automatic means for objectively analyzing the content of a piece of music are now being developed. Standardization efforts in this field are known by the keyword “MPEG-7”.
Thus, the benefits of this music analysis are to be seen in an efficient music summary or in a format-independent association of metadata with pieces of music. A further objective of the automatic generation of metadata is the ability to extract features from the original content that are related to the user's taste in music. For example, it is known to use extracted features of pieces of music to train a music provision system so that it categorizes incoming music into different musical genres.
In order to specify the musical content in a manageable and yet searchable manner, i.e. in order to provide data that can be read and interpreted both by humans and by machines, reference has to be made to semantically meaningful properties of the audio signal. Such properties are, for example, the tone of instruments, the melody contained in a piece, the tempo, the rhythm, or the harmony of a piece. Among these, the harmony feature is of special significance, since it is an indicator of the mood of a musical passage. A listener perceives a piece differently in terms of feeling, depending on whether it is dissonant or harmonic, or whether it is written in a major key or in a minor key. At the same time, the harmony gives hints as to the structural diversity of the available music material, for example whether there are quick and unusual chord changes, or whether there are repetitive properties in the chord structure.
The automatic expansion of polyphonic notes to full chords is known from musical tone synthesis. Modern synthesizers and keyboards are capable of automatically accompanying a player by analyzing their playing in real time and by generating a bass accompaniment, for example. The rules employed by such synthesizers or keyboards may also be applied to notes recovered from polyphonic music, even if not all notes can be recovered yet due to technical imperfections, in order to finally find dominant chords in an examined piece of music.
Thus, it is one object to analyze pieces of music that are not already present in musical notation or as a MIDI file, but are present in the form of their acoustic/electric waveforms, in order to extract individual notes of the examined piece of music from the waveform present in the time domain. The objective here is the melodic transcription of polyphonic music, i.e. ultimately the generation of a complete musical notation from a time domain representation of the music, which is a series of samples as stored on a CD, for example, or as present in compressed/encoded form in an mp3 file.
A musical notation of a piece of music may in a way be considered a frequency domain representation, since the piece of music is not given by a waveform in the time domain but by a series of notes or chords, i.e. several concurrent notes, written in the frequency domain, with the staff lines serving as the frequency scale.
At the same time, however, a musical notation also includes time information, in that the symbol of a note indicates whether it is to be played for a longer or a shorter time. The musical notation therefore does not place much importance on a pure frequency domain representation, i.e. the representation of an amplitude at a particular frequency, even though amplitude information is also given. This amplitude information is, however, not specified precisely, but only in general terms, as an indication of whether a portion of the piece of music, for example some bars or notes, is to be played loudly (forte) or quietly (piano).
In classical music in particular, but also in modern music, it can be assumed that, apart from percussive portions, all notes/tones lie in a predefined note raster. Thus, in a correctly played piece of music not all frequencies can be present, but only the frequencies permitted by the musical notation. In the western note scale, one octave is divided into twelve semitones. These twelve semitones are, however, not arranged at a constant spacing with reference to the frequency. Instead, in the tempered tuning, as it is known from the “Well-Tempered Clavier” by Johann Sebastian Bach, for example, a sequence of tones is employed in which the “quality” or “Q factor” is constant for each tone. This means that a frequency value divided by the bandwidth associated with this frequency value is constant for every tone. Tones with low frequencies have small bandwidths, whereas tones with high frequencies have large bandwidths.
This “geometric” classification of notes is illustrated by way of example in the left column of FIG. 2. The calculation rule, starting from a certain minimum frequency that has arbitrarily been assumed as 46 Hz in the example, is shown in the upper left field of FIG. 2. It can be seen that the spacing between the tone at 46.0 Hz and the tone at 48.74 Hz, which is 2.74 Hz, is smaller than the spacing between the tone at 86.84 Hz and the tone at 92.0 Hz, which is 5.16 Hz.
The spectral coefficients in the classification shown in the left half of FIG. 2, also referred to as variable spectral coefficients, thus differ from so-called constant spectral coefficients, as they are illustrated in the right half of FIG. 2.
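The figures quoted above can be reproduced with a short sketch. FIG. 2 is not available here, so the rule f_k = f_min · 2^(k/12) is an assumption; it does match the values quoted in the text. The names `tempered_scale`, `tones` and `spacings` are purely illustrative.

```python
F_MIN_HZ = 46.0            # minimum frequency assumed in the FIG. 2 example
SEMITONES_PER_OCTAVE = 12  # western scale: twelve semitones per octave


def tempered_scale(f_min, count):
    """Return `count` tone frequencies spaced geometrically from f_min."""
    return [f_min * 2 ** (k / SEMITONES_PER_OCTAVE) for k in range(count)]


# Thirteen tones: one full octave from 46 Hz up to and including 92 Hz.
tones = tempered_scale(F_MIN_HZ, 13)

# Spacing between each tone and the next higher tone, in Hz.
spacings = [hi - lo for lo, hi in zip(tones, tones[1:])]

for k, (freq, gap) in enumerate(zip(tones, spacings)):
    print(f"tone {k:2d}: {freq:7.2f} Hz, spacing to next tone {gap:5.2f} Hz")
```

The spacing grows by the same factor from tone to tone, which is why the gap at the top of the octave is almost twice the gap at the bottom.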
With constant spectral coefficients, the spacing between two adjacent spectral coefficients is always the same, from the lower end of the spectrum to the upper end. For illustration purposes, the twelve tones are shown in FIG. 2 in the tempered arrangement in the left column on the one hand, and in a constant arrangement with a frequency spacing of 2.74 Hz in the right column on the other hand. While the frequency spacing becomes greater and greater in the left column, so that the quality of each variable spectral coefficient is equal, the quality of each constant spectral coefficient in the right column increases with increasing frequency, because the frequency value grows while the frequency spacing remains identical.
From the above discussion, it becomes obvious that constant spectral coefficients, as they are provided by a Fourier transform, for example, are at odds with at least the western sense of music.
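The difference between the two columns can be made concrete by computing the quality of each coefficient. The sketch below assumes that the bandwidth of a coefficient equals the spacing to the next coefficient, and that the left column follows f_k = 46 Hz · 2^(k/12); both are illustrative assumptions, as are all variable names.

```python
F_MIN_HZ = 46.0
CONSTANT_SPACING_HZ = 2.74   # spacing used for the right column in the text
STEP = 2 ** (1 / 12)         # frequency ratio between adjacent semitones

# Left column: geometric (tempered) arrangement over one octave.
geometric = [F_MIN_HZ * STEP ** k for k in range(13)]
# Right column: constant arrangement with a fixed spacing.
constant = [F_MIN_HZ + k * CONSTANT_SPACING_HZ for k in range(13)]

# Quality = frequency / bandwidth, with bandwidth taken as the spacing
# to the next higher coefficient (an assumption made for this sketch).
q_geometric = [lo / (hi - lo) for lo, hi in zip(geometric, geometric[1:])]
q_constant = [f / CONSTANT_SPACING_HZ for f in constant]

print("geometric column, Q:", [round(q, 2) for q in q_geometric])
print("constant column,  Q:", [round(q, 2) for q in q_constant])
```

Under these assumptions the geometric column yields the same quality for every tone, 1/(2^(1/12) − 1), while the quality in the constant column rises steadily with frequency.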
But since a transcription is to be created from a piece of music, as a first step to a harmony analysis, often no Fourier transform but a so-called constant Q transform is employed, i.e. a transform taking into account that the quality of each variable spectral coefficient is identical. This leads to the fact that the transform is supposed to provide a frequency raster, which is no constant frequency raster, as it is shown on the right in FIG. 2, but that this transform provides a variable frequency raster, as it is shown on the left in FIG. 2. In other words, a variable transform is supposed to adapt the frequency raster, as it is shown on the left in FIG. 2, to the well-tempered note scale, for example, as forms the basis of an overwhelming number of classical and popular pieces of music.
In the technical publication “Calculation of a Constant Q Spectral Transform”, Judith C. Brown, Journal of the Acoustical Society of America, 89 (1), pages 425-432, January 1991, a time-frequency conversion is shown which takes into account that the scale of western music is based on a geometric spectral coefficient spacing. Such a constant Q transform may be derived from a Fourier transform in which the logarithm of the frequency axis is taken. On such a logarithmic frequency axis, the spacing between the harmonic frequency components of a tone forms a “pattern” that is the same for all music signals with harmonic frequency components. Differences manifest themselves, however, in the amplitudes of the components, in spite of their relatively fixed positions. These amplitude differences give the tone its tone color, for example.
When the frequency axis is plotted logarithmically, it turns out that the mapping of constant spectral coefficients to variable spectral coefficients provides too little information at low frequencies and too much information at high frequencies. The discrete short-time Fourier transform gives a constant resolution for every frequency bin, which is inversely proportional to the temporal window size. This means that a window with 1,024 samples at a sampling rate of 32,000 samples per second has a resolution of 31.3 Hz. At the lower end of the range of a violin, for example, i.e. at the frequency of G3 at 196 Hz, this resolution is 16% of the frequency. This is much greater than the 6% frequency separation between two adjacent notes in the tempered tuning. At the upper end of the range of a piano, the frequency of C8 is 4186 Hz, where the FFT resolution of 31.3 Hz corresponds to 0.7% of the center frequency. Thus, far too many frequency coefficients are calculated by the FFT in this part of the frequency range. Mathematically, the constant Q transform is represented as follows:
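The percentages in the preceding paragraph follow directly from the window size and the sampling rate, as the following sketch shows. The semitone spacing is taken as 2^(1/12) − 1, which assumes equal temperament; the variable names are illustrative.

```python
SAMPLE_RATE_HZ = 32_000
WINDOW_SAMPLES = 1_024

# Bin spacing of the short-time Fourier transform: sampling rate / window size.
resolution_hz = SAMPLE_RATE_HZ / WINDOW_SAMPLES

# Relative spacing of two adjacent semitones (assumption: equal temperament).
semitone_ratio = 2 ** (1 / 12) - 1

print(f"FFT resolution: {resolution_hz:.2f} Hz")
print(f"semitone separation: {100 * semitone_ratio:.1f} % of the frequency")

for name, freq_hz in (("G3", 196.0), ("C8", 4186.0)):
    relative = resolution_hz / freq_hz
    print(f"{name} at {freq_hz:.0f} Hz: resolution is {100 * relative:.2f} % of the frequency")
```

At G3 one bin is wider than two semitones, so adjacent notes cannot be separated, whereas at C8 a single semitone is covered by several bins.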
$$X[k] = \sum_{n=0}^{N-1} W[k,n]\, x[n]\, \exp\{-j\, 2\pi\, Q n / N[k]\}.$$
In this equation, x[n] is the n-th sample of a digitized time function to be analyzed. The digital frequency is 2πk/N. The period in samples is N/k, and the number of analyzed cycles is equal to k. Here, W[k,n] indicates the window shape. The window function has the same shape for each component. Its length is, however, determined by N[k], so that it is a function of both k and n.
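A direct, unoptimized evaluation of such a sum can be sketched as follows. This is not the implementation of the cited publication. Several points are assumptions made for illustration: the window length is chosen as N[k] = ceil(Q · fs / f_k), the sum runs over the N[k] samples covered by the window, a Hamming-type window is used, and each coefficient is divided by N[k] so that bins with different window lengths are comparable. The function name and its parameters are hypothetical.

```python
import cmath
import math


def constant_q_direct(x, fs, f_min, bins_per_octave, n_bins):
    """Evaluate sum_n W[k,n] x[n] exp(-j 2 pi Q n / N[k]) for each bin k."""
    # Constant quality for geometrically spaced bins.
    q = 1.0 / (2 ** (1.0 / bins_per_octave) - 1.0)
    coefficients = []
    for k in range(n_bins):
        f_k = f_min * 2 ** (k / bins_per_octave)   # center frequency of bin k
        n_k = math.ceil(q * fs / f_k)              # assumed window length N[k]
        if n_k > len(x):
            raise ValueError("signal too short for the lowest requested bin")
        acc = 0j
        for n in range(n_k):
            # Hamming-type window of length N[k] (assumption).
            w = 25 / 46 - (1 - 25 / 46) * math.cos(2 * math.pi * n / n_k)
            acc += w * x[n] * cmath.exp(-2j * math.pi * q * n / n_k)
        coefficients.append(acc / n_k)             # normalization is an assumption
    return coefficients


# Usage: a pure tone five semitones above 220 Hz, analyzed over one octave.
fs = 8000.0
f_min = 220.0
tone_hz = f_min * 2 ** (5 / 12)
signal = [math.cos(2 * math.pi * tone_hz * n / fs) for n in range(1024)]
magnitudes = [abs(c) for c in constant_q_direct(signal, fs, f_min, 12, 12)]
strongest_bin = magnitudes.index(max(magnitudes))
print("strongest bin:", strongest_bin)
```

Because the exponent contains Q n / N[k], every bin analyzes the same number of cycles, Q, regardless of its frequency; low bins therefore use long windows and high bins short ones.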
In the technical publication “An Efficient Algorithm for the Calculation of a Constant Q Transform”, Judith C. Brown et al., Journal of the Acoustical Society of America, 92 (5), pages 2698-2701, November 1992, an efficient algorithm for calculating the previously described transform is given. At first a discrete Fourier transform is determined, which is then converted to a constant Q transform, wherein Q is the ratio of center frequency to the bandwidth. To this end, so-called kernels are calculated, which then are applied to each consecutive DFT. Thus, each component of the constant Q transform can be calculated with a few multiplications. A spectral kernel is the discrete Fourier transform of a temporal kernel, wherein a temporal kernel is given as follows:
$$w[n,k_{cq}]\, e^{-j\omega_{k_{cq}} n} = K^{*}[n,k_{cq}].$$

$$x_{cq}[k_{cq}] = \sum_{n=0}^{N-1} x[n]\, K^{*}[n,k_{cq}]$$
As window w[n,k_cq], a Hamming window according to the following definition is used:

$$w[n,k_{cq}] = \alpha - (1-\alpha)\cos\left(2\pi n / N[k_{cq}]\right)$$

In this equation, α equals 25/46.
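The relationship between temporal and spectral kernels can be illustrated with a small sketch. It relies on Parseval's relation for the DFT, Σ x[n] y*[n] = (1/N) Σ X[m] Y*[m], so that a constant Q coefficient may be computed either in the time domain or from the DFT of the signal. The sketch assumes ω = 2πQ/N[k_cq], zero-pads the kernel to the DFT length, uses a deliberately naive DFT to stay self-contained, and picks an arbitrary 1% threshold for counting significant kernel entries. None of these choices is taken from the cited publication, and all names are hypothetical.

```python
import cmath
import math

ALPHA = 25 / 46  # Hamming coefficient given in the text


def dft(seq):
    """Naive O(N^2) discrete Fourier transform (kept simple on purpose)."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * m * t / n) for t in range(n))
            for m in range(n)]


def temporal_kernel(n_fft, n_k, q):
    """K[n] = w[n] exp(+j omega n), zero-padded, so K*[n] = w[n] exp(-j omega n)."""
    omega = 2 * math.pi * q / n_k   # assumed analysis frequency in rad/sample
    kernel = []
    for n in range(n_fft):
        if n < n_k:
            w = ALPHA - (1 - ALPHA) * math.cos(2 * math.pi * n / n_k)
            kernel.append(w * cmath.exp(1j * omega * n))
        else:
            kernel.append(0j)
    return kernel


fs, n_fft, f_k = 8000.0, 128, 1760.0
q = 1.0 / (2 ** (1 / 12) - 1)
n_k = math.ceil(q * fs / f_k)       # window length for this component

x = [math.cos(2 * math.pi * f_k * n / fs) for n in range(n_fft)]
kernel = temporal_kernel(n_fft, n_k, q)

# Time-domain evaluation: sum over n of x[n] K*[n].
cq_time = sum(xn * kn.conjugate() for xn, kn in zip(x, kernel))

# Frequency-domain evaluation via Parseval: (1/N) sum over m of X[m] Kspec*[m].
x_spec, kernel_spec = dft(x), dft(kernel)
cq_freq = sum(a * b.conjugate() for a, b in zip(x_spec, kernel_spec)) / n_fft

# Count spectral kernel entries above 1 % of the peak (threshold is an assumption).
peak = max(abs(v) for v in kernel_spec)
significant = sum(1 for v in kernel_spec if abs(v) > 0.01 * peak)

print(abs(cq_time - cq_freq), significant, "of", n_fft)
```

Only a small fraction of the spectral kernel entries is significant, which is the reason why each constant Q component can be obtained from an existing DFT with few multiplications.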
In F. J. Harris, “High-Resolution Spectral Analysis with Arbitrary Spectral Centers and Arbitrary Spectral Resolutions”, Comput. Electr. Eng. 3, pages 171-191, 1976, a transform with bounded Q value is used, which may also serve for music analysis. Here, at first a fast Fourier transform (FFT) is calculated, and the frequency values are then discarded with the exception of the topmost octave. The signal is then filtered and downsampled by a factor of 2, and a further FFT with the same number of points as before is calculated, which leads to twice the previous frequency resolution. Of this result, again only one octave is retained, namely the second-highest octave. This procedure is repeated until the lowest octave is reached. The advantage of this method is that the efficiency of the FFT is maintained, and that at the same time a variable frequency resolution and a variable time resolution are obtained, so that the obtained information can be optimized both with respect to frequency and with respect to time.
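The effect of repeated downsampling with a fixed FFT size can be tabulated with a few lines. The sampling rate of 32,000 samples per second and the FFT size of 1,024 points are assumptions borrowed from the earlier example, six octaves is likewise an example value, and the filtering step itself is not modeled.

```python
SAMPLE_RATE_HZ = 32_000.0   # assumed starting sampling rate
FFT_POINTS = 1_024          # same number of points for every octave
OCTAVES = 6                 # example value

rows = []
rate = SAMPLE_RATE_HZ
for octave in range(OCTAVES):           # 0 = topmost octave
    bin_width_hz = rate / FFT_POINTS    # frequency resolution at this stage
    window_ms = FFT_POINTS * 1000 / rate  # time span covered by one FFT
    rows.append((octave, rate, bin_width_hz, window_ms))
    rate /= 2                           # downsample by a factor of 2

for octave, stage_rate, bin_width_hz, window_ms in rows:
    print(f"octave {octave}: fs = {stage_rate:7.0f} Hz, "
          f"bin width = {bin_width_hz:7.4f} Hz, window = {window_ms:6.0f} ms")
```

Each stage halves the bin width and doubles the time span of one FFT, which is the variable frequency and time resolution described above, and also the reason why the lowest octave needs far more input time than the topmost one.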
A disadvantage of this concept is that, when a larger tone space is to be calculated, a large number of Fourier transforms nevertheless has to be calculated, wherein between successive Fourier transforms windowing (filtering) has to be performed anew and downsampling has to be done as well. This in turn means that very many temporal samples are needed for the lowest octave, whereas very few temporal samples are needed for the topmost octave. Thus, if one wishes to calculate a complete analysis, the entire pyramid, so to speak, has to be calculated through for every (small) number of samples for the topmost octave. Since most results of each FFT are “thrown away” in this method, and since a rather significant number of overlaps with respect to the lower octaves is required in the temporal “pyramid”, this method is computationally very intensive, in spite of using the indeed efficient FFT. In other words, for each octave an FFT of its own has to be calculated to obtain a complete spectrum. If one wishes to analyze a time signal completely, i.e. for example every 8 milliseconds or every 16 milliseconds, and if, for example, 6 octaves are to be calculated, as many as 96 (!) FFTs will be required for a 128-millisecond excerpt of a piece at an 8-millisecond interval (128 ms / 8 ms = 16 analysis frames, each requiring 6 FFTs).
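The count of FFTs quoted above can be checked with a small sketch, assuming one FFT per octave for every analysis frame; the function name and its parameters are illustrative.

```python
def ffts_required(excerpt_ms, interval_ms, octaves):
    """Number of FFTs when each analysis frame needs one FFT per octave."""
    frames = excerpt_ms // interval_ms
    return frames * octaves


count_8ms = ffts_required(excerpt_ms=128, interval_ms=8, octaves=6)
count_16ms = ffts_required(excerpt_ms=128, interval_ms=16, octaves=6)
print(count_8ms, count_16ms)
```

Under this assumption, the figure of 96 corresponds to the 8-millisecond interval; at 16 milliseconds the same excerpt would still need 48 FFTs.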