Methods by means of which songs can be referenced by specifying a sequence of notes are useful for many users. Everybody is familiar with the situation in which a tune is running through one's head but, apart from the tune itself, the title of the song cannot be remembered. It would be desirable to sing the tune, or to play it on a musical instrument, and, by means of this information, to reference this very tune in a music database, provided that the tune is contained in the music database.
The MIDI format (MIDI=Musical Instrument Digital Interface) is a note-based standard description of music signals. A MIDI file includes a note-based description in which the start and end of a tone, or the start of the tone and the duration of the tone, are recorded as a function of time. MIDI files may, for example, be read into electronic keyboards and replayed. There are also sound cards that replay a MIDI file via loudspeakers connected to a computer. From this it can be seen that the conversion of a note-based description into a music signal, which, in its most original form, is performed "manually" by an instrumentalist who plays a song recorded in notes on a musical instrument, may just as well be carried out automatically.
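To illustrate the note-based description mentioned above, the following is a minimal sketch of how such a file's content can be modeled: each note is recorded with its onset time, its duration and its pitch. The class and field names are illustrative assumptions, not part of the MIDI standard itself.

```python
from dataclasses import dataclass

# Minimal sketch of a note-based description as stored in a MIDI-like
# file: each note carries its tone start, tone duration, and pitch.
# Pitch is given as a MIDI note number, where 69 corresponds to the
# standard tone A at 440 Hz.
@dataclass
class NoteEvent:
    onset: float     # tone start in seconds
    duration: float  # tone length in seconds
    pitch: int       # MIDI note number (semitone steps, A4 = 69)

# the first tones of a tune, written as such a sequence of events
tune = [
    NoteEvent(0.0, 0.5, 69),  # A4
    NoteEvent(0.5, 0.5, 71),  # B4
    NoteEvent(1.0, 1.0, 73),  # C#5
]

# total length of the described tune in seconds
total_length = tune[-1].onset + tune[-1].duration
print(total_length)  # → 2.0
```

A sequencer or sound card reading such a description can replay it automatically, which is exactly the "manual" instrumentalist's task carried out by a machine.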
The reverse conversion, however, is much more complex. Converting a music signal, i.e. a tune that is sung, played on an instrument, or played back over a loudspeaker, or a tune available in digitized and optionally compressed form as a file, into a note-based description in the form of a MIDI file or of conventional musical notation is subject to great restrictions.
In the doctoral thesis "Using Contour as a Mid-Level Representation of Melody" by A. Lindsay, Massachusetts Institute of Technology, September 1996, a method for converting a sung music signal into a sequence of notes is described. The song has to be performed using stop consonants, i.e. as a sequence of "da", "da", "da". Subsequently, the power distribution of the music signal generated by the singer is examined over time. Owing to the stop consonants, a clear power drop between the end of one tone and the start of the following tone can be recognized in a power-time diagram. On the basis of these power drops, the music signal is segmented such that one note is present in each segment. A frequency analysis then provides the pitch of the sung tone in each segment, the sequence of frequencies also being referred to as the pitch contour.
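The segmentation step described above can be sketched as follows. This is not Lindsay's implementation, only an illustration under the assumption that per-frame power values are already available; the threshold and the example values are made up for the sketch.

```python
# Sketch of segmenting a signal at power drops: frames whose short-time
# power falls below a threshold are treated as the gaps caused by the
# stop consonants, and each run of loud frames becomes one note segment.
def segment_at_power_drops(power, threshold):
    """power: per-frame power values; returns (start, end) frame index
    pairs, one pair per detected note segment (end exclusive)."""
    segments = []
    start = None
    for i, p in enumerate(power):
        if p >= threshold and start is None:
            start = i                    # power rises: a note begins
        elif p < threshold and start is not None:
            segments.append((start, i))  # power drop: the note ends
            start = None
    if start is not None:
        segments.append((start, len(power)))
    return segments

# three "da" syllables separated by clear power drops
power = [0.0, 0.9, 0.8, 0.1, 0.7, 0.9, 0.05, 0.8, 0.6, 0.0]
print(segment_at_power_drops(power, 0.3))  # → [(1, 3), (4, 6), (7, 9)]
```

A frequency analysis of each returned segment would then yield one pitch value per note, i.e. the pitch contour.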
The method has the disadvantage that it is restricted to sung inputs. When specifying a tune, the tune has to be sung with a stop consonant and a vowel in the form of "da", "da", "da" for a segmentation of the recorded music signal to be possible. This already excludes applying the method to orchestral pieces in which a dominant instrument plays legato notes, i.e. notes which are not separated by rests.
After the segmentation, the prior-art method calculates the intervals between each two successive pitch values in the pitch-value sequence; this interval value is taken as a distance measure. The resulting interval sequence is then compared with reference sequences stored in a database, the reference sequence with the minimum sum of squared differences being taken as the solution, i.e. as the note sequence referenced in the database.
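The matching step just described can be sketched as follows, assuming for simplicity that query and reference sequences have equal length; the song titles and pitch values are illustrative.

```python
# Sketch of the prior-art matching step: form the intervals between
# successive pitch values, then select the reference sequence whose
# intervals minimize the sum of squared differences to the query.
def intervals(pitches):
    return [b - a for a, b in zip(pitches, pitches[1:])]

def best_match(query_pitches, references):
    """references: dict mapping a title to its pitch sequence
    (assumed equal in length to the query in this sketch)."""
    q = intervals(query_pitches)
    def cost(ref):
        r = intervals(ref)
        return sum((x - y) ** 2 for x, y in zip(q, r))
    return min(references, key=lambda title: cost(references[title]))

refs = {
    "song A": [60, 62, 64, 65],
    "song B": [60, 60, 67, 67],
}
# a query sung a whole tone too high still matches song A, because
# intervals between successive tones are independent of transposition
print(best_match([62, 64, 66, 67], refs))  # → song A
```

Working on intervals rather than absolute pitches makes the comparison tolerant of transposed input, which is also the reason the absolute tuning information is discarded.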
A further disadvantage of this method is that it uses a pitch tracker which produces octave-jump errors that have to be compensated for afterwards. Further, the pitch tracker must be fine-tuned in order to provide valid values. The method merely uses the interval distances between two successive pitch values. A rough quantization of the intervals is carried out, comprising only coarse classes such as "very large", "large" and "constant". Through this rough quantization, the absolute pitch information in Hertz is lost, as a result of which a finer determination of the tune is no longer possible.
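The rough quantization criticized above can be sketched as follows; the class boundaries are illustrative assumptions, since the source only names the coarse classes themselves.

```python
# Sketch of the rough interval quantization: each interval (here in
# semitones) is mapped to one of only a few coarse classes, so the
# absolute pitch information in Hertz is lost. The boundary at four
# semitones is an illustrative assumption.
def quantize_interval(semitones):
    magnitude = abs(semitones)
    if magnitude == 0:
        return "constant"
    if magnitude <= 4:
        return "large"
    return "very large"

print([quantize_interval(i) for i in (0, 1, 3, 7)])
# → ['constant', 'large', 'large', 'very large']
```

After this step, an interval of one semitone and an interval of three semitones are indistinguishable, which is why a finer determination of the tune is no longer possible.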
In order to be able to carry out music recognition, it is desirable to determine, from a replayed tone sequence, a note-based description, for example in the form of a MIDI file or of conventional musical notation, each note being given by tone start, tone length and pitch.
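The pitch of such a note can be obtained from a detected frequency using the standard relation between frequency and semitone distance from the reference tone A at 440 Hz; the following is a minimal sketch of that well-known conversion, not a step prescribed by the source.

```python
import math

# Sketch of assigning a note pitch to a detected frequency: the
# distance in semitones from the standard tone A at 440 Hz gives a
# MIDI note number (A4 = 69); rounding yields the nearest scale tone.
def frequency_to_midi_note(freq_hz, reference_hz=440.0):
    return round(69 + 12 * math.log2(freq_hz / reference_hz))

print(frequency_to_midi_note(440.0))  # → 69 (A4)
print(frequency_to_midi_note(880.0))  # → 81 (A5, one octave higher)
```

The `reference_hz` parameter already hints at the tuning problem addressed below: an instrument tuned to 435 Hz would need a correspondingly adjusted reference.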
Furthermore, it should be considered that the tune entered is not always exact. In particular, for commercial use it must be assumed that the sung note sequence may be faulty both with respect to pitch and with respect to rhythm and the order of tones. If the note sequence is performed with an instrument, it has to be assumed that the instrument might be mistuned or tuned to a different fundamental tone frequency (for example not to the standard tone A at 440 Hz but to an "A" at 435 Hz). Furthermore, the instrument may be tuned in its own key, such as the B-flat clarinet or the E-flat saxophone. Even when the tune is performed with an instrument, the tone sequence may be faulty, in that tones are left out (delete), tones are inserted (insert) or different (false) tones are played (replace). Just as well, the tempo may vary. Moreover, it should be considered that each instrument has its own timbre, such that a tone produced by an instrument is a mixture of the fundamental tone and further frequency components, the so-called harmonics.
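The delete, insert and replace errors named above are exactly the operations counted by the edit (Levenshtein) distance between two note sequences; the following sketch illustrates how such errors can be tolerated when comparing a sung sequence with a reference. The example pitch values are illustrative.

```python
# Sketch of tolerating insert / delete / replace errors: the edit
# (Levenshtein) distance between a sung note sequence and a reference
# counts the minimum number of left-out, inserted, or false tones
# needed to turn one sequence into the other.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # a tone was left out (delete)
                cur[j - 1] + 1,          # a tone was inserted (insert)
                prev[j - 1] + (x != y),  # a false tone was played (replace)
            ))
        prev = cur
    return prev[-1]

reference = [60, 62, 64, 65, 67]
sung = [60, 64, 65, 66, 67]  # one tone left out, one false tone
print(edit_distance(sung, reference))  # → 2
```

A database search that ranks reference tunes by such a distance remains usable even when the entered tune is faulty in the ways described.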