Multimedia content is an increasingly popular resource, supported by a surging market for personal digital music devices, increasing bandwidth to the home and the emergence of 3G wireless devices. There is consequently a growing need for an effective mechanism for searching multimedia content. Though many systems exist for content-based retrieval of images, few mechanisms are available for retrieving the audio portion of multimedia content. One possibility for such a mechanism is retrieval by humming, whereby a user searches by humming the melody of a desired musical piece into a system. Such retrieval relies on a melody transcription technique.
FIG. 1 shows a flowchart for a known system of humming recognition. The melody transcription technique comprises a silence discriminator 101, a pitch detector 102 and a note extractor 103. It is assumed that each note will be separated from the next by a reasonable amount of silence. This assumption reduces the problem of note segmentation to one of silence detection.
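The reduction of segmentation to silence detection can be illustrated with a minimal sketch. It assumes a simple frame-energy test; the function names, the frame length of 160 samples and the energy threshold are hypothetical choices for illustration and are not taken from the known system of FIG. 1.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def segment_by_silence(samples, frame_len=160, threshold=1e-3):
    """Return (start_frame, end_frame) pairs for runs of non-silent frames.

    end_frame is exclusive. threshold is an assumed energy level below
    which a frame is treated as silence.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None
    for idx, energy in enumerate(energies):
        voiced = energy >= threshold
        if voiced and start is None:
            start = idx                      # a candidate note begins
        elif not voiced and start is not None:
            segments.append((start, idx))    # silence closes the note
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments

def tone(freq_hz, n_samples, sample_rate=8000, amplitude=0.5):
    """Synthetic sine tone standing in for a hummed note."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]

# Two tones separated by 480 samples (three frames) of silence.
signal = tone(220.0, 800) + [0.0] * 480 + tone(220.0, 800)
segments = segment_by_silence(signal)
print(segments)
```

Each returned pair would then be treated as one candidate note and passed on for pitch detection. The sketch finds a boundary only where silence actually occurs.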
In U.S. Pat. No. 6,188,010 an FFT (Fast Fourier Transform) algorithm is used to analyse sound by obtaining frequency spectrum information from waveform data. The frequency of the voice is obtained from the spectrum and, finally, the musical note with the nearest pitch is selected.
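The general idea of obtaining a frequency from an FFT and selecting the nearest note can be sketched as follows. This is an illustration under stated assumptions, not the method of the cited patent: the strongest spectral bin is taken as the voice frequency (a simplification, since a harmonic can be stronger than the fundamental), and notes are assumed to follow twelve-tone equal temperament with A4 at 440 Hz. All function names are hypothetical.

```python
import cmath
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def peak_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest bin below Nyquist, ignoring DC."""
    spectrum = fft(samples)
    half = len(samples) // 2
    k = max(range(1, half), key=lambda i: abs(spectrum[i]))
    return k * sample_rate / len(samples)

def nearest_note(freq_hz, a4_hz=440.0):
    """Name of the equal-tempered note closest in pitch to freq_hz."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

# Synthetic 440 Hz tone, 1024 samples at 8 kHz (bin spacing 7.8125 Hz).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 440.0 * i / sample_rate) for i in range(1024)]
estimate = peak_frequency(frame, sample_rate)
print(estimate, nearest_note(estimate))
```

The frequency estimate is limited to the bin spacing of the transform, so the estimate for the 440 Hz tone falls on the nearest bin rather than on 440 Hz exactly. The rounding to the nearest note absorbs that error here.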
In U.S. Pat. No. 5,874,686 an autocorrelation-based method is used to detect the pitch of each note. In order to improve the performance and robustness of the pitch-tracking algorithm, a cubic-spline wavelet transform or other suitable wavelet transform is used.
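A minimal sketch of autocorrelation-based pitch detection is given below. It shows only the autocorrelation step; the wavelet transform mentioned above is omitted. The search range of 80 Hz to 800 Hz and the function name are assumptions made for illustration and are not taken from the cited patent.

```python
import math

def autocorr_pitch(frame, sample_rate, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency of one frame.

    The lag with the largest (unnormalised) autocorrelation inside the
    lag range corresponding to [fmin, fmax] is taken as the pitch period.
    Returns None when no lag gives a positive autocorrelation.
    """
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_val = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag if best_lag else None

# Synthetic 200 Hz tone: the period is 40 samples at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200.0 * i / sample_rate) for i in range(400)]
f0 = autocorr_pitch(frame, sample_rate)
print(f0)
```

Because the lag is a whole number of samples, the estimate is quantised to values of the form sample_rate / lag. Practical systems usually refine it by interpolating around the peak.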
In U.S. Pat. No. 6,121,530 the onset time of the voiced sound is taken as the onset time of each note, the time difference between that onset and the onset time of the next note is determined as the span of the note, and the maximum value among the fundamental frequencies contained in each note during its span is defined as the highest pitch value of that note.
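The onset, span and highest-pitch rule described above can be sketched as follows, assuming a fundamental-frequency track is already available as a list of (time, frequency) pairs in which unvoiced frames carry None. The data layout, the function name and the treatment of the last note (its span is measured to the final frame of the track) are assumptions for illustration and are not specified in the description above.

```python
def notes_from_f0_track(track):
    """Derive notes from a list of (time_s, f0_hz or None) frames.

    A note onset is a voiced frame that follows an unvoiced frame (or
    starts the track). The span runs to the next onset, and the pitch is
    the maximum fundamental frequency observed within the span.
    """
    onsets = [
        i for i, (_, f0) in enumerate(track)
        if f0 is not None and (i == 0 or track[i - 1][1] is None)
    ]
    notes = []
    for n, start in enumerate(onsets):
        end = onsets[n + 1] if n + 1 < len(onsets) else len(track)
        voiced = [f0 for _, f0 in track[start:end] if f0 is not None]
        # Assumption: the last note is measured up to the final frame.
        end_time = track[end][0] if end < len(track) else track[-1][0]
        notes.append({
            "onset": track[start][0],
            "span": end_time - track[start][0],
            "pitch": max(voiced),
        })
    return notes

track = [
    (0.0, None), (0.1, 220.0), (0.2, 225.0), (0.3, None),
    (0.4, 330.0), (0.5, 328.0), (0.6, None),
]
notes = notes_from_f0_track(track)
print(notes)
```

Note that the span defined this way includes the silence that follows the voiced part of a note, since it is measured from onset to onset.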
Automatic melody transcription is the extraction of an acceptable musical description from humming. A typical humming signal consists of a sequence of audible waveforms interspersed with silence. However, it is difficult to define the boundary of each note in an acoustic wave, and there is also considerable controversy over exactly what pitch is. Sound recognition therefore involves approximations. Where the boundaries between notes are clear and the pitch is constant, the prior art can produce reasonable results. However, this is not necessarily the case where each audible waveform may contain several notes and the pitch is not maintained, as happens when real people hum. A hummer's inability to maintain a pitch often results in pitch changes within a single note, which may subsequently be misinterpreted as a note change. On the other hand, if a hummer does not pause adequately when humming a string of the same notes, the transcription system might interpret the string as one note. The task becomes increasingly difficult in the presence of expressive variations and the physical limitations of the human vocal system.
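Both failure modes can be reproduced with a deliberately naive transcriber, sketched below. It assumes a per-frame pitch track (with None for silence), quantises each frame to the nearest equal-tempered note number with A4 at 440 Hz, and starts a new note whenever the quantised value changes or silence intervenes. The function names and the example frequencies are hypothetical.

```python
import math

def midi_number(freq_hz, a4_hz=440.0):
    """Nearest equal-tempered note number (A4 = 69)."""
    return round(69 + 12 * math.log2(freq_hz / a4_hz))

def naive_notes(f0_track):
    """Start a new note when the quantised pitch changes or after silence."""
    notes = []
    previous = None
    for f0 in f0_track:
        current = None if f0 is None else midi_number(f0)
        if current is not None and current != previous:
            notes.append(current)
        previous = current
    return notes

# One intended note whose pitch drifts upward from 220 Hz to 232 Hz.
drifting = [220.0, 223.0, 226.0, 229.0, 232.0]
# Two intended notes of the same pitch hummed without a pause between them.
repeated = [220.0] * 6
# The same two notes with an adequate pause.
paused = [220.0, 220.0, None, 220.0, 220.0]

print(naive_notes(drifting))
print(naive_notes(repeated))
print(naive_notes(paused))
```

The drifting note crosses the boundary between two adjacent note numbers and is reported as two notes, while the repeated notes without a pause collapse into one.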