Multimedia content has become extremely popular over recent years. The popularity of such multimedia content is mainly due to the convenience of transferring and storing such content. This convenience is made possible by the wide availability of audio formats, such as the MP3 format, which are very compact, and an increase of media bandwidth to the home, such as broadband Internet. Also, the emergence of 3G wireless devices assists in the convenient distribution of multimedia content.
With such a large amount of multimedia content being available to users, an increasing need exists for an effective searching mechanism for multimedia content. One possible way of searching is “retrieval by humming”, whereby a user searches for a desired musical piece by humming the melody of that musical piece to a system. In response, the system outputs to the user information about the musical piece associated with the hummed melody.
Humming is defined herein as singing a melody of a song without expressing the actual words or lyrics of that song.
Besides multimedia retrieval purposes, transcribing melodies that are in acoustic waveforms, such as a humming signal, into a written representation, for example musical notes, is very useful as well. Songwriters can compose tunes without a need for instruments, or students can practice by humming on their own.
As a result, effective processing of humming signals into musical notes is desirable. The musical notes should contain information such as the pitch, the start time and the duration of the respective notes.
In order to effectively process such a humming signal, two distinct steps are required. The first step is the segmentation of the acoustic wave representing the humming signal into notes, thereby determining the start time and duration of each note, and the second step is the detection of the pitch of each segment (or note). The segmentation of the acoustic wave is not as straightforward as it may appear, as there is difficulty in defining the boundary of each note in an acoustic wave. Also, there is considerable controversy over exactly what pitch is.
In the case where the note is made up of a single frequency, the frequency of the note is also the pitch. However, a musical note, especially when produced by a human vocal system, is made up of more than one frequency. Accordingly, pitch generally refers to the fundamental frequency of a note.
In most prior art, it is assumed that each note will have a peak in amplitude/power or will be separated by a reasonable amount of silence, and these aspects are used for the segmentation of the acoustic signal. In reality the segmentation of the acoustic signal is considerably more complex.
For example, as is described in U.S. Pat. No. 5,874,686 issued on Feb. 23, 1999, after the peak energy levels of the signal are isolated and tracked, autocorrelation is performed on the signal around those peaks to detect the pitch of each note. In order to improve the performance, speed and robustness of the pitch-tracking algorithm, a cubic-spline wavelet transform (or other suitable wavelet transform) is used.
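The autocorrelation step mentioned above can be illustrated with a minimal sketch. It is not the method of the cited patent: the wavelet transform and the peak-energy tracking are omitted, and the function name `autocorr_pitch`, the search range of 80 to 800 Hz and the synthetic test note are assumptions made purely for illustration.

```python
import math

def autocorr_pitch(samples, sample_rate, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    The lag with the highest autocorrelation inside the assumed
    [fmin, fmax] range is taken as the pitch period.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(samples) - 1)
    best_lag, best_score = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalised autocorrelation at this lag.
        score = sum(samples[i] * samples[i + lag]
                    for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0

# Synthetic note: 200 Hz fundamental plus a weaker 400 Hz harmonic,
# i.e. a note made up of more than one frequency.
sr = 8000
frame = [math.sin(2 * math.pi * 200 * n / sr)
         + 0.5 * math.sin(2 * math.pi * 400 * n / sr)
         for n in range(800)]
pitch = autocorr_pitch(frame, sr)
print(pitch)
```

Although the frame contains two frequencies, the strongest autocorrelation lag corresponds to the period of the fundamental, which is why pitch is generally identified with the fundamental frequency.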
U.S. Pat. No. 5,038,658 issued on Aug. 13, 1991 discloses segmentation based on both power and pitch information. The final note boundaries are determined without being influenced by fluctuations in acoustic signals or abrupt intrusions of outside sounds.
In the method disclosed in International publication No. WO2004034375, the humming signal is subjected to a process of segmentation based on amplitude gradient that comprises the steps of subjecting the signal to a process of envelope detection, followed by a process of differentiation to calculate a gradient function. This gradient function is then used to determine the note boundaries.
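The envelope-and-gradient approach described above can be sketched as follows. This is an assumed, simplified reading of the general idea and not the procedure of the cited publication: the envelope detector (rectification followed by a moving average), the window length, the gradient step and the threshold are all hypothetical choices.

```python
import math

def envelope(samples, window):
    """Rectify the signal and smooth it with a moving average (assumed detector)."""
    rect = [abs(s) for s in samples]
    env, running = [], 0.0
    for i, v in enumerate(rect):
        running += v
        if i >= window:
            running -= rect[i - window]
        env.append(running / min(i + 1, window))
    return env

def gradient(env, step):
    """Finite-difference gradient of the envelope over `step` samples."""
    return [env[i] - env[i - step] for i in range(step, len(env))]

def onsets_from_gradient(grad, step, threshold):
    """Sample indices where the gradient rises through the threshold."""
    return [k + step for k in range(1, len(grad))
            if grad[k - 1] < threshold <= grad[k]]

def tone(freq, amp, n, sr):
    return [amp * math.sin(2 * math.pi * freq * i / sr) for i in range(n)]

# Synthetic signal: silence, a 200 Hz note, silence, a softer 250 Hz note.
sr = 8000
signal = ([0.0] * 800 + tone(200, 1.0, 1600, sr)
          + [0.0] * 800 + tone(250, 0.8, 1600, sr))

step = 160                      # 20 ms at 8 kHz, hypothetical
env = envelope(signal, window=step)
grad = gradient(env, step)
onsets = onsets_from_gradient(grad, step, threshold=0.3)
print(onsets)                   # one index shortly after each note start
```

In this idealised signal each note is preceded by clear silence, so the gradient shows one clean rise per note. Real humming rarely offers such clean boundaries.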
Segmentation may also be done by distinguishing between the characteristics of the onset/offset (unvoiced) and steady-state (voiced) portions of a note. A known technique from the field of speech recognition for performing voiced/unvoiced discrimination relies on the estimation of the Root Mean Square (RMS) power and the Zero Crossing Rate.
Yet another method used for segmenting an acoustic signal first groups a data sample stream of the acoustic signal into frames, with each frame including a predetermined number of data samples. It is usual for the frames to have some degree of overlap of samples. A spectral transformation, such as the Fast Fourier Transform (FFT), is performed on each frame, and a fundamental frequency is obtained. This creates a frequency distribution over the frames. Segmentation is then performed by tracking clusters of similar frequencies. Energy or power information is often also used for analysing the signal to identify repeated or glissando notes within each group of frames having a similar frequency distribution.
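The two measures named above can be computed per frame as in the following sketch. The decision rule `is_voiced` and its thresholds (`rms_min`, `zcr_max`) are hypothetical values chosen for the synthetic frames below, not values taken from any cited method.

```python
import math
import random

def rms(frame):
    """Root Mean Square power of a frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / (len(frame) - 1)

def is_voiced(frame, rms_min=0.1, zcr_max=0.2):
    """Assumed rule: voiced frames are loud and cross zero rarely."""
    return rms(frame) > rms_min and zero_crossing_rate(frame) < zcr_max

sr = 8000
# Steady-state (voiced) portion: a 200 Hz tone.
voiced = [math.sin(2 * math.pi * 200 * n / sr) for n in range(400)]
# Onset/offset-like (unvoiced) portion: low-level noise, seeded for repeatability.
rng = random.Random(0)
noise = [rng.uniform(-0.05, 0.05) for _ in range(400)]

voiced_rms, voiced_zcr = rms(voiced), zero_crossing_rate(voiced)
noise_rms, noise_zcr = rms(noise), zero_crossing_rate(noise)
print(voiced_rms, voiced_zcr, is_voiced(voiced))
print(noise_rms, noise_zcr, is_voiced(noise))
```

The tone has high RMS power and a low Zero Crossing Rate, while the noise shows the opposite pattern, which is the contrast the discrimination relies on.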
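The frame-based approach described above can be sketched as follows, under several simplifying assumptions: a plain discrete Fourier transform stands in for the FFT so that the example needs no external library, the strongest spectral bin is taken as the fundamental frequency (which is not generally reliable for real voices), and the frame size, hop size and grouping tolerance are hypothetical.

```python
import cmath
import math

def split_frames(samples, size, hop):
    """Group the sample stream into overlapping frames (overlap = size - hop)."""
    return [samples[s:s + size]
            for s in range(0, len(samples) - size + 1, hop)]

def peak_frequency(frame, sample_rate):
    """Frequency (Hz) of the strongest DFT bin, used here as the fundamental."""
    n = len(frame)
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * m / n)
                  for m, x in enumerate(frame))
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

def group_frames(freqs, tolerance_hz):
    """Cluster consecutive frames whose frequency stays near the group's first frame.

    Returns a list of [first_frame, last_frame, frequency] entries.
    """
    groups = []
    for i, f in enumerate(freqs):
        if groups and abs(f - groups[-1][2]) <= tolerance_hz:
            groups[-1][1] = i
        else:
            groups.append([i, i, f])
    return groups

# Synthetic signal: a 250 Hz note followed directly by a 500 Hz note.
sr = 8000
signal = [math.sin(2 * math.pi * (250 if n < 1088 else 500) * n / sr)
          for n in range(2048)]

frames = split_frames(signal, size=256, hop=128)   # 50% overlap
freqs = [peak_frequency(f, sr) for f in frames]
groups = group_frames(freqs, tolerance_hz=40.0)
print(groups)
```

Each resulting group is a candidate note. Repeated notes at the same frequency would fall into a single group, which is why energy or power information is often consulted as well.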
The prior art methods described above lead to inaccuracies in the segmentation of humming signals, and inaccuracy in the segmentation directly leads to poor results in overall transcription of the humming signal into musical notes.
Tracking of frequency changes alone cannot accurately segment notes because, in practice, fast repeating or glissando notes will exist within the humming signal. As a result, pauses between these notes cannot be identified easily. Furthermore, a person creating the humming signal is generally unable to maintain a constant pitch. This results in pitch changes within a single note, which may subsequently be misinterpreted as a note change.
Using the energy or power distribution to segment the humming signal into notes, whether that distribution is obtained from average energy over frames or from amplitude/power over samples, has its own difficulties. For example, the difference in energy level between high-energy and low-energy notes is often large. Accordingly, a single global threshold cannot be applied to the energy distribution. An adaptive threshold is required, which in turn requires significant processing time because the value of the adaptive threshold is difficult to calculate. This is particularly true for acoustic signals derived from a male as there is generally no specific pattern in the change in the energy or power information. Hummed songs have fluctuations in relation to the pattern of change. In addition, the sound to be transcribed often contains abrupt sounds, such as outside noises. In these circumstances, a simple segmentation of sound based on changes in the power information would not necessarily lead to a good segmentation of individual sounds.
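The difficulty with a global threshold can be illustrated with a small sketch. The per-frame energy values below are invented for illustration only and do not come from a real recording: two loud notes are separated by a shallow dip, followed by one much quieter note. The helper name `segments_above` is likewise hypothetical.

```python
def segments_above(energies, threshold):
    """Return (first_frame, last_frame) runs whose energy exceeds the threshold."""
    segs, start = [], None
    for i, e in enumerate(energies):
        if e > threshold and start is None:
            start = i
        elif e <= threshold and start is not None:
            segs.append((start, i - 1))
            start = None
    if start is not None:
        segs.append((start, len(energies) - 1))
    return segs

# Illustrative (made-up) frame energies: loud, loud, then quiet note.
energies = [0.0, 0.9, 1.0, 0.9, 0.3, 0.9, 1.0, 0.8,
            0.0, 0.0, 0.1, 0.12, 0.1, 0.0]

high = segments_above(energies, 0.5)    # splits the loud notes, misses the quiet one
low = segments_above(energies, 0.05)    # finds the quiet note, merges the loud ones
print(high)
print(low)

# No global threshold on a 0.01 grid recovers all three notes.
found_three = any(len(segments_above(energies, t / 100)) == 3
                  for t in range(1, 100))
print(found_three)
```

A threshold high enough to separate the loud notes lies above the peak of the quiet note, and a threshold low enough to catch the quiet note lies below the dip between the loud ones, so the three notes are never recovered together.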
Furthermore, if the person humming does not pause adequately when humming a string of the same notes, the transcription system might interpret the string of the same notes as a single note. The task also becomes increasingly difficult in the presence of expressive variations and the physical limitation of the human vocal system.