Pitch perception plays an important role in human hearing and in the understanding of sounds. In an acoustic environment a human listener is capable of perceiving the pitches of several sounds simultaneously, and can use the pitch to separate sounds in a mixture of sounds. In general, a sound can be said to have a certain pitch if it can be reliably matched by adjusting the frequency of a sine wave of arbitrary amplitude.
Music transcription as employed herein may be considered to be an automatic process that analyzes a music signal so as to record the parameters of the sounds that occur in the music signal. Generally in music transcription, one attempts to find parameters that constitute music from an acoustic signal that contains the music. These parameters may include, for example, the pitches of notes, the rhythm and loudness.
Reference can be made, for example, to Anssi P. Klapuri, “Signal Processing Methods for the Automatic Transcription of Music”, Thesis for degree of Doctor of Technology, Tampere University of Technology, Tampere FI 2004 (ISBN 952-15-1147-8, ISSN 1459-2045), and to the six publications appended thereto.
Western music generally assumes equal temperament (i.e., equal tuning), in which the ratio of the frequencies of successive semitones (notes that are one half step apart) is a constant. For example, and referring to Klapuri, A. P., “Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness”, IEEE Trans. On Speech and Audio Processing, Vol. 11, No. 6, pp. 804-816, November 2003, it is known that notes can be arranged on a logarithmic scale where the fundamental frequency $F_k$ of a note $k$ is $F_k = 440 \times 2^{k/12}$ Hz. In this system, a′ (440 Hz) receives the value $k = 0$. The notes below a′ (in pitch) receive negative values while the notes above a′ receive positive values. In this system $k$ can be converted to a MIDI (Musical Instrument Digital Interface) note number by adding the value 69. General reference with regard to MIDI can be made to “MIDI 1.0 Detailed Specification”, The MIDI Manufacturers Association, Los Angeles, Calif.
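By way of a non-limiting illustration, the equal-temperament relationship and the MIDI conversion described above may be sketched as follows. The function names and constants are assumptions introduced only for this example and do not appear in the cited references.

```python
# Illustrative sketch only; the names below are hypothetical.
A_PRIME_HZ = 440.0   # frequency of a' (note index k = 0)
MIDI_OFFSET = 69     # value added to k to obtain the MIDI note number


def note_frequency(k: int) -> float:
    """Fundamental frequency in Hz of note k under equal temperament."""
    return A_PRIME_HZ * 2.0 ** (k / 12.0)


def note_index_to_midi(k: int) -> int:
    """Convert note index k (a' = 0) to a MIDI note number."""
    return k + MIDI_OFFSET


if __name__ == "__main__":
    for k in (-12, 0, 3, 12):
        print(k, note_index_to_midi(k), round(note_frequency(k), 2))
```

In this sketch, notes one octave below and one octave above a′ (k = -12 and k = 12) correspond to 220 Hz and 880 Hz, respectively.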
A problem that can arise during pitch extraction is illustrated by the following examples, each of which increases the probability of an error when attempting to locate the best pitch estimates for sung, played, or whistled notes. The examples assume that the relationship $F_k = 440 \times 2^{k/12}$ Hz is used unmodified.
When a skilled vocalist sings a cappella (without an accompaniment), the vocalist is likely to use just intonation as a basis for the scale. Just intonation uses a scale where simple harmonic relations are favored (reference in regard to simple harmonic relations can be made to Klapuri, A. P., “Multipitch Estimation and Sound Separation by the Spectral Smoothness Principle”, Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Salt Lake City, Utah 2001). In just intonation, ratios $m/n$ (where $m$ and $n$ are integers greater than zero) between the frequencies in each note interval of the scale are adjusted so that $m$ and $n$ are small:
$$F = \frac{m}{n} F_r, \qquad (1)$$
where $F_r$ is the frequency of the root note of the key.
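As a hedged illustration of why just intonation can mislead an extractor that assumes equal temperament, the following sketch measures how far a just interval m/n lies from the nearest equal-tempered semitone, in cents (hundredths of a semitone). The function names are hypothetical, and the chosen ratios (3/2 for a perfect fifth, 5/4 for a major third) are common textbook examples rather than values taken from the cited references.

```python
import math


def just_interval_in_semitones(m: int, n: int) -> float:
    """Size of the frequency ratio m/n measured in equal-tempered semitones."""
    return 12.0 * math.log2(m / n)


def deviation_from_equal_temperament_cents(m: int, n: int) -> float:
    """Offset in cents of the ratio m/n from the nearest whole semitone."""
    semitones = just_interval_in_semitones(m, n)
    return 100.0 * (semitones - round(semitones))


if __name__ == "__main__":
    for m, n in ((3, 2), (5, 4), (2, 1)):
        print(f"{m}/{n}: {deviation_from_equal_temperament_cents(m, n):+.2f} cents")
```

Under these assumptions, the just fifth lies about 2 cents above the equal-tempered fifth, while the just major third lies about 14 cents below the equal-tempered major third.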
In addition, an a cappella vocalist may lose the sense of a key and sing an interval so that $m$ and $n$ in the ratio of the frequencies of consecutive notes are small:
$$F_{k+1} = \frac{m}{n} F_k. \qquad (2)$$
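A minimal sketch of the consequence of relationship (2) follows: when each note is tuned relative to the previous note, the small offsets can accumulate. The function name and the example sequence (three consecutive just major thirds, ratio 5/4 each, which nominally span one octave of 12 semitones) are assumptions chosen for illustration.

```python
import math


def chained_pitch_drift_cents(ratios, nominal_semitones: int) -> float:
    """Drift in cents after singing consecutive intervals given as (m, n) pairs.

    nominal_semitones is the equal-tempered size of the whole sequence.
    """
    frequency_ratio = 1.0
    for m, n in ratios:
        frequency_ratio *= m / n  # apply F_{k+1} = (m/n) * F_k
    actual_semitones = 12.0 * math.log2(frequency_ratio)
    return 100.0 * (actual_semitones - nominal_semitones)


if __name__ == "__main__":
    drift = chained_pitch_drift_cents([(5, 4), (5, 4), (5, 4)], 12)
    print(f"drift after three just major thirds: {drift:+.2f} cents")
```

In this example the vocalist ends roughly 41 cents below the equal-tempered octave, i.e., nearly half a semitone, which is large enough to place a pitch estimate close to the boundary between two notes.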
There may also be a constant error in tuning, where an a cappella vocalist may use his/her own temperament by singing consistently out of tune.
An additional problem can arise when music is composed to utilize a tuning other than equal temperament, e.g., as typically occurs in non-Western music.
Ryynänen, M., in “Probabilistic Modelling of Note Events in the Transcription of Monophonic Melodies”, Master of Science Thesis, Tampere University of Technology, 2004, has proposed an algorithm for the tuning of pitch estimates for pitch extraction in the automatic transcription of music. The algorithm initializes and updates a specific histogram mass center $c_t$ based on an initial pitch estimate $x'_t$ for an extracted frequency $F_t$, where $x'_t$ is calculated as:
$$x'_t = 69 + 12 \log_2(F_t / 440). \qquad (3)$$
A final pitch estimate is made as: $x_t = x'_t + c_t$.
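For illustration only, relationship (3) and the final pitch estimate may be sketched as follows. The manner in which the histogram mass center is initialized and updated is not reproduced here; in this sketch the value of the mass center is simply supplied by the caller, and the function and parameter names are hypothetical.

```python
import math


def initial_pitch_estimate(frequency_hz: float) -> float:
    """Relationship (3): fractional MIDI note number for a frequency in Hz."""
    return 69.0 + 12.0 * math.log2(frequency_hz / 440.0)


def final_pitch_estimate(frequency_hz: float, mass_center: float) -> float:
    """Final estimate: initial estimate plus the tuning term c_t.

    mass_center is assumed to be provided by a separate tuning stage
    that is outside the scope of this sketch.
    """
    return initial_pitch_estimate(frequency_hz) + mass_center


if __name__ == "__main__":
    # A note sung about 30 cents sharp of a' (hypothetical input value).
    sung_hz = 440.0 * 2.0 ** (0.3 / 12.0)
    print(round(initial_pitch_estimate(sung_hz), 3))
    print(round(final_pitch_estimate(sung_hz, -0.3), 3))
```

It may be noted that relationship (3) is the inverse of the equal-temperament relationship, offset by 69, so that the corrected estimate is still interpreted against an equal-tempered note grid.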
The foregoing algorithm is based on equal temperament. However, there are some applications that are not well served by an algorithm based on equal temperament, such as when it is desired to accurately extract pitch from audio signals that contain singing or whistling, or from audio signals that represent non-Western music or other music that does not exhibit equal temperament.