1. Technical Field of the Invention
The present invention relates to a sound analysis apparatus and program that estimates pitches (which denotes fundamental frequencies in this specification) of melody and bass sounds in a musical audio signal, which collectively includes a vocal sound and a plurality of types of musical instrument sounds, the musical audio signal being contained in a commercially available compact disc (CD) or the like.
2. Description of the Related Art
It is very difficult to estimate the pitch of a specific sound source in a monophonic audio signal in which sounds of a plurality of sound sources are mixed. One substantial reason why it is difficult to estimate a pitch in a mixed sound is that the frequency components of one sound overlap those of another sound played at the same time in the time-frequency domain. For example, a part (especially, the fundamental frequency component) of the harmonic structure of a vocal sound, which carries the melody, in a piece of typical popular music played with a keyboard instrument (such as a piano), a guitar, a bass guitar, a drum, etc., frequently overlaps harmonic components of the keyboard instrument and the guitar or high-order harmonic components of the bass guitar, and noise components included in sounds of a snare drum or the like. For this reason, techniques of locally tracking each frequency component do not reliably work for complex mixed sounds. Some techniques estimate a harmonic structure with the assumption that fundamental frequency components are present. However, these techniques have a serious problem in that they do not address the missing fundamental phenomenon. Also, these techniques are not effective when the fundamental frequency components overlap frequency components of another sound played at the same time.
Although, for this reason, some conventional technologies estimate a pitch in an audio signal containing a single sound alone or a single sound with aperiodic noise, no technology has been provided to estimate a pitch in a mixture of a plurality of sounds such as an audio signal recorded in a commercially available CD.
However, a technology to appropriately estimate the pitches of sounds included in a mixed sound using a statistical technique has been proposed recently. This technology is described in Japanese Patent Registration No. 3413634.
In the technology of Japanese Patent Registration No. 3413634, frequency components in a frequency range which is considered that of a melody sound and frequency components in a frequency range which is considered that of a bass sound are separately obtained from an input audio signal using BPFs and the fundamental frequency of each of the melody and bass sounds is estimated based on the frequency components of the corresponding frequency range.
More specifically, the technology of Japanese Patent Registration No. 3413634 prepares tone models, each of which has a probability density distribution corresponding to the harmonic structure of a corresponding sound, and assumes that the frequency components of each of the frequency ranges of the melody and bass sounds have a mixed distribution obtained by weighted mixture of tone models corresponding respectively to a variety of fundamental frequencies. The respective weights of the tone models are estimated using an Expectation-Maximization (EM) algorithm.
The EM algorithm is an iterative algorithm which performs maximum-likelihood estimation of a probability model including a hidden variable and thus can obtain a local optimal solution. Since a probability density distribution with the highest weight can be considered that of a harmonic structure that is most dominant at the moment, the fundamental frequency of the most dominant harmonic structure can then be determined to be the pitch. Since this technique does not depend on the presence of fundamental frequency components, it can appropriately address the missing fundamental phenomenon and can obtain the most dominant harmonic structure regardless of the presence of fundamental frequency components.
However, such a simply determined pitch may be unreliable since, if peaks corresponding to fundamental frequencies of sounds played at the same time are competitive in a fundamental frequency probability density function, these peaks may be selected in turns as the maximum value of the probability density function. Thus, to estimate a fundamental frequency from a broad viewpoint, the technology of Japanese Patent Registration No. 3413634 successively tracks the trajectory of a plurality of peaks in the fundamental frequency probability density function as the function changes with time, and selects a fundamental frequency trajectory, which is most dominant and reliable (or stable), from the tracked trajectories. A multi-agent model has been introduced to dynamically and flexibly control this tracking process.
The multi-agent model includes one salience detector and a plurality of agents. The salience detector detects salient peaks that are prominent in the fundamental frequency probability density function. The agents are activated basically to track the trajectories of the peaks. That is, the multi-agent model is a general-purpose framework that temporally tracks features that are prominent in an input audio signal.
However, the technology described in Japanese Patent Registration No. 3413634 has a problem in that every frequency in the pass range of the BPF may be estimated to be a fundamental frequency. For example, when an input audio signal is generated by playing a specific musical instrument, we cannot exclude the possibility that the fundamental frequency of a false sound which could not be generated by playing the musical instrument is erroneously estimated to be a fundamental frequency in the input audio signal.