The present disclosure relates to a music signal processing apparatus and method, and a program, and more particularly, to a music signal processing apparatus and method, and a program that are capable of precisely extracting a singing voice without increasing a processing load.
Recently, there has been an increasing demand for searching a large number of musical pieces for a melody related to a singing voice. Examples include a humming search, which searches for a musical piece based on a user's singing voice or humming, and a cover song search, which searches for the original version of a cover-version musical piece.
As a method of estimating a feature amount of the melody related to the singing voice, i.e., a fundamental frequency of the singing voice, from a voice signal of the musical piece, a method of estimating the feature amount from a maximum peak of a frequency spectrum has been proposed (see, for example, M. Goto, “A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass line in real-world audio signals”, Speech Communication (ISCA Journal), Vol. 43, No. 4, pp. 311-329, September 2004).
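The basic idea of taking a fundamental frequency from the maximum peak of a frequency spectrum can be illustrated with a minimal sketch. It is not the system of the cited reference, only a simplified illustration under stated assumptions: the function names `magnitude_spectrum` and `estimate_f0_from_peak`, the search range `fmin`/`fmax`, the sample rate, and the synthetic test frame are all hypothetical choices made for this example.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitude for bins 0..N/2 (O(N^2); adequate for a short demo frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def estimate_f0_from_peak(frame, sample_rate, fmin=80.0, fmax=1000.0):
    """Return the frequency in Hz of the largest spectral peak within [fmin, fmax].

    The search range is an assumption for illustration, not a value from the text.
    """
    n = len(frame)
    spectrum = magnitude_spectrum(frame)
    bin_hz = sample_rate / n
    lo = max(1, math.ceil(fmin / bin_hz))
    hi = min(len(spectrum) - 1, math.floor(fmax / bin_hz))
    peak_bin = max(range(lo, hi + 1), key=lambda k: spectrum[k])
    return peak_bin * bin_hz


# Synthetic frame: a 220 Hz fundamental plus a weaker component at 440 Hz.
# With 400 samples at 8000 Hz the bin width is 20 Hz, so both tones sit on exact bins.
sample_rate = 8000
frame = [
    math.sin(2 * math.pi * 220 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
    for t in range(400)
]
f0 = estimate_f0_from_peak(frame, sample_rate)
```

In this sketch the strongest component in the frame is taken as the fundamental, so in a real mixture a loud accompaniment component could be selected instead of the singing voice.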
Additionally, a method of extracting a singing voice by using pitch fluctuations of the singing voice has also been proposed (see, for example, H. Tachibana, T. Ono, N. Ono, S. Sagayama, “Melody line estimation in homophonic music audio signals based on temporal-variability of melodic source”, in Proc. of ICASSP 2010, pp. 425-428, March 2010).
In the technology described in “Melody line estimation in homophonic music audio signals based on temporal-variability of melodic source”, energy in the frequency direction and energy in the temporal direction are analyzed to extract feature amounts such as the fundamental frequency of the singing voice.
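What it means to analyze energy along the temporal direction and along the frequency direction can be illustrated loosely with a toy example. The sketch below is not the algorithm of the cited paper: the function name `directional_variation`, the squared-difference measure, and the two hand-built spectrograms are assumptions introduced purely for illustration.

```python
def directional_variation(spectrogram):
    """Measure how much energy changes along time and along frequency.

    spectrogram[t][k] is the energy at frame t and frequency bin k.
    Returns (temporal, spectral), each a sum of squared neighbor differences.
    This measure is an illustrative assumption, not taken from the cited work.
    """
    frames = len(spectrogram)
    bins = len(spectrogram[0])
    temporal = sum(
        (spectrogram[t + 1][k] - spectrogram[t][k]) ** 2
        for t in range(frames - 1)
        for k in range(bins)
    )
    spectral = sum(
        (spectrogram[t][k + 1] - spectrogram[t][k]) ** 2
        for t in range(frames)
        for k in range(bins - 1)
    )
    return temporal, spectral


# A steady tone: energy stays in bin 2 for every frame.
steady = [[0, 0, 1, 0, 0] for _ in range(4)]

# A fluctuating pitch: energy moves up one bin per frame.
gliding = [[1 if k == t else 0 for k in range(5)] for t in range(4)]

steady_temporal, steady_spectral = directional_variation(steady)
gliding_temporal, gliding_spectral = directional_variation(gliding)
```

In this toy example the steady tone shows no variation in the temporal direction, whereas the gliding component does, which is the kind of difference a fluctuation-based method can exploit to separate a singing voice from sustained accompaniment.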