Japanese Patent Application Publication No. 2001-117582 (JP2001-117582A) discloses a technique, used in karaoke equipment, for aligning the phoneme sequence of the singing voice of a singing user (inputter) with the phoneme sequence of the singing voice of a particular singer by an alignment means. However, JP2001-117582A does not disclose a technique for temporal alignment between music audio signals and lyrics.
Japanese Patent Application Publication No. 2001-125562 (JP2001-125562A) discloses a technique for extracting dominant sound audio signals from mixed sound audio signals, which contain a polyphonic mixture of vocals and accompaniment sounds, by estimating at each point of time the pitch of the most dominant sound, including the vocal or singing voice. This technique allows extraction of dominant sound audio signals with reduced accompaniment sounds from the music audio signals.
Further, a technique for reducing accompaniment sounds like that disclosed in JP2001-125562A is also disclosed in the academic paper titled “Singer identification based on accompaniment sound reduction and reliable frame selection” written by Hiromasa Fujihara, Hiroshi Okuno, Masataka Goto, et al. in the Journal of Information Processing Society of Japan, Vol. 47, No. 6, June 2006 (Reference 1). Reference 1 also discloses a technique for extracting vocal and non-vocal sections from dominant sound audio signals, using two Gaussian mixture models (GMMs), one trained on vocal sections and the other on non-vocal sections. The document additionally discloses that LPC-derived mel cepstral coefficients are used as vocal features.
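The two-model comparison described above can be illustrated with a minimal sketch. It is not the implementation of the cited paper: the two-dimensional feature vectors, the diagonal-covariance mixtures, and the names `VOCAL_GMM`, `NONVOCAL_GMM`, `gmm_loglik`, and `classify_frames` are all assumptions made for illustration, whereas a real system would use models trained on LPC-derived mel cepstral coefficients. Each frame is simply labeled vocal when the vocal model assigns it the higher log-likelihood.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + diag_gaussian_logpdf(x, m, v) for w, m, v in gmm]
    peak = max(terms)
    # Log-sum-exp keeps the sum numerically stable.
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def classify_frames(frames, vocal_gmm, nonvocal_gmm):
    """Return True for each frame that the vocal GMM explains better."""
    return [
        gmm_loglik(f, vocal_gmm) > gmm_loglik(f, nonvocal_gmm) for f in frames
    ]


# Hypothetical toy models with hand-picked parameters (not trained values).
VOCAL_GMM = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [2.0, 1.5], [0.5, 0.5])]
NONVOCAL_GMM = [(0.5, [-1.0, -1.0], [0.5, 0.5]), (0.5, [-2.0, 0.0], [0.5, 0.5])]

frames = [[1.2, 0.9], [-1.5, -0.5], [2.1, 1.4]]
labels = classify_frames(frames, VOCAL_GMM, NONVOCAL_GMM)
print(labels)
```

A frame-by-frame decision of this kind is usually noisy, so practical systems typically smooth the labels over time before extracting sections.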
The academic paper titled “Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals” written by Hiromasa Fujihara, Hiroshi Okuno, Masataka Goto, et al. in the study report 2006-MUS-66, pages 37-44 of Information Processing Society of Japan (Reference 2) discloses a system for temporal alignment between lyrics of a song and vocals extracted from music audio signals including vocals and accompaniment sounds. In the disclosed system, the most dominant sound at each point of time is segregated from the audio signal including accompaniment sounds based on the harmonic structure of the audio signal in order to locate the start time and end time of each phrase in the lyrics. This step is referred to as “accompaniment sound reduction”. In many cases, the most dominant sound includes vocal vowels in a section that includes the vocal, which is called a vocal section. Then, a vocal section is extracted from the segregated audio signal. This step is referred to as “vocal section detection” or “vocal activity detection”. Further, alignment between the lyrics and the segregated vocal is estimated by means of a forced alignment technique called Viterbi alignment, which is used in speech recognition, using a phone model for singing voice adapted for segregated vocals. The system focuses only on vowels and ignores consonants.
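The forced-alignment idea can be sketched as a small dynamic program. This is a simplified illustration, not the system of Reference 2: it assumes one state per vowel, uniform transition costs, and per-frame log-likelihoods supplied directly as toy numbers, whereas an actual system would obtain them from adapted phone models. The function names `forced_align` and `to_segments` and the example values are hypothetical.

```python
def forced_align(phones, frame_loglik):
    """Assign every frame to one phone, in lyric order, maximizing the score.

    phones: phone labels in the order they occur in the lyrics.
    frame_loglik: one dict per frame mapping phone label -> log-likelihood.
    Returns a non-decreasing list of indices into `phones`, one per frame.
    """
    n_frames, n_states = len(frame_loglik), len(phones)
    if n_states == 0 or n_frames < n_states:
        raise ValueError("need at least one phone and one frame per phone")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = frame_loglik[0][phones[0]]
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            best, prev = (move, s - 1) if move > stay else (stay, s)
            if best > neg_inf:
                score[t][s] = best + frame_loglik[t][phones[s]]
                back[t][s] = prev
    # Trace back from the last phone at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def to_segments(phones, path):
    """Collapse a frame-level path into (phone, first_frame, last_frame)."""
    segments = []
    for t, s in enumerate(path):
        if segments and segments[-1][3] == s:
            segments[-1][2] = t
        else:
            segments.append([phones[s], t, t, s])
    return [(p, start, end) for p, start, end, _ in segments]


# Toy input: three vowels and six frames of hypothetical log-likelihoods.
phones = ["a", "i", "a"]
frame_loglik = [
    {"a": -0.1, "i": -3.0},
    {"a": -0.2, "i": -2.5},
    {"a": -2.0, "i": -0.1},
    {"a": -2.5, "i": -0.3},
    {"a": -0.2, "i": -3.0},
    {"a": -0.1, "i": -2.0},
]
path = forced_align(phones, frame_loglik)
print(to_segments(phones, path))
```

Multiplying the frame indices of each segment by the frame hop duration would give the start and end times of each vowel.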