Typically, in so-called “Karaoke” equipment, utterance (lyrics) and accompaniment sounds (accompaniments) are temporally synchronized and visually displayed when reproducing or playing back digital music data (music audio signals) recorded in a recording medium such as a compact disc (CD), especially digital music data comprising human voices (e.g. vocals) and non-human sounds (e.g. accompaniments).
In existing Karaoke equipment, however, the accompaniment sounds and the vocals of a singer are not exactly synchronized. The lyrics of a song are merely displayed in order on a screen at the tempo or pace planned in the musical score. For this reason, the actual timing of utterance often gets out of alignment with the timing of the lyrics displayed on the screen. In addition, synchronization between the vocals and accompaniment sounds is performed manually, thereby requiring a considerable amount of human effort.
As is typically represented by speech or voice recognition techniques, a technique that analyzes human utterance or speech is conventionally known. This technique is intended to identify uttered portions (lyrics) of digital music data that include vocals alone and do not include accompaniment sounds (which will be hereinafter referred to as “vocals without accompaniments”). With regard to such techniques, some studies have been reported. However, it is extremely difficult to directly apply such speech recognition techniques, which do not take account of the influence given by accompaniment sounds, to commercially available compact disc (CD) recordings or digital music data delivered via a telecommunication network such as the Internet.
One of the studies is directed to vocals accompanied by instrumental sounds and is described in “LyricAlly: Automatic Synchronization of Acoustic Musical Signals and Textual Lyrics” written by Ye Wang, et al. in the proceedings of the 12th ACM International Conference on Multimedia held on 10-15 Oct. 2004 (hereinafter referred to as Non-Patent Reference #1). In this study, the duration of each phoneme is learned and vocals are allocated to a plurality of sections. The technique described in this reference utilizes higher-level information such as beat tracking and detected chorus sections. However, the technique does not take phonologic features (e.g. vowels and consonants) into consideration. As a result, its accuracy is limited. In addition, owing to tight restrictions on the beat and tempo, this technique is not applicable to many kinds of music.
Japanese Patent Publication No. 2001-117582 (hereinafter referred to as Patent Reference #1) discloses a technique of aligning a sequence of phonemes for singing voice or vocals of a user with a sequence of phonemes for vocals of a particular singer using alignment means in Karaoke equipment. However, Patent Reference #1 does not disclose a technique of making temporal alignment between vocal audio signals and lyrics.
Japanese Patent Publication No. 2001-125562 (hereinafter referred to as Patent Reference #2) discloses a technique of extracting a dominant sound audio signal from a mixed sound audio signal including vocals and accompaniment sounds by estimating the pitch of the most dominant sound including a vocal at each time. This technique allows extracting a dominant sound audio signal with reduced accompaniment sounds from the music audio signal.
Further, a technique of reducing accompaniment sounds, as disclosed in Patent Reference #2, is also disclosed in the document entitled “Singer identification based on accompaniment sound reduction and a reliable frame selection” written by Hiromasa Fujihara, Hiroshi Okuno, Masataka Goto, et al. in the Journal of the Information Processing Society of Japan, Vol. 47, No. 6, June 2006 (hereinafter referred to as Non-Patent Reference #2). This document also discloses a technique of extracting a vocal section and a non-vocal section from dominant sound audio signals, using two Gaussian mixture models (GMM) that have learned vocals and non-vocals, respectively. The document additionally discloses that LPC-derived mel cepstral coefficients are used as vocal features.
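The two-model discrimination described above can be illustrated by the following minimal sketch, which labels each feature frame as vocal when the vocal GMM assigns it a higher log-likelihood than the non-vocal GMM. The sketch is not taken from Non-Patent Reference #2. The function names, the diagonal-covariance form, the two-dimensional feature vectors and all numerical parameters are assumptions chosen only for illustration; in practice the model parameters would be learned from training data.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_frames(frames, vocal_gmm, nonvocal_gmm):
    """Return True for each frame that the vocal model explains better."""
    return [gmm_loglik(f, vocal_gmm) > gmm_loglik(f, nonvocal_gmm) for f in frames]

# Hypothetical toy models (assumed values, not learned from real audio).
vocal_gmm = [(0.5, [1.0, 1.0], [0.5, 0.5]), (0.5, [2.0, 0.0], [0.5, 0.5])]
nonvocal_gmm = [(1.0, [-2.0, -2.0], [1.0, 1.0])]

# Hypothetical feature frames: one near the vocal model, one near the non-vocal model.
frames = [[1.1, 0.9], [-2.1, -1.8]]
labels = classify_frames(frames, vocal_gmm, nonvocal_gmm)
```

A frame-by-frame decision of this kind is typically noisy, so the resulting labels would usually be smoothed over time before sections are delimited; that step is omitted here.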
In order to display lyrics that are exactly synchronized with accompaniment sounds, based on the music audio signal comprising human voices (e.g. vocals) and non-human sounds (e.g. accompaniment sounds) as well as lyric information, lyrics having time information are required. In other words, lyrics must be accompanied by time information that indicates, for each particular word of the lyrics, how many seconds after the start of the music performance that word should be uttered. In this specification, such lyrics are referred to as “lyrics tagged with time information”.
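By way of illustration only, “lyrics tagged with time information” may be pictured as a list of words, each paired with the number of seconds elapsed since the start of the performance. The following minimal sketch assumes such a representation; the names TimedWord and word_at, as well as the sample words and times, are hypothetical and are not part of this specification.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass(frozen=True)
class TimedWord:
    start_sec: float  # seconds elapsed since the start of the performance
    text: str         # word of the lyrics to be uttered at that time

def word_at(lyrics, t):
    """Return the word to display at playback time t (seconds).

    `lyrics` must be sorted by start_sec. Returns None before the first word.
    """
    starts = [w.start_sec for w in lyrics]
    i = bisect_right(starts, t) - 1  # index of the last word started at or before t
    return lyrics[i].text if i >= 0 else None

# Hypothetical sample data (assumed values for illustration).
lyrics = [
    TimedWord(12.0, "first"),
    TimedWord(12.6, "second"),
    TimedWord(13.2, "third"),
    TimedWord(13.8, "fourth"),
]
current = word_at(lyrics, 13.5)
```

With such a structure, a display device only needs the current playback time to decide which word to highlight, independently of the tempo written in the musical score.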
It is easy to obtain lyrics in a form of text data, or digital information in a text form. A technique has been demanded that allows fully-automated generation of “lyrics tagged with time information” using “lyric text data” and “music audio signal including vocals uttering the lyrics” (digital music data), with practical accuracy.
Speech recognition is useful in temporally aligning lyrics with a music audio signal including accompaniment sounds. However, the inventors of the present invention have studied and found that a section in which vocals are absent (hereinafter referred to as “non-utterance section” or “non-vocal section”) has an adverse influence, thereby significantly reducing the accuracy of temporal alignment.
Accordingly, an object of the present invention is to provide an automatic system for temporal alignment between a music audio signal and lyrics, which is capable of controlling the influence of the non-vocal section that reduces the accuracy of temporal alignment, and to provide a method of making the temporal alignment and a computer program used in the system for this purpose.