Karaoke refers to singing along with a prerecorded musical accompaniment. Typical karaoke systems enable users to sing along with a background music audio track and display the lyrics of the song on a visual interface. These systems usually focus on one or another aspect of the singing experience, such as live performance, user recording, or music learning.
Firstly, the display of required metadata or characteristics, such as time-aligned lyrics and melody information, is a typical function of any karaoke system. An extended and detailed representation of characteristics, extracted from the original polyphonic music track, is required for a comprehensive vocal performance system that is expected to provide detailed performance analysis and enhancements to user recordings. However, methods in the prior art describe the extraction of features related to melody, rhythm, loudness, and timbre for use in scoring or transcribing music data (audio to music score).
Secondly, karaoke systems play the background music of a song so that a user can perform along with it. Ideally, the background music should be the original background music used in the song. However, such music is often not separately available or comes at a high premium. Most karaoke systems therefore use re-created or re-recorded background music. Some systems do attempt to extract the original background music by eliminating the main or lead voice from the original track, which is also known as de-soloing. The main voice could be that of a human singer or of a melodic instrument. Such systems require the original song to be available in stereo (2-channel) format, in which the vocals are generally panned to the center of the stereo image. These systems attempt to eliminate the vocals by applying processing that cancels the center channel. However, such methods suffer from several drawbacks, listed as follows:
    Other instruments that are panned to the center are removed as well;
    Although some systems attempt to process the vocals in isolation by limiting the frequency range of processing to the vocal frequency range, this still does not help with the several instruments whose frequency ranges overlap with that of the voice;
    For several songs the vocals are not panned to the center, which results in incomplete removal;
    A lot of popular music was recorded decades ago using monophonic recording technology and is not available in stereo format; and
    They are not able to suppress effects, such as reverberation and echo, that have been applied to the original singer's voice.
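The center-channel cancellation described above can be illustrated with a minimal sketch. It assumes the simplest form of the technique, subtracting one stereo channel from the other; the function name cancel_center and the toy signals (vocal, bass, guitar) are hypothetical and chosen only for illustration, not taken from any particular prior-art system.

```python
import math

def cancel_center(left, right):
    """Subtract the right channel from the left, sample by sample.

    Anything mixed identically into both channels (center-panned) cancels.
    """
    if len(left) != len(right):
        raise ValueError("left and right channels must have the same length")
    return [l - r for l, r in zip(left, right)]

# Hypothetical toy mix, 16 samples long (assumed values, illustration only).
n = 16
vocal = [math.sin(2 * math.pi * 2 * k / n) for k in range(n)]          # panned center
bass = [0.8 * math.sin(2 * math.pi * 1 * k / n) for k in range(n)]     # panned center
guitar = [0.5 * math.cos(2 * math.pi * 3 * k / n) for k in range(n)]   # panned hard left

# Center-panned sources appear equally in both channels.
left = [v + b + g for v, b, g in zip(vocal, bass, guitar)]
right = [v + b for v, b in zip(vocal, bass)]

# Only the hard-left guitar survives the subtraction.
residual = cancel_center(left, right)
```

In this toy mix the residual contains only the guitar: the vocal is removed, but so is the center-panned bass, which is the first drawback listed above. The result is also a single channel, and the subtraction cannot be performed at all on a monophonic source.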
Thirdly, for the purpose of providing a score and feedback to a user, a comprehensive evaluation of a user's singing proficiency would require comparing the user's recording with the original singer's rendition on different musical dimensions, viz. pitch correctness, rhythm correctness, pitch modulation, vocal dynamics, vocal pitch range, pronunciation of lyrics, and voice quality. Most scoring systems in the prior art consider two or three of the above dimensions, typically only the first two, viz. pitch and timing correctness. Further, the evaluation of pitch, which is often the single most important musical attribute, is usually restricted to musical notes (in a standard scale). This gives an incomplete, and often incorrect, assessment of singer proficiency, since a vocal pitch trajectory is a smooth and continuous variation (often with a lot of modulation), unlike a sequence of discrete notes (as would be played on a piano). Using a singing evaluation system in a contest or audition (categorizing singers based on their singing attempts without human intervention), or in music learning, would require detailed scoring along all of the above dimensions.
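The difference between a note-level view and a continuous pitch trajectory can be sketched as follows. The example assumes the standard equal-tempered mapping from frequency to MIDI note number (A4 = 440 Hz = note 69); the vibrato rate, depth, and frame rate are hypothetical values chosen for illustration.

```python
import math

A4_HZ = 440.0   # reference tuning
A4_MIDI = 69    # MIDI note number of A4

def hz_to_midi(f_hz):
    """Map a frequency in Hz to a fractional MIDI note number."""
    return A4_MIDI + 12.0 * math.log2(f_hz / A4_HZ)

def nearest_note_and_cents(f_hz):
    """Return the nearest scale note and the deviation from it in cents."""
    midi = hz_to_midi(f_hz)
    note = math.floor(midi + 0.5)
    return note, 100.0 * (midi - note)

# Hypothetical sung A4 with vibrato: 5 Hz rate, +/-40 cents depth,
# sampled at 100 pitch frames per second for one second.
frame_rate = 100
vibrato_hz = 5.0
depth_cents = 40.0
contour_hz = [
    A4_HZ * 2.0 ** (depth_cents / 1200.0
                    * math.sin(2 * math.pi * vibrato_hz * k / frame_rate))
    for k in range(frame_rate)
]

analysis = [nearest_note_and_cents(f) for f in contour_hz]
notes = {note for note, _ in analysis}   # what a note-level evaluation sees
cents = [c for _, c in analysis]         # what it discards
```

Every frame of this contour quantizes to the same note, so a note-level evaluation cannot distinguish it from a flat, unmodulated tone, while the per-frame deviation in cents retains the modulation.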
Fourthly, every singing session can result in the creation of a new version of the song if the user's voice is recorded during singing. However, the user's recorded voice, as it is, may not be attractive or exciting enough for the version to proliferate. A system should behave analogously to a recording studio, where the user's voice is enhanced and then appropriately mixed with the background music, resulting in a composite cover version track. Enhancements typically take the form of vocal effects. Vocal effects are broadly divided into two categories: transformative and corrective. Transformative effects involve processing that changes the voice identity, such as helium, chipmunk, duck, and gender transformation (male to female and vice versa), and also include effects like echo and reverb. Corrective effects enhance the quality of the recording by actually correcting singing errors in the user's recording. An example of the latter is the correction of the pitch of the user's singing. Pitch correction effects usually only correct the user's pitch to a discrete reference note from a musical scale. As mentioned previously, the vocal pitch trajectory is a smooth and continuous variation (often with a lot of modulation), unlike a sequence of discrete notes (as would be played on a piano). Such correction therefore results in a drastic loss of naturalness in the user's voice, and often only sounds corrected when the user's pitch is already close to the correct note. Although there are correction mechanisms that change the pitch of the user's singing to that of another reference voice, these rely heavily on perfect timing coincidence between the user and the reference, which is not the case for amateur singers requiring correction.
There has thus been a persistent need for music performance systems that have typical karaoke functionality and can display feedback about singing proficiency in comparison with a benchmark. Furthermore, on completion of singing, such a music performance system should give the user comprehensive, credible, and detailed feedback on multiple musical dimensions, make suggestions for user improvement, and recommend suitable songs for the user to attempt. Such a system should also be able to identify cases where the user is cheating in order to get a high score, e.g. by playing back the original song into the system or by just humming the melody without uttering the lyrics. It should further facilitate the creation of an attractive user version of the song by enhancing the user recording in various ways, and by time-aligning and mixing it with the background music.
The music performance system discussed in the preceding paragraph should extend the stated functionalities to scenarios in which the gender of the user is different from that of the original singer, i.e. a male singing a female song and vice versa, and also to songs in which there are two singers (duets), by treating each singer's part separately. Additionally, the system should enable the stated functionalities on different platforms, such as computers and hand-held devices, over different media, such as mobile and Internet networks, and in real-world environments in which the user recordings may be noisy. Often, recording the user's voice on such platforms and transmitting the recording over such media result in a variable delay or time lag of the user recording relative to the background music. A method of time alignment then becomes critical to enable correct performance analysis and cover-version creation.
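One generic way to estimate such a time lag is to search for the delay that maximizes the cross-correlation between a reference signal and the delayed recording. The sketch below is an illustration of that general technique only, not the alignment method of the present system; the function name estimate_lag, the max_lag parameter, and the sample values are hypothetical.

```python
def estimate_lag(reference, recording, max_lag):
    """Return the delay (in samples) of recording relative to reference.

    Tries every lag from 0 to max_lag and keeps the one with the highest
    cross-correlation score.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        overlap = min(len(reference), len(recording) - lag)
        if overlap <= 0:
            break  # no samples left to compare at this lag
        score = sum(reference[i] * recording[i + lag] for i in range(overlap))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Hypothetical reference signal (assumed values, illustration only).
reference = [0.6, 1.0, -0.5, 0.3, -1.0, 0.8, 0.2, -0.7, 0.4, -0.1]

# Simulate a recording that arrives three samples late.
recording = [0.0] * 3 + reference

lag = estimate_lag(reference, recording, max_lag=5)

# Drop the leading samples so the recording lines up with the reference.
aligned = recording[lag:]
```

Dividing the estimated lag by the sample rate gives the delay in seconds. A practical system would also have to cope with noise and with a user signal that differs from the reference, which this toy example does not attempt.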
Thus, such a music performance system can be used not only for music entertainment, i.e. a karaoke singing experience, but will also be suitable for music tutoring, singing contests and auditions, and as a cover version creation system.
Such a system can be used for music performances including singing voice and other musical instruments.
The present invention is contrived in consideration of the circumstances mentioned hereinbefore, and is intended to provide a comprehensive music performance system with the aforementioned capabilities.