It is well known that it is difficult to speak or sing along with an audio or audio/video clip such that the new performance is a precisely synchronized repetition of the original actor's or singer's words. Consequently, a recording of the new performance is very unlikely to have its start and detailed acoustic properties synchronized with those of the original audio track. Similarly, features such as the pitch of a new singer may not be as accurate or intricately varied as those of the original singer. There are many instances in the professional audio recording industry and in consumer computer-based games and activities where a sound recording is made of a voice and the musical pitch of the newly recorded voice would benefit from pitch adjustment, generally meaning correction, to put it in tune with an original voice recording. In addition, a recording of a typical amateur singing, even if in tune, will not have the skilful vocal style and pitch inflections of a professional singer.
FIG. 4 displays pitch measurements of a professional singer (Guide Pitch 401) and a member of the public (New Pitch 402) singing the same words to the same musical track. The timing discrepancies between the onsets and offsets of corresponding sections (pulses) of voiced signals (non-zero Hz pitch values), as well as the positions of unvoiced or silent sections (at zero Hz), are frequent and significant. Applying pitch data from the Guide Pitch 401 directly at the same relative times to the data of the New Pitch 402 would clearly be wrong and inappropriate for a substantial part of the segment shown. This is a typical result and illustrates the basic problems to be solved.
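The scale of the mismatch described above can be illustrated with a minimal sketch. It assumes two pitch tracks sampled on the same frame grid, with 0 Hz marking unvoiced or silent frames as in FIG. 4; the function name `voicing_mismatch` and the sample values are hypothetical and are not taken from the figure.

```python
def voicing_mismatch(guide_hz, new_hz):
    """Count frames where the two tracks agree or disagree on voicing.

    A frame is treated as voiced when its pitch value is above 0 Hz.
    Only the overlapping part of the two tracks is compared.
    """
    n = min(len(guide_hz), len(new_hz))
    both_voiced = guide_only = new_only = 0
    for g, v in zip(guide_hz[:n], new_hz[:n]):
        guide_voiced = g > 0
        new_voiced = v > 0
        if guide_voiced and new_voiced:
            both_voiced += 1
        elif guide_voiced:
            guide_only += 1   # guide pitch exists, but nothing to apply it to
        elif new_voiced:
            new_only += 1     # new voice is singing, but no guide pitch exists
    return {
        "both_voiced": both_voiced,
        "guide_only": guide_only,
        "new_only": new_only,
        "frames": n,
    }


# Hypothetical frame-by-frame pitch values in Hz (0 = unvoiced or silent).
guide_pitch = [0, 220, 220, 220, 0, 0, 247, 247]
new_pitch = [0, 0, 0, 210, 215, 0, 0, 240]
print(voicing_mismatch(guide_pitch, new_pitch))
```

Frames counted as `guide_only` or `new_only` are those where a direct, same-time transfer of guide pitch has either no voiced input to modify or no target pitch to apply.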
Musical note-by-note pitch adjustment can be applied automatically to recorded or live singing by commercially available hardware and software devices, which generally tune incoming notes to specified fixed grids of acceptable note pitches. In such systems, each output note can be corrected automatically, but this approach can often lead to unacceptable or displeasing results because it can remove natural and desirable “human” variations.
The fundamental basis for target pitch identification in such known software and hardware devices is a musical scale, which is essentially a list of the specific note frequencies to which the device should first compare the input signal. Most devices come with presets for standard musical scales and allow customisation of these, for example to change the target pitches or to leave certain pitched notes unaltered.
The known software devices can be set to an automatic mode, which is also generally how the hardware devices work: the device detects the input pitch, identifies the closest scale note in a user-specified preset scale, and changes the input signal such that the output pitch matches the pitch of the specified scale's note. The rate at which the output pitch is slewed and retuned to the target pitch, sometimes described as “speed”, is controlled to help maintain natural pitch contours (i.e. pitch as a function of time) and to allow a wider variety of “styles”.
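The automatic mode described above can be sketched as follows. This is an illustrative approximation only, not the method of any particular commercial device: it assumes equal temperament with A4 = 440 Hz, a C major scale, a per-frame pitch track in Hz with 0 marking unvoiced frames, and a simple first-order smoother standing in for the “speed” control. All function and parameter names are hypothetical.

```python
import math

A4_HZ = 440.0                      # assumed reference tuning
C_MAJOR = (0, 2, 4, 5, 7, 9, 11)   # pitch classes of the assumed preset scale


def hz_to_midi(f_hz):
    """Convert a frequency in Hz to a fractional MIDI note number."""
    return 69.0 + 12.0 * math.log2(f_hz / A4_HZ)


def midi_to_hz(m):
    """Convert a (possibly fractional) MIDI note number to Hz."""
    return A4_HZ * 2.0 ** ((m - 69.0) / 12.0)


def nearest_scale_note(midi_pitch, scale=C_MAJOR):
    """Return the scale note (as a MIDI number) closest to midi_pitch."""
    base = math.floor(midi_pitch / 12.0) * 12
    # Search the octave below, the current octave, and the octave above.
    candidates = [base + off + pc for off in (-12, 0, 12) for pc in scale]
    return min(candidates, key=lambda n: abs(n - midi_pitch))


def retune(pitch_hz, scale=C_MAJOR, speed=0.2):
    """Slew each voiced frame toward its nearest scale note.

    speed is the fraction of the remaining correction applied per frame:
    1.0 snaps immediately, smaller values retune more gradually.
    """
    out = []
    correction = 0.0               # correction currently applied, in semitones
    for f in pitch_hz:
        if f <= 0.0:               # unvoiced frame: pass through and reset
            out.append(0.0)
            correction = 0.0
            continue
        m = hz_to_midi(f)
        wanted = nearest_scale_note(m, scale) - m
        correction += speed * (wanted - correction)
        out.append(midi_to_hz(m + correction))
    return out
```

With `speed=1.0` every voiced frame is forced onto the scale note at once, which removes the natural variations noted earlier; lower values preserve more of the input contour at the cost of slower correction.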
However, the recorded singing of an amateur cannot be enhanced by such known automatic adjustment techniques to achieve the complex and skilled pitch variations found in the performance of a professional singer.
There are also known voice processing methods and systems which perform pitch correction and/or other vocal modifications by using target voices or other stored sequences of target voice parameter data to specify the desired modifications. These known methods have one or more significant shortcomings. For example:
1. The target pitch (or other vocal feature) that is being applied to the user's input voice signal rigidly follows the timing of a Karaoke track or other such accompaniment that the user sings to, generally in real time, and no attempt is made to align corresponding vocal features (U.S. Pat. No. 5,966,687, Japanese patent 2003044066). If the user's voice starts too early relative to the timing of the target feature (e.g. pitch) data, then the target feature will be applied, wrongly, to later words or syllables. A similar problem arises if the user's voice is late. Within phrases, any words or syllables that are out of time with the music track will be assigned the wrong pitch or other feature for that word or syllable. Similarly, any voiced segments that occur when unvoiced segments are expected receive no stored target pitch or other target feature information.
2. The target pitch (or other vocal feature) being applied to the user's input voice relies on and follows the detection of an expected stored sequence of input phonemes or similarly voiced/unvoiced patterns or just vowels (e.g. U.S. Pat. No. 5,750,912). Such methods generally require user training or inputting of fixed characteristics of phoneme data and/or require a sufficiently close pronunciation of the same words for accurate identification to occur. If there is no training and the user's phoneme set differs sufficiently from the stored set to not be recognized, the system will not function properly. If the user's phonemes are not held long enough, or are too short, the output notes can be truncated or cut off. If phonemes arrive too early or too late, the pitch or feature might be applied to the right phoneme, but it will be out of time with the musical accompaniment. If the user utters the wrong phoneme(s), the system can easily fail to maintain matches. Moreover, in a song, a single phoneme will often be given a range of multiple pitches and/or a continuum of pitches, on which a phoneme-based system would be unlikely to implement the correct pitch or feature changes. Accurate phoneme recognition also requires a non-zero processing time, which could delay the application of the correct features in a real-time system. Non-vocal sounds (e.g. a flute) cannot be used as guide signals or inputs.
3. The target pitch model is based on a set of discrete notes, typically described by tables (e.g. as MIDI data), which is generally quantized in both pitch and time. In this case, the modifications to the input voice are limited to the stored notes. This approach leads to a restricted set of available vocal patterns that can be generated. Inter-note transitions, vibrato and glissando control would generally be limited to coarse note-based descriptors (i.e. MIDI). Also, the processed pitch-corrected singing voice can take on a mechanical (monotonic) sound, and if the pitch is applied to the wrong part of a word by mistiming, then the song will sound oddly sung and possibly out of tune as well.
4. The system is designed to work in near real-time (as in a live Karaoke system) and create an output shortly (i.e. within a fraction of a second) after the input (to be corrected) has been received. Systems that use phoneme or similar features (e.g. U.S. Pat. No. 5,750,912) are restricted to a very localized time slot. Such systems can get out of step, leading, for example, to the Karaoke singer's vowels being matched to the wrong part of the guiding target singing.
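The loss of detail noted in the third shortcoming can be illustrated with a minimal sketch. It assumes a hypothetical sung note with vibrato of ±0.3 semitones at 5.5 Hz around MIDI note 69, sampled at 100 frames per second, and rounds each frame to the nearest whole note number, as a discrete note-based target model effectively does. The function names and parameter values are illustrative only.

```python
import math


def vibrato_contour(center_note=69.0, depth_semitones=0.3,
                    rate_hz=5.5, frames_per_second=100, n_frames=100):
    """Return a pitch contour in fractional note numbers with sinusoidal vibrato."""
    return [
        center_note
        + depth_semitones * math.sin(2.0 * math.pi * rate_hz * i / frames_per_second)
        for i in range(n_frames)
    ]


def quantize_to_notes(contour):
    """Round every frame to the nearest whole note number."""
    return [round(p) for p in contour]


contour = vibrato_contour()
notes = quantize_to_notes(contour)

# Spread of the original contour versus the quantized one, in semitones.
original_range = max(contour) - min(contour)
quantized_range = max(notes) - min(notes)
print(original_range, quantized_range)
```

Because the vibrato depth is less than half a semitone, every frame rounds to the same note, so the quantized track carries none of the original pitch movement.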