1. Field of the Invention
The present invention relates to a voice converter for assimilating a user voice to be processed to a different target voice, to a voice converting method, and to a voice conversion dictionary generating method for generating a voice conversion dictionary that corresponds to the target voice and is used for the voice conversion. More particularly, it relates to a voice converter, a voice converting method, and a voice conversion dictionary generating method suitable for use in a karaoke apparatus.
In addition, the present invention relates to a voice processing apparatus for associating in time series a target voice with an input voice for temporal alignment, and to a karaoke apparatus having the voice processing apparatus.
2. Related Background Art
There have been developed various kinds of voice converters which change the frequency characteristics of an input voice before outputting it. For example, there are karaoke apparatuses that convert the pitch of a karaoke player's singing voice so as to convert a male voice to a female voice or vice versa (for example, Japanese PCT Publication No. 8-508581).
In the conventional voice converters, however, the conversion is limited to voice quality alone (for example, a male voice to a female voice or a female voice to a male voice), and therefore they are not capable of converting a voice into an imitation of the voice of a specific singer (for example, a professional singer).
Furthermore, a karaoke apparatus would be very entertaining if it had an imitative function of assimilating not only the voice quality but also the way of singing to that of a professional singer. In the conventional voice converters, however, this kind of processing is impossible.
Accordingly, the inventors have proposed a voice converter that converts a user's voice in imitation of the voice of a singer to be targeted (a target singer). This voice converter analyzes the target singer's voice so as to assimilate the voice quality of the user to that of the target singer, retains the resulting analysis data, including a sinusoidal component attribute pitch, an amplitude, a spectrum shape, and residual components, as target frame data for all frames of a music piece, and performs the conversion in synchronization with input frame data obtained by analyzing the input voice (refer to Japanese Patent Application No. 10-183338).
While the above voice converter is capable of assimilating not only the voice quality but also the way of singing to that of the target singer, analysis data of the target singer is required for each music piece, and therefore the amount of data becomes enormous when analysis data for a plurality of music pieces are stored.
In technical fields such as karaoke, there has conventionally been provided a voice processing technology for converting a singer's singing voice into an imitation of the singing voice of a specific singer such as a professional singer. Generally, this voice processing requires alignment, that is, associating two voice signals with each other in time series. For example, when synthesizing a target singer's vocalization of “nakinagara (with tears)” based on a user singer's vocalization of “nakinagara” in imitation of the target, the sound “ki” may be vocalized by the target singer at a different timing from that of the user singer.
In this manner, even when different people vocalize the same sound, the durations are not identical, and in many cases the sound is elongated or contracted non-linearly. Therefore, for comparing two voices, there is known a DP matching method (dynamic time warping: DTW), which performs time normalization by elongating and contracting the time axis non-linearly so that the phonemes in the two voices correspond to each other. In the DP matching method, a typical time series is used as a standard pattern for a word or a phoneme, and therefore voices can be matched in units of a phoneme despite temporal structural changes in the time-series pattern.
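For illustration only, the non-linear time normalization performed by DP matching can be sketched as follows. This is a minimal textbook form of dynamic time warping and is not taken from any of the cited publications; the function name `dtw_align`, the use of one scalar feature per frame, and the absolute-difference local distance are all assumptions made for the example.

```python
def dtw_align(a, b):
    """Align two feature sequences with basic dynamic time warping.

    a, b: lists of floats (hypothetical one-dimensional frame features).
    Returns (total_cost, path), where path is a list of (i, j) index pairs
    associating frame i of a with frame j of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = d + min(cost[i - 1][j - 1],  # both advance
                                 cost[i - 1][j],      # only a advances
                                 cost[i][j - 1])      # only b advances
    # Trace back from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path


# The second sequence holds its middle sound one frame longer.
total, path = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

In this example the elongated middle frame of the second sequence is absorbed by the warping path, so the two sequences are associated frame by frame even though their lengths differ.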
Additionally, there is known a technique using a hidden Markov model (HMM), which is highly robust against spectral fluctuation. In the hidden Markov model, statistical fluctuation in the spectral time series can be reflected in the parameters of the model, and therefore voices can be matched in units of a phoneme despite spectral fluctuation caused by individual variations among speakers.
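For illustration only, phoneme-level matching with a hidden Markov model can be sketched as a Viterbi alignment over a left-to-right model. This is a toy example under stated assumptions and not a method described in the cited publications: each state stands for one phoneme, each frame is a single scalar feature, each state emits from a Gaussian whose variance carries the statistical fluctuation, and the names `viterbi_align`, `means`, `variances`, and `stay_prob` are hypothetical.

```python
import math


def viterbi_align(obs, means, variances, stay_prob=0.5):
    """Assign each frame in obs to a state of a left-to-right HMM.

    obs: list of floats (hypothetical one-dimensional frame features).
    means, variances: per-state Gaussian emission parameters (assumed).
    Returns a list of state indices, one per frame, starting in the first
    state and ending in the last state.
    """
    n_states, n_frames = len(means), len(obs)
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")
    neg_inf = float("-inf")
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)

    def log_emit(x, s):
        var = variances[s]
        return -0.5 * (math.log(2.0 * math.pi * var) + (x - means[s]) ** 2 / var)

    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emit(obs[0], 0)  # alignment must start in state 0
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_move if s > 0 else neg_inf
            if stay >= move:
                best, prev = stay, s
            else:
                best, prev = move, s - 1
            score[t][s] = best + log_emit(obs[t], s)
            back[t][s] = prev
    # Trace back from the last state at the last frame.
    s = n_states - 1
    states = [s]
    for t in range(n_frames - 1, 0, -1):
        s = back[t][s]
        states.append(s)
    states.reverse()
    return states


# Three hypothetical phoneme states with well-separated feature means.
states = viterbi_align([0.1, -0.1, 5.2, 4.9, 5.0, 10.1],
                       means=[0.0, 5.0, 10.0],
                       variances=[1.0, 1.0, 1.0])
```

Because deviations from a state's mean are scored through the emission variance rather than by a fixed template distance, frames that fluctuate around a phoneme's typical value are still assigned to that phoneme.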
However, the above DP matching method suffers from degraded precision in the presence of spectral fluctuation, and the conventional use of a hidden Markov model requires a large storage capacity and a large amount of computation. Both are therefore unsuitable for voice processing that requires real-time performance, such as imitation in a karaoke apparatus.