1. Field of the Invention
The present invention relates to a voice converter for assimilating a user voice to be processed to a different target voice, a voice converting method, and a voice conversion dictionary generating method for generating a voice conversion dictionary corresponding to the target voice used for the voice conversion, and more particularly to a voice converter, a voice converting method, and a voice conversion dictionary generating method preferred to be used for a karaoke apparatus.
In addition, the present invention relates to a voice processing apparatus for associating in time series a target voice with an input voice for temporal alignment, and to a karaoke apparatus having the voice processing apparatus.
2. Related Background Art
There have been developed various kinds of voice converters which change frequency characteristics of an input voice before output. For example, there are karaoke apparatuses that convert a pitch of a singing voice of a karaoke player so as to convert a male voice to a female voice or vice versa (for example, Japanese PCT Publication No. 8-508581).
In the conventional voice converters, however, the conversion is limited to a change in general voice quality (for example, a male voice to a female voice, or a female voice to a male voice), and therefore they are not capable of converting a voice in imitation of the voice of a specific singer (for example, a professional singer).
Furthermore, a karaoke apparatus would be very entertaining if it had something like an imitative function of assimilating not only a voice quality but also a way of singing to that of the professional singer. In the conventional voice converters, however, this kind of processing is impossible.
Accordingly, the inventors suggest a voice converter for a conversion in imitation of a voice of a singer to be targeted (a target singer) by analyzing the target singer's voice so as to assimilate a voice quality of the user to the target singer's voice, retaining achieved analysis data including a sinusoidal component attribute pitch, an amplitude, a spectrum shape, and residual components as target frame data for all frames of a music piece, and performing a conversion in synchronization with the input frame data obtained by analyzing the input voice (refer to Japanese Patent Application No. 10-183338).
While the above voice converter is capable of assimilating not only a voice quality, but also a way of singing to that of the target singer, analysis data of the target singer is required for each music piece and therefore a data amount becomes enormously large when analysis data of a plurality of music pieces are stored.
Conventionally in a technical field of karaoke or the like, there has been provided a voice processing technology of converting a singing voice of a singer to another in imitation of a singing voice of a specific singer such as a professional singer. Generally this voice processing requires an execution of alignment for associating two voice signals with each other in time series. For example, in synthesizing a target singer's voice vocalized “nakinagara (with tears)” based on a singer's voice vocalized “nakinagara” in imitation of the target, the sound “ki” may be vocalized by the target singer at a different timing from that of the user singer.
In this manner, even if each person vocalizes the same sound, the duration is not identical and the sound may be non-linearly elongated or contracted in many cases. Therefore, in a comparison of two voices, there is known a DP matching method (dynamic time warping: DTW) for time normalization by elongating and contracting a time axis non-linearly so that the phonemes correspond to each other in the two voices. In the DP matching method, a typical time series is used as a standard pattern regarding a word or a phoneme, and therefore voices can be matched in units of a phoneme against a temporal structural change of a time-series pattern.
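The non-linear time normalization described above can be illustrated with a minimal sketch of textbook dynamic time warping. It assumes each frame is reduced to a single number and uses an absolute difference as the local distance; the function name `dtw` and these simplifications are illustrative assumptions, not part of the prior art cited here.

```python
def dtw(a, b):
    """Align two numeric sequences; return (total_cost, path).

    path is a list of (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # advance both
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

In this sketch the second sequence may hold a sound longer than the first, and the returned path maps several of its frames onto one frame of the first sequence.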
Additionally, there is known a technique using a hidden Markov model (HMM) having an excellent effect against a spectral fluctuation. In the hidden Markov model, a statistical fluctuation in the spectral time series can be reflected on a parameter of a model and therefore voices can be matched in units of a phoneme against a spectral fluctuation caused by individual variations of speakers.
However, the use of the above DP matching method deteriorates precision with respect to a spectral fluctuation, and the conventional use of a hidden Markov model requires a large amount of storage capacity and computation. Therefore, both of them are unsuitable for voice processing that must run in real time, such as imitation in a karaoke apparatus.
Therefore, it is an object of the present invention to provide a voice converter capable of assimilating an input singer's voice to a target voice in the way of singing of a target singer while reducing the amount of analysis data of the target singer, as well as a voice converting method and a voice conversion dictionary generating method.
It is another object of the present invention to provide a voice processing apparatus capable of executing real-time processing with a small storage capacity for voice processing of associating in time series a target voice with an input voice for temporal alignment, and a karaoke apparatus having the voice processing apparatus.
In one aspect of the invention, a voice converting apparatus is constructed for converting an input voice into an output voice according to a target voice. The apparatus comprises a storage section that provisionally stores source data, which is associated to and extracted from the target voice, an analyzing section that analyzes the input voice to extract therefrom a series of input data frames representing the input voice, a producing section that produces a series of target data frames representing the target voice based on the source data, while aligning the target data frames with the input data frames to secure synchronization between the target data frames and the input data frames, and a synthesizing section that synthesizes the output voice according to the target data frames and the input data frames.
Preferably, the storage section stores the source data containing pitch trajectory information representing a trajectory of a pitch of a phrase constituted by the target voice, phonetic notation information representing a sequence of phonemes with duration thereof in correspondence with the phrase of the target voice, and spectrum shape information representing a spectrum shape of each phoneme of the target voice. Further, the storage section stores the source data containing amplitude trajectory information representing a trajectory of an amplitude of the phrase constituted by the target voice.
Preferably, the producing section comprises a characteristic analyzer that extracts from the input voice a characteristic vector which is characteristic of the input voice, a memory that memorizes recognition phoneme data for use in recognition of phonemes contained in the input voice and target behavior data which is a part of the source data and which represents a behavior of the target voice, an alignment processor that determines a temporal relation between the input data frames and the target data frames according to the characteristic vector, the recognition phoneme data and the target behavior data so as to output alignment data corresponding to the determined temporal relation, and a target decoder that produces the target data frames according to the alignment data, the input data frames and the source data containing phoneme data representing phonemes of the target voice. Further, the producing section comprises a data converter that converts the target behavior data, in response to parameter control data provided from an external source, into pitch trajectory information representing a trajectory of a pitch of the target voice, amplitude trajectory information representing a trajectory of an amplitude of the target voice, and phonetic notation information representing a sequence of phonemes with duration thereof in correspondence with the target voice, and that feeds the pitch trajectory information and the amplitude trajectory information to the target decoder and feeds the phonetic notation information to the alignment processor.
Preferably, the target decoder includes an interpolator that produces a target data frame by interpolating spectrum shapes representing phonemes of the target voice. The interpolator produces a target data frame of a particular phoneme at a desired particular pitch by interpolating a pair of spectrum shapes corresponding to the same phoneme as the particular phoneme but sampled at different pitches than the desired pitch. Further, the target decoder includes a state detector that detects whether the input voice is placed in a stable state at a certain phoneme or in a transition state from a preceding phoneme to a succeeding phoneme, such that the interpolator operates when the input voice is detected to be in the transition state for interpolating a spectrum shape of the preceding phoneme and another spectrum shape of the succeeding phoneme with each other.
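The pitch-dependent interpolation described above can be sketched as follows. This is a minimal illustration only: it assumes both stored spectrum shapes are magnitude lists sampled on the same frequency grid and that the weighting between them is linear in pitch. The function name and argument layout are hypothetical.

```python
def interpolate_shape_by_pitch(shape_lo, pitch_lo, shape_hi, pitch_hi, pitch):
    """Blend two spectrum shapes of one phoneme for a desired pitch.

    shape_lo / shape_hi: magnitudes on a shared frequency grid (assumption),
    stored at pitches pitch_lo < pitch_hi. The weight is linear in pitch
    and clamped so that pitches outside the pair reuse the nearer shape.
    """
    if pitch_hi <= pitch_lo:
        raise ValueError("pitch_hi must be greater than pitch_lo")
    if len(shape_lo) != len(shape_hi):
        raise ValueError("spectrum shapes must share one frequency grid")
    alpha = (pitch - pitch_lo) / (pitch_hi - pitch_lo)
    alpha = min(1.0, max(0.0, alpha))
    return [(1.0 - alpha) * lo + alpha * hi
            for lo, hi in zip(shape_lo, shape_hi)]
```

The same blending could be reused in the transition state, with the weight driven by the position between the preceding and succeeding phonemes instead of by pitch.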
Preferably, the interpolator utilizes a modifier function for the interpolation of a pair of spectrum shapes so as to modify the spectrum shape of the target data frame. In such a case, the target decoder includes a function generator that generates a modifier function utilized for linearly modifying the spectrum shape and another modifier function utilized for nonlinearly modifying the spectrum shape. Practically, the interpolator divides the pair of the spectrum shapes into a plurality of frequency bands and individually applies a plurality of modifier functions to respective ones of the divided frequency bands. Practically, the interpolator operates when the input voice transitions from a preceding phoneme to a succeeding phoneme, utilizing a modifier function specified by the preceding phoneme in the interpolation of a pair of phonemes of the target voice corresponding to the pair of the preceding and succeeding phonemes of the input voice. Preferably, the interpolator operates in real time to determine a modifier function to be utilized in the interpolation according to one of a pitch of the input voice, a pitch of the target voice, an amplitude of the input voice, an amplitude of the target voice, a spectrum shape of the input voice and a spectrum shape of the target voice. Practically, the interpolator divides the pair of the spectrum shapes into a plurality of bands along a frequency axis such that each band contains a pair of fragments taken from the pair of the spectrum shapes, each fragment being a sequence of dots each determined by a set of a frequency and a magnitude, and the interpolator utilizes a modifier function of a linear type for the interpolation of the pair of the fragments dot by dot in each band. In such a case, the interpolator comprises a frequency interpolator that utilizes the modifier function for interpolating a pair of frequencies contained in a pair of dots corresponding to each other between the pair of the fragments, and a magnitude interpolator that utilizes the modifier function for interpolating a pair of magnitudes contained in the pair of dots corresponding to each other.
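The dot-by-dot interpolation within one band can be sketched as below. The sketch assumes each fragment is a list of (frequency, magnitude) dots with equal length in both fragments, and it treats a modifier function as a mapping from the interpolation position `t` in [0, 1] to a blending weight. The names `linear_modifier`, `smooth_modifier` and `interpolate_band`, and the particular nonlinear curve, are assumptions made for illustration.

```python
def linear_modifier(t):
    """Linear-type modifier: the weight equals the position itself."""
    return t


def smooth_modifier(t):
    """One possible nonlinear modifier (a smoothstep curve, assumed here)."""
    return t * t * (3.0 - 2.0 * t)


def interpolate_band(fragment_a, fragment_b, t,
                     freq_modifier=linear_modifier,
                     mag_modifier=linear_modifier):
    """Interpolate two fragments of one band dot by dot.

    Each fragment is a list of (frequency, magnitude) dots. Frequencies and
    magnitudes are interpolated separately, each with its own modifier.
    """
    if len(fragment_a) != len(fragment_b):
        raise ValueError("fragments must contain the same number of dots")
    w_freq = freq_modifier(t)
    w_mag = mag_modifier(t)
    return [(fa + w_freq * (fb - fa), ma + w_mag * (mb - ma))
            for (fa, ma), (fb, mb) in zip(fragment_a, fragment_b)]
```

Applying a different modifier to each band then amounts to calling `interpolate_band` once per band with the modifier chosen for that band.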
Preferably, the target decoder produces the target data frames such that each target data frame contains a spectrum shape having an amplitude and a spectrum tilt, and the target decoder includes a tilt corrector that corrects the spectrum tilt in matching with the amplitude. In such a case, the tilt corrector has a plurality of filters selectively applied to the spectrum shape of the target data frame to correct the spectrum tilt thereof according to a difference between the spectrum tilt of the target data frame and a spectrum tilt of the corresponding input data frame.
The one aspect of the invention includes a method of producing a phoneme dictionary of a model voice of a model person for use in a voice conversion. The method comprises the steps of sampling the model voice as the model person continuously vocalizes a phoneme while the model person sweeps a pitch of the model voice through a measurable pitch range, analyzing the sampled model voice to extract therefrom a sequence of spectrum shapes along the measurable pitch range, dividing the measurable pitch range into a plurality of segments in correspondence to a plurality of pitch levels, statistically processing a set of spectrum shapes belonging to each segment to produce each averaged spectrum shape in correspondence to each pitch level, and recording the plurality of the averaged spectrum shapes and the plurality of the corresponding pitch levels to form the phoneme dictionary in which each phoneme sampled from the model person is represented by variable ones of the averaged spectrum shapes in terms of the pitch levels. Further, the step of statistically processing comprises dividing the set of the spectrum shapes into a plurality of frequency bands, then calculating an average of magnitudes of the spectrum shape at each frequency band, and collecting all of the calculated averages throughout all of the frequency bands to obtain the averaged spectrum shape.
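The dictionary generating steps above can be sketched as follows. The sketch assumes the analysis has already produced one (pitch, spectrum shape) pair per frame of the pitch sweep, that each spectrum shape is a list holding one magnitude per frequency band, and that pitch segments have equal width. The function name and the use of the segment centre as the recorded pitch level are illustrative assumptions.

```python
def build_phoneme_entry(samples, num_levels):
    """Average spectrum shapes of one phoneme per pitch segment.

    samples: list of (pitch, shape) pairs, shape = magnitudes per band.
    Returns a list of (pitch_level, averaged_shape), one per non-empty
    segment, where pitch_level is the centre of the segment (assumption).
    """
    pitches = [pitch for pitch, _ in samples]
    lo, hi = min(pitches), max(pitches)
    if hi == lo:
        raise ValueError("the sweep must cover a range of pitches")
    width = (hi - lo) / num_levels
    buckets = [[] for _ in range(num_levels)]
    for pitch, shape in samples:
        # The highest pitch falls on the upper edge; keep it in the last segment.
        index = min(int((pitch - lo) / width), num_levels - 1)
        buckets[index].append(shape)
    entry = []
    for index, shapes in enumerate(buckets):
        if not shapes:
            continue
        centre = lo + (index + 0.5) * width
        # zip(*shapes) groups the magnitudes of one band across all shapes.
        averaged = [sum(band) / len(shapes) for band in zip(*shapes)]
        entry.append((centre, averaged))
    return entry
```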
In another aspect of the invention, a voice processing apparatus is constructed for aligning a sequence of phonemes of a target voice represented by a time-series of frames with a sequence of phonemes of an input voice represented by a time-series of frames. The apparatus comprises a target storage section that stores a sequence of phonemes contained in the target voice, the sequence of the phonemes being obtained by provisionally analyzing the time-series of the frames of the target voice, a phoneme storage section that stores a code book containing characteristic vectors representing characteristic parameters typical to phonemes, the characteristic vector being clustered into a number of symbols in the code book, and that stores a transition probability of a state of each phoneme and an observation probability of each symbol, a quantizing section that analyzes the time-series of the frames of the input voice to extract therefrom the characteristic parameters, and that quantizes the characteristic parameters into observed symbols of the input voice according to the code book stored in the phoneme storage section, a state forming section that applies a hidden Markov model to the sequence of the phonemes of the target voice stored in the target storage section so as to estimate therefrom a time-series of states of the phonemes of the target voice based on the transition probability of the state of each phoneme and the observation probability of each symbol stored in the phoneme storage section, a transition determining section that determines transitions of states occurring in the sequence of the phonemes of the input voice by a Viterbi algorithm based on the observed symbols of the input voice and the estimated time-series of the states of the phonemes of the target voice, and an aligning section that aligns the sequence of the phonemes of the target voice and the sequence of the phonemes of the input voice with each other according to the determined state transitions of the input voice.
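The state transition search can be illustrated with a minimal Viterbi sketch over a left-to-right chain built from the target phoneme sequence. It makes several simplifying assumptions that are not taken from the text: one state per phoneme, a single shared probability `stay_prob` of remaining in a state, alignment starting in the first state and ending in the last, and observation probabilities supplied as a nested dictionary. All names are hypothetical.

```python
import math


def viterbi_align(states, observations, stay_prob, emit):
    """Return, for each input frame, the index of the target state it maps to.

    states: target phoneme labels in order (one state each, assumption).
    observations: observed symbols of the input voice, one per frame.
    emit: emit[phoneme][symbol] = observation probability.
    """
    neg_inf = float("-inf")

    def log(p):
        return math.log(p) if p > 0.0 else neg_inf

    n, frames = len(states), len(observations)
    score = [[neg_inf] * n for _ in range(frames)]
    back = [[0] * n for _ in range(frames)]
    score[0][0] = log(emit[states[0]].get(observations[0], 0.0))
    for t in range(1, frames):
        for s in range(n):
            stay = score[t - 1][s] + log(stay_prob)
            move = score[t - 1][s - 1] + log(1.0 - stay_prob) if s > 0 else neg_inf
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + log(emit[states[s]].get(observations[t], 0.0))
    if score[frames - 1][n - 1] == neg_inf:
        raise ValueError("no valid alignment for these observations")
    # Trace the best path backwards from the final state.
    path = [n - 1]
    for t in range(frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path
```

The returned list gives the frame ranges occupied by each target phoneme in the input voice, which is the information an aligning step needs.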
Preferably, the code book contains a characteristic vector which characterizes a spectrum of a voice in terms of a mel-cepstrum coefficient. The code book contains a characteristic vector which characterizes a spectrum of a voice in terms of a differential mel-cepstrum coefficient. The code book contains a characteristic vector which characterizes a voice in terms of a differential energy coefficient. The code book contains a characteristic vector which characterizes a voice in terms of an energy. The code book contains a characteristic vector which characterizes a voiceness of a voice in terms of a zero-cross rate and a pitch error observed in a waveform of the voice.
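Two of the characteristic parameters named above, the energy and the zero-cross rate, can be computed from one frame of waveform samples as in the following sketch. It uses the common textbook definitions (mean squared amplitude, and the fraction of adjacent sample pairs whose signs differ); whether the apparatus uses exactly these definitions is an assumption.

```python
def frame_energy(samples):
    """Mean squared amplitude of one frame."""
    if not samples:
        raise ValueError("frame must not be empty")
    return sum(x * x for x in samples) / len(samples)


def zero_cross_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)
```

A high zero-cross rate together with low energy is typical of unvoiced sounds, which is why such values are useful as an indication of voiceness.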
Preferably, the phoneme storage section stores the code book produced by quantization of predicted vectors of a given learning set using an algorithm for clustering. The phoneme storage section stores the transition probability of each state and the observation probability of each symbol with respect to the characteristic vector of each phoneme, the characteristic vector being obtained by estimating characteristic parameters maximizing a likelihood of a model for learning data.
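Once a code book has been produced by clustering, quantizing a characteristic vector into an observed symbol reduces to a nearest-neighbour lookup. The sketch below assumes a squared Euclidean distance and represents the code book as a list of centroid vectors whose list index serves as the symbol; both choices are illustrative assumptions, and the clustering step itself is not shown.

```python
def quantize(vector, code_book):
    """Return the index (symbol) of the code book vector nearest to vector."""
    if not code_book:
        raise ValueError("code book must not be empty")

    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    return min(range(len(code_book)),
               key=lambda k: squared_distance(vector, code_book[k]))
```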
Preferably, the transition determining section searches for an optimal state among a number of states around a current state of the estimated time-series of the states so as to determine a transition from the current state to the optimal state occurring in the sequence of the phonemes of the input voice.
Preferably, the state forming section estimates the time-series of states of the phonemes of the target voice such that the time-series of states contains a pass from one state of one phoneme to another state of another phoneme and an alternative pass from one state to another state via a silent state or an aspiration state. Further, the state forming section estimates the time-series of states of the phonemes of the target voice such that the time-series of states contains parallel passes from one state of one phoneme to another state of another phoneme via different states of similar phonemes having equivalent transition probabilities.
Preferably, the aligning section aligns the sequence of the phonemes of the target voice and the sequence of the phonemes of the input voice with each other such that each phoneme has a region containing a variable number of frames and such that the number of frames contained in each region of each phoneme can be adjusted for the aligning of the target voice with the input voice. In such a case, when the number of frames contained in a region of a phoneme of the input voice is greater than the number of frames contained in the corresponding region of the same phoneme of the target voice, the aligning section adds a provisionally stored frame into the corresponding region, thereby expanding the corresponding region of the target voice in alignment with the region of the input voice. Further, when the number of frames contained in a region of a phoneme of the input voice is smaller than the number of frames contained in the corresponding region of the same phoneme of the target voice, the aligning section deletes one or more frames from the corresponding region, thereby compressing the corresponding region of the target voice in alignment with the region of the input voice.
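The expansion and compression of a phoneme region can be sketched as below. The sketch assumes frames are opaque objects held in a list, that the provisionally stored frame is passed in as `filler`, that added frames go at the end of the region, and that surplus frames are removed from the end. The text does not state where frames are added or removed, so those positions are assumptions.

```python
def fit_region(target_frames, input_len, filler):
    """Resize one phoneme region of the target voice to input_len frames."""
    frames = list(target_frames)  # leave the stored region untouched
    if input_len > len(frames):
        # Input holds the phoneme longer: pad with the stored frame.
        frames.extend([filler] * (input_len - len(frames)))
    elif input_len < len(frames):
        # Input holds the phoneme shorter: drop the surplus frames.
        del frames[input_len:]
    return frames
```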
Preferably, the transition determining section operates when determining a transition from a current state of a fricative phoneme for evaluating both of a transition probability to another state of another fricative phoneme and a transition probability to another state of a next phoneme of the target voice.
Preferably, the voice processing apparatus further comprises a synthesizing section that synthesizes the frames of the input voice and the frames of the target voice with each other synchronously frame by frame after the input voice and the target voice are temporally aligned with each other. Further, the apparatus comprises an analyzing section that analyzes each frame of the input voice to extract therefrom sinusoidal components and residual components contained in each frame, wherein the target storage section stores the frames of the target voice such that each frame contains sinusoidal components and residual components provisionally extracted from the target voice, and wherein the synthesizing section mixes the sinusoidal components or the residual components of the input voice and the sinusoidal components or the residual components of the target voice with each other at a predetermined ratio at each frame. Further, the apparatus comprises a waveform generating section for applying an inverse Fourier transform to the mixed sinusoidal components and the residual components so as to generate a waveform of a synthesized voice.
Practically, the inventive apparatus further comprises a music storage section that stores music data representative of a karaoke music piece, a reproducing section that reproduces the karaoke music piece according to the stored music data, a synchronizing section that synchronizes the time-series of the frames of the target voice sampled from a model singer with a temporal progress of the karaoke music piece, a synthesizing section that synthesizes the frames of the input voice of a karaoke player and the frames of the target voice of the model singer with each other synchronously frame by frame after the input voice and the target voice are temporally aligned with each other to form a time-series of an output voice, and a sounding section that sounds the output voice along with the karaoke music piece. In such a case, the transition determining section weights the transition probability of each state of each phoneme in synchronization with the temporal progress of the karaoke music piece when the transition determining section determines transitions of states occurring in the sequence of the phonemes of the input voice.
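The per-frame mixing at a predetermined ratio can be sketched as follows. The sketch assumes each frame is a dictionary with equal-length lists under the keys `"sinusoidal"` and `"residual"`, and that a ratio of 0 keeps the input voice while a ratio of 1 takes the target voice. The frame layout, the key names and the use of separate ratios for the two component types are illustrative assumptions.

```python
def mix_frames(input_frame, target_frame, sin_ratio, res_ratio):
    """Mix one aligned pair of frames component by component."""
    for ratio in (sin_ratio, res_ratio):
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("mixing ratios must lie in [0, 1]")

    def blend(own, other, ratio):
        return [(1.0 - ratio) * x + ratio * y for x, y in zip(own, other)]

    return {
        "sinusoidal": blend(input_frame["sinusoidal"],
                            target_frame["sinusoidal"], sin_ratio),
        "residual": blend(input_frame["residual"],
                          target_frame["residual"], res_ratio),
    }
```

The mixed frame would then be passed on to waveform generation, which is not shown here.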