The present invention generally relates to a voice converting apparatus and a voice converting method that make a voice simulate a target voice and, more particularly, to a voice converting apparatus and a voice converting method that are suitable for use in a karaoke apparatus.
The present invention also relates to a voice analyzing apparatus, a voice analyzing method and a recording medium with a voice analyzing program recorded thereon, which execute a voice/unvoice judgment on an input voice.
Various voice converting apparatuses have been developed by which the frequency characteristic and so on of an inputted voice are converted. For example, some karaoke apparatuses change the pitch of a singing voice to convert the same into a voice of opposite gender (as described in Publication of Translation of International Application No. Hei 8-508581, for example).
In the conventional voice converting apparatuses, however, voice conversion (for example, from male to female and vice versa) is executed only to change voice quality, not to simulate the voice of a particular singer (for example, a professional singer).
It would be amusing to have a karaoke apparatus provide a capability of simulating not only the voice quality but also the singing mannerisms of a particular singer. It has been impossible for the conventional karaoke apparatus to provide such a capability.
Conventionally, various voice conversion techniques have been proposed that convert the pitch and voice quality by modifying attributes of a voice signal. FIG. 37 illustrates a first pitch converting method; FIG. 38 illustrates a second pitch converting method.
As shown in FIG. 37, the first method executes pitch conversion by re-sampling the waveform of an input voice signal so as to compress or expand the waveform. According to this method, when the waveform is compressed, the pitch shifts up because the fundamental frequency rises; when the waveform is expanded, the pitch shifts down because the fundamental frequency drops.
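The first method can be illustrated with a minimal Python sketch. It is not taken from FIG. 37; it assumes NumPy, linear interpolation as the re-sampling rule, and a synthetic 200 Hz test tone. The names resample_pitch_shift, dominant_frequency and SAMPLE_RATE are hypothetical and introduced only for this illustration.

```python
import numpy as np

def resample_pitch_shift(signal, ratio):
    """Compress (ratio > 1) or expand (ratio < 1) a waveform on the time axis.

    Assumption: plain linear interpolation stands in for the re-sampling step.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) / ratio))
    # Positions in the input at which the output samples are read.
    positions = np.arange(n_out) * ratio
    return np.interp(positions, np.arange(len(signal)), signal)

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC FFT bin."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    spectrum[0] = 0.0
    return np.argmax(spectrum) * sample_rate / len(signal)

SAMPLE_RATE = 8000                                  # assumed sampling rate (Hz)
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE            # one second of samples
tone = np.sin(2 * np.pi * 200.0 * t)                # 200 Hz test tone

compressed = resample_pitch_shift(tone, 2.0)        # half as many samples
expanded = resample_pitch_shift(tone, 0.5)          # twice as many samples

print(dominant_frequency(tone, SAMPLE_RATE))
print(dominant_frequency(compressed, SAMPLE_RATE))  # fundamental moves up
print(dominant_frequency(expanded, SAMPLE_RATE))    # fundamental moves down
```

Because every frequency component is scaled by the same ratio, the spectral envelope moves together with the fundamental frequency, and the duration of the signal changes as well.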
On the other hand, as shown in FIG. 38 and according to the second method, the waveform of the input voice signal is extracted periodically and reconstructed at a desired pitch interval. This allows pitch conversion without changing frequency characteristics of the input voice signal.
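The second method can be sketched in the same hedged way. The Python code below is a simplified overlap-add illustration, not the procedure of FIG. 38: it assumes NumPy, a known and constant input pitch period, pitch marks at exact multiples of that period, and Hann-windowed segments two periods long. The names overlap_add_pitch_shift, estimate_period and PERIOD_IN, and the synthetic test signal, are hypothetical.

```python
import numpy as np

def overlap_add_pitch_shift(signal, period_in, period_out):
    """Extract one segment per pitch mark and re-space the segments.

    Assumption: pitch marks lie at constant multiples of period_in samples.
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(2 * period_in + 1)          # segment spans two input periods
    out = np.zeros(len(signal))
    in_marks = np.arange(period_in, len(signal) - period_in, period_in)
    out_marks = np.arange(period_in, len(signal) - period_in, period_out)
    for m_out in out_marks:
        # Reuse the extracted segment whose pitch mark is nearest in time.
        m_in = in_marks[np.argmin(np.abs(in_marks - m_out))]
        segment = signal[m_in - period_in : m_in + period_in + 1] * window
        out[m_out - period_in : m_out + period_in + 1] += segment
    return out

def estimate_period(signal, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation value."""
    scores = [np.dot(signal[:-lag], signal[lag:]) for lag in range(min_lag, max_lag)]
    return min_lag + int(np.argmax(scores))

# Synthetic voiced sound: a pulse train exciting one decaying resonance.
PERIOD_IN = 80                                      # assumed input pitch period (samples)
n = np.arange(4000)
pulses = (n % PERIOD_IN == 0).astype(float)
k = np.arange(60)
resonance = np.exp(-k / 12.0) * np.sin(2 * np.pi * k / 8.0)
voiced = np.convolve(pulses, resonance)[: len(n)]

raised = overlap_add_pitch_shift(voiced, PERIOD_IN, 40)   # halve the pitch period

print(estimate_period(voiced, 20, 200))
print(estimate_period(raised, 20, 200))
```

In this sketch each extracted segment keeps its original shape, so the resonance inside a segment is not rescaled; only the spacing between segments, and hence the pitch interval, changes.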
In the above conventional methods, however, the voice conversion is insufficient to naturally convert a male voice to a female voice and vice versa. For example, if conversion is executed from the male voice to the female voice, the pitch must be raised by compressing the sampled signal as shown in FIG. 37, because the pitch of the female voice is typically higher than that of the male voice. Such pitch conversion, however, also changes a frequency characteristic (formant) of the input voice signal. Since the pitch conversion is accompanied by a change in the voice quality, natural and feminine voice quality has not been obtained by such conventional pitch conversion. On the other hand, if only the pitch is converted by the method shown in FIG. 38, the voice quality remains masculine, not naturally feminine.
For voice quality conversion from a male voice to a female voice, a technique combining the above two methods has also been proposed, namely a technique that makes the voice quality feminine by doubling the pitch and applying a certain amount of compression to a waveform extracted during one cycle. However, even with this technique it has been difficult to execute voice conversion that provides the desired natural voice quality.
Further, in the above conventional techniques, all the voice conversion processing has been executed on the time axis, so that only the waveforms of input voice signals could be converted, leaving little freedom of processing. This has also made it difficult to convert the voice quality and pitch naturally.
Conventionally, various techniques for voice/unvoice judgment on an input voice signal have been proposed in the field of voice analysis technology. A typical one of such techniques judges the input voice signal to be unvoiced when the number of waveform zero crossings counted in a unit time is relatively large. There are also other judgment techniques, such as one using an auto-correlation function and one using a cepstrum analysis. Such techniques are described in “The Acoustic Analysis of Speech” (written by Ray D. Kent et al., the first edition dated May 10, 1996, published by Kaibundo).
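The zero-crossing technique can be illustrated with a minimal Python sketch. It assumes NumPy, a frame of 400 samples at an assumed 8 kHz sampling rate, and an arbitrary threshold of 0.25 crossings per sample; none of these values comes from the cited reference. The names zero_crossing_rate and is_unvoiced and the two synthetic test frames are hypothetical.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    frame = np.asarray(frame, dtype=float)
    signs = np.signbit(frame)
    return np.count_nonzero(signs[1:] != signs[:-1]) / (len(frame) - 1)

def is_unvoiced(frame, threshold=0.25):
    """Judge a frame unvoiced when its zero-crossing rate is relatively large.

    Assumption: the threshold of 0.25 is illustrative only.
    """
    return zero_crossing_rate(frame) > threshold

SAMPLE_RATE = 8000                                  # assumed sampling rate (Hz)
t = np.arange(400) / SAMPLE_RATE                    # one 50 ms frame

# Vowel-like frame: a low fundamental plus one harmonic.
vowel_like = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 300 * t)

# Noise-like frame standing in for a sound such as "s".
rng = np.random.default_rng(0)
noise_like = rng.standard_normal(400)

print(zero_crossing_rate(vowel_like), is_unvoiced(vowel_like))
print(zero_crossing_rate(noise_like), is_unvoiced(noise_like))
```

The sketch only separates a periodic frame from a noise-like frame; it makes no attempt to handle short bursts, which a count taken over a whole frame may not reflect.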
Unvoiced sounds include not only strident sounds such as “s” but also plosive sounds such as “p”. The above-mentioned judgment technique based on the zero-crossing counts can discriminate the strident sounds (e.g., “s”), but cannot discriminate the plosive sounds (e.g., “p”). Neither the method using the auto-correlation function nor the method using the cepstrum analysis has been sufficient for perfect judgment of voiced and unvoiced sounds. Thus, the conventional techniques involve a problem that the voice/unvoice judgment cannot be executed accurately.