The present invention relates to a voice enhancement device which makes the received voice in a portable telephone or the like easier to hear in an environment in which there is ambient background noise.
In recent years, portable telephones have become popular, and such portable telephones are now used in various locations. Portable telephones are commonly used not only in quiet locations, but also in noisy environments with ambient noise, such as airports and train station platforms. Accordingly, the problem arises that the received voice of portable telephones becomes difficult to hear as a result of ambient noise.
The simplest method of making the received voice easier to hear in a noisy environment is to increase the received sound volume in accordance with the noise level. However, if the received sound volume is increased to an excessive extent, there may be cases in which the input into the speaker of the portable telephone becomes excessive, so that sound quality conversely deteriorates. Furthermore, the following problem is also encountered: namely, if the received sound volume is increased, the burden on the auditory sense of the listener (user) is increased, which is undesirable from the standpoint of health.
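The trade-off described above can be sketched as follows. This is a minimal illustration only: the function name `amplify_with_clipping`, the normalized speaker input limit of 1.0, and the test tone are all assumptions chosen for illustration and do not appear in the description. The sketch applies a gain to a waveform and counts the samples that exceed the assumed limit and are therefore clipped, that is, distorted.

```python
import math

def amplify_with_clipping(samples, gain, limit=1.0):
    """Apply a gain, hard-limiting to the assumed speaker input range.

    Returns the output samples and the number of samples that were clipped.
    """
    out = []
    clipped = 0
    for s in samples:
        v = s * gain
        if abs(v) > limit:
            v = math.copysign(limit, v)  # hard clipping at the assumed limit
            clipped += 1
        out.append(v)
    return out, clipped

# Hypothetical received voice: a sine tone with peak amplitude 0.5.
tone = [0.5 * math.sin(2.0 * math.pi * n / 16.0) for n in range(64)]

# Moderate gain: the peak becomes 0.75, which stays inside the limit.
out_moderate, clipped_moderate = amplify_with_clipping(tone, gain=1.5)

# Excessive gain: the peak would become 2.0, so the waveform is clipped.
out_excessive, clipped_excessive = amplify_with_clipping(tone, gain=4.0)
```

With the moderate gain no sample is clipped, whereas with the excessive gain part of every cycle is flattened at the limit, which corresponds to the deterioration in sound quality noted above.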
Generally, when ambient noise is large, the clarity of voice is insufficient, so that the voice becomes difficult to hear. Accordingly, a method is conceivable in which the clarity is improved by amplifying the high-band components of the voice at a fixed rate. In the case of such a method, however, not only the high-band components, but also noise (transmission side noise) components contained in the received voice, are enhanced at the same time, so that the sound quality deteriorates.
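The drawback of fixed high-band amplification can be sketched as follows. The cutoff frequency of 2000 Hz, the gain of 2.0, and the three signal components are hypothetical values chosen for illustration. The sketch shows that a fixed gain treats every high-band component alike, so that transmission side noise is amplified by the same factor as the voice, and the ratio of voice to noise in the high band is not improved.

```python
def fixed_high_band_gain(freq_hz, cutoff_hz=2000.0, gain=2.0):
    """Fixed amplification for every component at or above the cutoff."""
    return gain if freq_hz >= cutoff_hz else 1.0

# Hypothetical components of the received signal:
# (frequency in Hz, amplitude, kind).
received = [
    (500.0, 1.0, "voice"),
    (2500.0, 0.3, "voice"),
    (3000.0, 0.1, "noise"),  # transmission side noise
]

enhanced = [(f, a * fixed_high_band_gain(f), kind) for f, a, kind in received]

def high_band_voice_to_noise(components, cutoff_hz=2000.0):
    """Ratio of voice amplitude to noise amplitude above the cutoff."""
    voice = sum(a for f, a, k in components if f >= cutoff_hz and k == "voice")
    noise = sum(a for f, a, k in components if f >= cutoff_hz and k == "noise")
    return voice / noise

ratio_before = high_band_voice_to_noise(received)
ratio_after = high_band_voice_to_noise(enhanced)
```

In this sketch the noise amplitude is doubled together with the high-band voice component, and `ratio_before` equals `ratio_after`.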
Here, there are generally peaks in the voice frequency spectrum, and these peaks are called formants. An example of the voice frequency spectrum is shown in FIG. 1. FIG. 1 shows a case in which there are three peaks (formants) in the spectrum. In order from the low frequency side, these formants are called the first formant, second formant and third formant, and the peak frequencies fp(1), fp(2) and fp(3) of the respective formants are called the formant frequencies.
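As a minimal sketch of how formant frequencies might be read off a spectrum, the following example locates the local maxima of a sampled spectral envelope. The function name `find_formant_frequencies` and the synthetic envelope (three bumps at 500 Hz, 1500 Hz and 2500 Hz) are assumptions made for illustration; they are not taken from FIG. 1, and practical formant estimation is generally more involved.

```python
import math

def find_formant_frequencies(freqs_hz, envelope):
    """Return the frequencies of strict local maxima, lowest first."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        if envelope[i - 1] < envelope[i] > envelope[i + 1]:
            peaks.append(freqs_hz[i])
    return peaks

def synthetic_envelope(f):
    """Hypothetical envelope: three bumps whose heights fall with frequency."""
    bumps = [(500.0, 30.0), (1500.0, 20.0), (2500.0, 12.0)]  # (center Hz, height)
    return sum(h * math.exp(-((f - c) ** 2) / (2.0 * 150.0 ** 2))
               for c, h in bumps)

freqs = [50 * i for i in range(81)]  # 0 Hz to 4000 Hz in 50 Hz steps
envelope = [synthetic_envelope(f) for f in freqs]

# fp[0], fp[1], fp[2] play the role of fp(1), fp(2), fp(3) in the text.
fp = find_formant_frequencies(freqs, envelope)  # [500, 1500, 2500]
```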
Generally, the voice spectrum has the property of showing a decrease in amplitude (power) as the frequency becomes higher. Furthermore, the voice clarity has a close relationship to the formants, and it is known that the voice clarity can be improved by enhancing the higher (second and third) formants.
An example of spectral enhancement is shown in FIG. 2. The solid line in FIG. 2 (a) and the dotted line in FIG. 2 (b) show the voice spectrum prior to enhancement. Furthermore, the solid line in FIG. 2 (b) shows the voice spectrum following enhancement. In FIG. 2 (b), the slope of the spectrum as a whole is flattened by increasing the amplitudes of the higher formants; as a result, the clarity of the voice as a whole can be improved.
A method using a band splitting filter (Japanese Patent Application Laid-Open No. 4-328798) is known as a method for improving clarity by enhancing such higher formants. In this method, the voice is split into a plurality of frequency bands by the band splitting filter, and the respective frequency bands are separately amplified or attenuated. In this method, however, there is no guarantee that the voice formants will always fall within the split frequency bands; accordingly, there is a danger that components other than the formants will also be enhanced, so that the clarity conversely deteriorates.
Furthermore, a method in which protruding parts and indented parts of the voice spectrum are amplified or attenuated (Japanese Patent Application Laid-Open No. 2000-117573) is known as a method for solving the problems encountered in the abovementioned conventional method using a band splitting filter. A block diagram of this conventional technique is shown in FIG. 3. In this method, the spectrum of the input voice is determined by a spectrum estimating part 100, protruding bands and indented bands are determined from the determined spectrum by a protruding band (peak)/indented band (valley) determining part 101, and the amplification factor (or attenuation factor) is determined for these protruding bands and indented bands.
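The limitation of fixed split bands can be sketched as follows. The band edges and gains in `BANDS` are hypothetical and are not taken from the cited publication. The sketch shows that a formant lying across a band edge receives different gains on its two sides, and that a component which is not a formant is amplified merely because it lies in an amplified band.

```python
# Hypothetical split bands: (lower edge Hz, upper edge Hz, gain).
BANDS = [
    (0.0, 1000.0, 1.0),
    (1000.0, 2000.0, 1.5),
    (2000.0, 4000.0, 2.0),
]

def band_gain(freq_hz, bands=BANDS):
    """Gain of the fixed band containing freq_hz (upper edge exclusive)."""
    for low, high, gain in bands:
        if low <= freq_hz < high:
            return gain
    return 1.0  # outside all bands: leave unchanged

# A formant centered near 2000 Hz straddles a band edge, so its lower and
# upper skirts receive different gains.
gain_lower_skirt = band_gain(1950.0)
gain_upper_skirt = band_gain(2050.0)

# A non-formant component at 3000 Hz is amplified simply because it lies
# inside an amplified band.
gain_non_formant = band_gain(3000.0)
```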
Next, coefficients for realizing the abovementioned amplification factor (or attenuation factor) are given to a filter part 103 by a filter construction part 102, and enhancement of the spectrum is realized by inputting the input voice into the abovementioned filter part 103.
In other words, in this conventional method, voice enhancement is realized by separately amplifying or attenuating the peaks and valleys of the voice spectrum.
Among the abovementioned conventional techniques, in methods in which the sound volume is increased, there are cases in which the increase in the sound volume results in an excessive input into the speaker, so that the playback sound is distorted. Furthermore, if the received sound volume is increased, the burden on the auditory sense of the listener (user) is increased, which is undesirable from a health standpoint.
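The amplification of protruding bands and attenuation of indented bands can be sketched in a highly simplified form as follows. The classification rule used here (comparison of each spectral bin with the mean level of the spectrum) and the factors 1.5 and 0.5 are assumptions made for illustration; the cited publication may determine the bands and factors differently.

```python
def enhance_peaks_and_valleys(spectrum, amplification=1.5, attenuation=0.5):
    """Amplify bins above the mean level and attenuate bins below it."""
    mean_level = sum(spectrum) / len(spectrum)
    enhanced = []
    for amplitude in spectrum:
        if amplitude > mean_level:      # treated as a protruding band (peak)
            enhanced.append(amplitude * amplification)
        elif amplitude < mean_level:    # treated as an indented band (valley)
            enhanced.append(amplitude * attenuation)
        else:
            enhanced.append(amplitude)
    return enhanced

# Hypothetical amplitude spectrum of one frame of input voice.
spectrum = [1.0, 4.0, 2.0, 0.5, 3.0, 1.0]
enhanced_spectrum = enhance_peaks_and_valleys(spectrum)
```

In this sketch the difference between the largest and smallest bins grows after processing, which corresponds to the enhancement of the spectral contrast between peaks and valleys.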
Furthermore, in conventional methods using a high-band enhancement filter, if simple high-band enhancement is used, high bands of noise other than the voice are enhanced, so that the feeling of noise is increased, which does not always lead to an improvement in clarity.
Moreover, in conventional methods using a band splitting filter, there is no guarantee that the voice formants will always fall within the split frequency bands. Accordingly, there may be cases in which components other than the formants are enhanced, so that the clarity conversely deteriorates. Furthermore, since the input voice is amplified without separating the sound source characteristics and the vocal tract characteristics, the problem of severe distortion of the sound source characteristics arises.
FIG. 4 shows a voice production model. In the process of voice production, the sound source signal produced by the sound source (vocal cords) 110 is input into a sound adjustment system (vocal tract) 111, and vocal tract characteristics are added in this vocal tract 111. Subsequently, the voice is finally output as a voice waveform from the lips 112 (see “Onsei no Konoritsu Fugoka” [“High Efficiency Encoding of Voice”], pp. 69–71, by Toshio Nakada, Morikita Shuppan).
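The voice production model of FIG. 4 can be sketched as follows under simplifying assumptions that are not stated in the description: the sound source is idealized as a periodic pulse train, the vocal tract is modeled as an all-pole filter with a single resonance at 500 Hz, and the sampling rate is 8 kHz. All function names are hypothetical. The sketch also applies the inverse of the assumed filter to show that, within this model, the sound source characteristics and the vocal tract characteristics are distinct and separable.

```python
import math

def impulse_train(length, period):
    """Idealized sound source: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def resonator_coefficients(freq_hz, radius, sample_rate_hz):
    """Coefficients (a1, a2) of a two-pole resonance at freq_hz."""
    theta = 2.0 * math.pi * freq_hz / sample_rate_hz
    return 2.0 * radius * math.cos(theta), -radius * radius

def vocal_tract(source, a1, a2):
    """All-pole filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    y = []
    for n, x in enumerate(source):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(x + a1 * y1 + a2 * y2)
    return y

def inverse_filter(voice, a1, a2):
    """Remove the resonance: e[n] = y[n] - a1*y[n-1] - a2*y[n-2]."""
    e = []
    for n, y0 in enumerate(voice):
        y1 = voice[n - 1] if n >= 1 else 0.0
        y2 = voice[n - 2] if n >= 2 else 0.0
        e.append(y0 - a1 * y1 - a2 * y2)
    return e

source = impulse_train(length=160, period=40)          # sound source 110
a1, a2 = resonator_coefficients(500.0, 0.95, 8000.0)   # vocal tract 111
voice = vocal_tract(source, a1, a2)                    # waveform at lips 112
recovered = inverse_filter(voice, a1, a2)              # approximates `source`
```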
Here, the sound source characteristics and vocal tract characteristics are completely different characteristics; however, in the case of the abovementioned conventional technique using a band splitting filter, the voice is directly amplified without splitting the voice into sound source characteristics and vocal tract characteristics. Accordingly, the following problem arises: namely, the distortion of the sound source characteristics is great, so that the feeling of noise is increased, and the clarity deteriorates. An example is shown in FIGS. 5 and 6. FIG. 5 shows the input voice spectrum prior to enhancement processing. Furthermore, FIG. 6 shows the spectrum in a case where the input voice shown in FIG. 5 is enhanced by a method using a band splitting filter. In FIG. 6, the amplitude is amplified while maintaining the outline shape of the spectrum in the case of high band components of 2 kHz or greater. However, in the case of portions in the range of 500 Hz to 2 kHz (portions surrounded by circles in FIG. 6), it is seen that the spectrum differs greatly from the spectrum shown in FIG. 5 prior to enhancement, with a deterioration in the sound source characteristics.
Thus, in conventional methods using a band splitting filter, there is a danger that the distortion of the sound source characteristics will be great, so that the sound quality deteriorates.
Furthermore, in methods in which the abovementioned protruding portions or indented portions of the spectrum are amplified, the following problems exist.
First of all, as in the abovementioned conventional methods using a band splitting filter, the voice itself is directly enhanced without splitting the voice into sound source characteristics and vocal tract characteristics; accordingly, the distortion of the sound source characteristics is great, so that the feeling of noise is increased, thus causing a deterioration in clarity.
Secondly, direct formant enhancement is performed for the LPC (linear prediction coefficient) spectrum or FFT (fast Fourier transform) spectrum determined from the voice signal (input signal). Consequently, in cases where the input voice is processed for each frame, the conditions of enhancement (amplification factor or attenuation factor) vary between frames. Accordingly, if the amplification factor or attenuation factor varies abruptly between frames, the feeling of noise is increased by the fluctuation of the spectrum.
Such a phenomenon is illustrated in a bird's eye view spectrum diagram. FIG. 7 shows the spectrum of the input voice (prior to enhancement). Furthermore, FIG. 8 shows the voice spectrum in a case where the spectrum is enhanced in frame units. In particular, FIGS. 7 and 8 show voice spectra in which frames that are continuous in time are lined up. It is seen from FIGS. 7 and 8 that the higher formants are enhanced. However, discontinuities are generated in the enhanced spectrum at around 0.95 seconds and around 1.03 seconds in FIG. 8. Specifically, in the spectrum prior to enhancement shown in FIG. 7, the formant frequencies vary smoothly, while in FIG. 8, the formant frequencies vary discontinuously. Such discontinuities in the formants are sensed as a feeling of noise when the processed voice is actually heard.
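The frame-to-frame fluctuation of the amplification factor described above can be sketched as follows. The frame length of 4 samples, the constant input signal, and the per-frame gains are hypothetical values chosen only to make the effect visible; they are not derived from FIGS. 7 and 8. The sketch applies a separate gain to each frame and measures the step that appears at each frame boundary.

```python
def apply_frame_gains(samples, frame_length, gains):
    """Multiply each frame of samples by its own gain."""
    return [s * gains[n // frame_length] for n, s in enumerate(samples)]

def boundary_jumps(samples, frame_length):
    """Absolute sample-to-sample change across each frame boundary."""
    return [abs(samples[n] - samples[n - 1])
            for n in range(frame_length, len(samples), frame_length)]

frame_length = 4
steady_input = [1.0] * (4 * frame_length)   # input with no variation at all
frame_gains = [1.0, 1.1, 2.0, 1.2]          # gain changes abruptly at frame 3

output = apply_frame_gains(steady_input, frame_length, frame_gains)
jumps = boundary_jumps(output, frame_length)

# 1-based index of the frame boundary with the largest discontinuity.
largest_jump_boundary = jumps.index(max(jumps)) + 1
```

Although the input in this sketch is perfectly steady, the output contains steps at the frame boundaries, the largest being at the boundary where the gain changes from 1.1 to 2.0.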
In the conventional technique shown in FIG. 3, increasing the frame length is conceivable as a method for solving the problem of discontinuity, which is the second of the abovementioned problems. If the frame length is lengthened, average spectral characteristics with little variation over time are obtained. However, when the frame length is lengthened, the problem of a large delay time arises. In communications applications such as portable telephones and the like, it is necessary to minimize the delay time. Accordingly, methods that increase the frame length are undesirable in communications applications.
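The relationship between frame length and delay can be illustrated by simple arithmetic. The sampling rate of 8 kHz and the frame lengths of 160 and 640 samples are assumptions chosen for illustration and are not stated in the description. The sketch counts only the time needed to collect one full frame before processing can begin; any further processing or transmission delay would be added to this.

```python
def frame_delay_ms(frame_length_samples, sample_rate_hz=8000):
    """Time needed to collect one full frame before it can be processed."""
    return 1000.0 * frame_length_samples / sample_rate_hz

short_frame_delay = frame_delay_ms(160)   # 20.0 ms
long_frame_delay = frame_delay_ms(640)    # 80.0 ms
```

Under these assumptions, lengthening the frame by a factor of four also increases the buffering delay by a factor of four.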