1. Field of the Invention
The present invention relates to an estimation system and method, and more particularly, to a voiced/unvoiced information estimation system used in a vocoder, which improves the audio quality of a mixed voiced/unvoiced sound and is suitable for vector quantization at a low bit rate.
2. Discussion of the Related Art
Generally, a vocoder receives a human voice through a microphone, compresses the frequency distribution, strength, and waveform of the corresponding voice data into codes, and transmits them, while the receiving side decompresses the codes back into voice. Vocoders are used in many fields such as mobile communication terminals, exchanges, and video conference systems. The low bit rate vocoders needed for multimedia communication and voice storage systems such as NGN-IP (Next Generation Network-Intelligent Peripheral) or VoIP (Voice over Internet Protocol) are mostly CELP (Code-Excited Linear Prediction) vocoders.
Most vocoders having a bit rate of 4 to 13 Kbps are CELP vocoders, which are time domain vocoders. Most vocoders having a bit rate of less than 4 Kbps are frequency domain vocoders (also known as harmonic vocoders). The harmonic vocoder represents an excitation signal as a linear combination of harmonics of a fundamental frequency. Accordingly, for unvoiced signals the synthesized sound of the harmonic vocoder is less natural than that of the CELP vocoder, which represents an excitation signal in the form of white noise. However, for voiced signals, which make up most speech, the harmonic vocoder can produce good quality sound at a bit rate much lower than that of the CELP vocoder.
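As an illustration of the harmonic representation described above, the following minimal sketch builds an excitation signal as a sum of sinusoids at integer multiples of a fundamental frequency. The function name, its parameters, and the use of cosine components with explicit amplitudes and phases are assumptions made for illustration only; they are not taken from any particular vocoder.

```python
import math


def harmonic_excitation(f0_hz, amplitudes, phases, sample_rate_hz, num_samples):
    """Return num_samples of a sum of harmonics of f0_hz (hypothetical helper).

    amplitudes[i] and phases[i] belong to harmonic number i + 1, so the
    component frequencies are f0_hz, 2 * f0_hz, 3 * f0_hz, and so on.
    """
    samples = []
    for n in range(num_samples):
        t = n / sample_rate_hz  # time of sample n in seconds
        value = 0.0
        for h, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
            value += amp * math.cos(2.0 * math.pi * h * f0_hz * t + phase)
        samples.append(value)
    return samples


# Two harmonics of a 100 Hz fundamental, sampled at 8 kHz.
excitation = harmonic_excitation(100.0, [1.0, 0.5], [0.0, 0.0], 8000.0, 81)
```

Because every component is a multiple of the fundamental, the result repeats with the period of the fundamental (80 samples in this example), which is why such a model suits voiced speech better than noise-like unvoiced speech.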
Vocoders having a very low bit rate of less than 4 Kbps (which are expected to become an important matter of concern) are mostly harmonic speech coders requiring harmonic analysis. Generally, the harmonic speech coder is composed of a harmonic analyzer and a harmonic synthesizer. In the harmonic analyzer, the part that most affects the complexity and audio quality of the harmonic coder is the voiced/unvoiced information estimation module, which estimates the voicing level in each frequency band. The harmonic analyzer analyzes harmonic parameters and calculates voicing levels, which are then quantized and transmitted. The harmonic synthesizer mixes a voiced element and an unvoiced element according to the quantized voicing level and harmonic parameters transmitted from the harmonic encoder.
In the conventional voiced/unvoiced estimation method, three harmonic bands are combined and set as one voicing level decision band. As illustrated in FIG. 1, the voiced/unvoiced information estimation unit adopting this method includes a spectrum difference calculation unit 10, a threshold calculation unit 20, and a voiced/unvoiced information binary decision unit 30.
Here, the spectrum difference calculation unit 10 performs a normalization process that divides the difference energy between an input spectrum and a synthetic spectrum by the spectrum energy in the current voicing level decision band. The threshold calculation unit 20 calculates the threshold for deciding a voicing level using the spectrum energy distribution, the fundamental frequency, and the voiced/unvoiced information in the previous frame. The voiced/unvoiced information binary decision unit 30 makes a binary decision on the voicing level in the current voicing level decision band by comparing the normalized spectrum difference energy with the threshold.
Therefore, if the normalized spectrum difference energy in the current voicing level decision band is higher than the threshold, the voicing level in that band is set to 0, which indicates an unvoiced band. Conversely, if the normalized spectrum difference energy in the current voicing level decision band is lower than the threshold, the voicing level in that band is set to 1, which indicates a voiced band. In the conventional method, three harmonic bands are combined and set as one voicing level decision band to decrease the encoding bit rate, and the maximum number of voicing level decision bands is limited to 12.
The encoder transmits the obtained binary voiced/unvoiced decision information. Using this information, the decoder synthesizes an unvoiced signal for each harmonic band in which the value of the binary voiced/unvoiced decision information is 0, and otherwise synthesizes a voiced signal. Finally, it adds the unvoiced signal and the voiced signal.
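The way a decoder might apply the band-level decisions to individual harmonic bands can be sketched as follows. This is a hypothetical illustration: the function name, the default of three harmonic bands per decision band, and the rule that harmonics beyond the last decision band inherit its flag are all assumptions, not details stated in the description above.

```python
def per_harmonic_flags(band_flags, num_harmonics, harmonics_per_band=3):
    """Expand one 0/1 flag per decision band into one flag per harmonic band.

    Assumption: harmonics that lie past the last decision band reuse the
    flag of that last band.
    """
    if not band_flags:
        raise ValueError("at least one decision band flag is required")
    flags = []
    for h in range(num_harmonics):
        band = min(h // harmonics_per_band, len(band_flags) - 1)
        flags.append(band_flags[band])
    return flags


# Two decision bands (voiced, unvoiced) applied to seven harmonic bands.
flags = per_harmonic_flags([1, 0], 7)
```

A synthesizer following this sketch would generate a sinusoidal (voiced) component for each harmonic band flagged 1 and a noise-like (unvoiced) component for each band flagged 0, then sum the two signals.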
The method used in the conventional voiced/unvoiced information estimation system will be explained with reference to FIG. 2. First, an input spectrum is obtained by Fourier transformation of a voice input signal in S11. FIG. 3A illustrates a voice signal in the time domain. FIG. 3B illustrates a voice spectrum in the frequency (harmonic) domain after Fourier transformation. In addition, a synthetic spectrum is obtained by using a fundamental frequency, harmonic parameters, and a window spectrum.
When an input spectrum and a synthetic spectrum are obtained in S13, a plurality of harmonic bands, i.e., three harmonic bands, are combined and set as one voicing level decision band. That is, the first three harmonic bands are combined and set as the first (k=1) voicing level decision band, and the next three harmonic bands are combined and set as the second (k=2) voicing level decision band. In this way, the harmonic bands are assigned to the first voicing level decision band through the last (k=K) voicing level decision band. Here, three harmonic bands are set as one voicing level decision band to decrease the encoding bit rate, and the maximum number of voicing level decision bands is usually limited to 12.
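The grouping of harmonic bands into voicing level decision bands can be sketched as below. The function name and the handling of edge cases are assumptions: the text does not say what happens when the number of harmonic bands is not a multiple of three or exceeds 36, so this sketch lets the last decision band absorb any remaining harmonic bands.

```python
def group_harmonics(num_harmonics, harmonics_per_band=3, max_bands=12):
    """Return (first, last) harmonic indices, 1-based and inclusive, per decision band.

    Assumption: once max_bands is reached, the last decision band is widened
    to cover all remaining harmonic bands.
    """
    bands = []
    first = 1
    while first <= num_harmonics and len(bands) < max_bands:
        last = min(first + harmonics_per_band - 1, num_harmonics)
        bands.append((first, last))
        first = last + 1
    if bands and bands[-1][1] < num_harmonics:
        bands[-1] = (bands[-1][0], num_harmonics)
    return bands


# Seven harmonic bands give K = 3 decision bands; the last one is narrower.
bands = group_harmonics(7)
```

With 50 harmonic bands, this sketch still yields K = 12 decision bands, and the twelfth spans harmonic bands 34 through 50.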
When each voicing level decision band is set in S15, the spectrum difference calculation unit 10 performs a normalization process for obtaining a difference between the input spectrum and the synthetic spectrum in the first (k=1) voicing level decision band. The difference is then divided by the input spectrum energy in the current voicing level decision band to obtain the first normalized spectrum difference energy Ek.
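A minimal sketch of the normalization in this step is given below, assuming the spectra are available as sequences of real or complex samples restricted to one voicing level decision band. The function name, the squared-magnitude definition of energy, and the value returned for a band with zero input energy are assumptions made for illustration.

```python
def normalized_difference_energy(input_spectrum, synthetic_spectrum):
    """Return Ek: difference energy divided by input energy within one band.

    Both arguments hold the spectral samples of the same decision band.
    Assumption: a band with zero input energy is reported as 1.0 (fully
    mismatched), since the text does not define this case.
    """
    diff_energy = sum(
        abs(s - s_hat) ** 2 for s, s_hat in zip(input_spectrum, synthetic_spectrum)
    )
    band_energy = sum(abs(s) ** 2 for s in input_spectrum)
    if band_energy == 0.0:
        return 1.0
    return diff_energy / band_energy


# A synthetic spectrum that captures only part of the input band.
e_k = normalized_difference_energy([3.0, 4.0], [3.0, 0.0])
```

A value near 0 means the harmonic model fits the band well, while a value near 1 means the model explains almost none of the band energy.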
When the first normalized spectrum difference energy Ek is obtained in S17, the threshold calculation unit 20 calculates a threshold ξk for deciding the voicing level in the first voicing level decision band by using the voiced/unvoiced information in the previous frame.
When the calculation of the threshold ξk is completed in S19, the voiced/unvoiced binary decision unit 30 compares the normalized spectrum difference energy Ek in the first voicing level decision band with the threshold ξk.
If the normalized spectrum difference energy Ek in the first voicing level decision band is lower than the threshold ξk, the voiced/unvoiced binary decision unit 30 determines the value Vk of the voicing level in the current voicing level decision band to be 1 and the current voicing level decision band to be a voiced band in S21. Conversely, if the normalized spectrum difference energy Ek in the current voicing level decision band is higher than the threshold ξk, the voiced/unvoiced binary decision unit 30 determines the value Vk of the voicing level in the current voicing level decision band to be 0 and the current voicing level decision band to be an unvoiced band in S24.
In S25, it is judged whether or not the current voicing level decision band, i.e., the first (k=1) voicing level decision band, is the last (k=K) voicing level decision band of a predetermined total number K of voicing level decision bands (for example, 12 voicing level decision bands).
Since the first (k=1) voicing level decision band is not the last (k=K) voicing level decision band, the value Vk of a voicing level in the second voicing level decision band is decided by performing the above-described process for the second (k=2) voicing level decision band in S27.
Accordingly, by sequentially performing the process of obtaining the value Vk of the voicing level for each voicing level decision band, the last (k=K) voicing level decision band, i.e., the 12th voicing level decision band, is decided to be a voiced band or an unvoiced band. When this occurs, the voiced/unvoiced information estimation process is finished without proceeding to the next step.
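The band-by-band loop described above can be summarized in a short sketch. It assumes the normalized spectrum difference energies Ek and the thresholds ξk have already been computed for every band; the function name and the treatment of the case where Ek equals ξk (here decided as unvoiced) are assumptions, since the text only covers the strictly lower and strictly higher cases.

```python
def estimate_binary_voicing(energies, thresholds):
    """Return the list of Vk values, one 0/1 decision per decision band.

    energies[k] is Ek and thresholds[k] is the threshold for band k.
    Vk is 1 (voiced) when Ek is below the threshold, else 0 (unvoiced).
    Assumption: equality is treated as unvoiced.
    """
    if len(energies) != len(thresholds):
        raise ValueError("one threshold is required per decision band")
    decisions = []
    for e_k, xi_k in zip(energies, thresholds):  # k = 1 .. K in order
        decisions.append(1 if e_k < xi_k else 0)
    return decisions


# Three decision bands: only the first fits the harmonic model well enough.
v = estimate_binary_voicing([0.1, 0.6, 0.3], [0.4, 0.4, 0.2])
```

Each band contributes exactly one bit, so a frame with K decision bands carries K bits of voiced/unvoiced information.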
When observing a voice spectrum, it is often the case that a voiced element and an unvoiced element are mixed in a single voicing level decision band. However, according to the conventional voiced/unvoiced information estimation method, a single binary value (either 0 or 1) is decided for each group of three harmonic bands. As a result, the spectrum in those harmonic bands is represented entirely as a voiced sound or entirely as an unvoiced sound. Thus, if voiced and unvoiced elements are mixed in the same voicing level decision band, the spectrum cannot be represented accurately, and the reproduced audio sounds unnatural.
Three harmonic bands are set as one voicing level decision band in order to decrease the number of quantization bits, but doing so lowers the frequency resolution of the voiced/unvoiced information.
In addition, since the voiced/unvoiced information is binary, an error in the threshold is very likely to reduce the audio quality drastically. That is, because there is no value representing an intermediate level, the voiced/unvoiced information can take the value opposite to the correct one if the threshold is wrongly calculated. Furthermore, because the number of binary voiced/unvoiced values equals the number of quantization bits, the voicing level decision band must be widened in order to reduce the number of bits. This further lowers the frequency resolution of the voiced/unvoiced information, so the voiced/unvoiced information decision process needs to be modified.
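The trade-off between bit count and frequency resolution can be made concrete with a small calculation. The sketch below assumes one bit per voicing level decision band, as stated above; the function name and the optional cap on the number of bands are illustrative assumptions.

```python
import math


def voicing_bits(num_harmonics, harmonics_per_band, max_bands=None):
    """Return the number of binary voicing values (bits) per frame.

    One bit is spent per decision band, so widening the band reduces the
    bit count at the cost of coarser frequency resolution.
    """
    bands = math.ceil(num_harmonics / harmonics_per_band)
    if max_bands is not None:
        bands = min(bands, max_bands)
    return bands


# 36 harmonic bands: 1, 3, or 6 harmonic bands per decision band.
bits_fine = voicing_bits(36, 1)
bits_conventional = voicing_bits(36, 3, max_bands=12)
bits_coarse = voicing_bits(36, 6)
```

For 36 harmonic bands this gives 36, 12, and 6 bits respectively, and in the coarsest case a single binary value must describe six harmonic bands at once.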