1. Field of the Invention
The present invention relates to a communication apparatus such as a mobile phone communicating through speech coding processing, particularly a speech decoder, speech decoding method, et cetera, comprised by the communication apparatus to improve voice clarity for ease of hearing of the received voice.
2. Description of the Related Art
Mobile phones have become widely spread in recent years. In mobile phone systems, speech coding techniques are used for compressing the voice in order to better utilize communication lines. Among such speech coding techniques, the CELP (Code Excited Linear Prediction) system is known as a coding method for providing good voice quality at a low bit rate and the CELP-based coding method is adopted by many voice coding standards such as ITU-T G. 729 system, 3GPP AMR system, et cetera. The CELP algorithm-based method is also the most commonly used technique for voice compression used for VoIP (Voiceover Internet Protocol), video conference system, et cetera, and is not limited to the mobile phone system.
Here, CELP is summarized. A speech coding method introduced by Messrs. M. R. Schroder and B. S. Atal in 1985, the CELP extracts parameters from the input voice based on a human voice creation model for transmitting the parameters through coding, thereby accomplishing highly efficient information compression.
FIG. 16 shows a voice creation model, in the process of which a vocal source signal generated by a vocal source (i.e., vocal chords) 110 is input to an articulatory system (i.e., vocal tract) 111, where a vocal tract characteristic is added, and a voice wave is finally output from the lips 112 (refer to the non-patent document 1). That is, the voice is made up of vocal source and vocal tract characteristics.
FIG. 17 shows the process flow of CELP coding and decoding.
FIG. 17 shows how a CELP coder and decoder are equipped in a mobile phone for example, and a voice signal (i.e., voice code code) is transmitted from the CELP coder 120 equipped in the transmitting mobile phone to the CELP decoder 130 equipped in the receiving mobile phone by way of a transmission path (not-shown; e.g., wireless communication line, mobile phone network, et cetera).
In the CELP coder 120 equipped in the transmitting mobile phone, a parameter extraction unit 121 analyzes the input voice based on the above mentioned voice generation model to separate the input voice into LPC (Linear Predictor Coefficients) indicating the vocal tract characteristics and a vocal source signal. The parameter extraction unit 121 further extracts an ACB (Adaptive CodeBook) vector indicating a cyclical component of the vocal source signal, an SCB (Stochastic CodeBook) vector indicating a non-cyclical component thereof, and a gain of each vector.
Then a coding unit 122 codes the LPC, ACB vector, SCB vector and the gain to generate the LPC code, ACB code, SCB code and gain code so that a code multiplexer unit 123 multiplexes them to generate a voice code code to transmit to the receiving mobile phone.
In the CELP decoder 130 equipped in the receiving mobile phone, a code separation unit 131 first separates the transmitted voice code code into the LPC code, ACB code, SCB code and gain code so that a decoder 132 decodes them to the LPC, ACB vector, SCB vector and gain, respectively. Then a voice synthesis unit 133 synthesizes a voice according to the decoded parameters.
The following detailed descriptions are of the CELP coder and the CELP decoder.
FIG. 18 is a block diagram of parameter extraction unit 121 equipped in the CELP coder.
In the CELP, an input voice is coded in the unit of frames of a certain length. First, an LPC analysis unit 141 calculates an LPC from the input voice according to a known LPC (Linear Prediction Coefficients) analysis method. The LPC is a filter coefficient when a vocal tract characteristic is approximated by an all pole linear filter.
Next, extracts a vocal source signal by using an AbS (Analysis by Synthesis) method. In the CELP, a voice is reproduced by inputting a vocal source signal to an LPC synthesis filter 142 constituted by the LPC. Therefore, a differential power evaluation unit 145 searches a combination of the CodeBooks where a differential error with the input voice becomes a minimum when a voice is synthesized by the LPC synthesis filter 142 from among the voice source candidates constituted by combinations among a plurality of ACB vectors stored in an ACB 143, a plurality of SCB vectors stored in an SCB 144 and the gains of the aforementioned two vectors to extract an ACB vector, SCB vector, ACB gain and SCB gain.
As described above, the coding unit 122 codes each parameter extracted by the above described operation to obtain an LPC code, ACB code, SCB code and gain code. The code multiplexer unit 123 multiplies each obtained code to transmit to the decoding side as a voice code code.
The next description is of the CELP decoder in further detail.
FIG. 19 shows a block diagram of the CELP decoder 130.
In the CELP decoder, the code separation unit 131 separates each parameter from the transmitted voice code code as described above to obtain an LPC code, an ACB code, an SCB code and a gain code.
Next, an LPC decoder 151, ACB vector decoder 152, SCB vector decoder 153 and gain decoder 154 all constituting the decoding unit 132 respectively decode the LPC code, the ACB code, the SCB code and the gain code to obtain an LPC, an ACB vector, an SCB vector and the gains (i.e., ACB gain and SCB gain), respectively.
The voice synthesis unit 133 generates a vocal source signal from the input ACB vector, SCB vector and the gains (i.e., ACB gain and SCB gain) by the shown configuration, and inputs the vocal source signal into the LPC synthesis filter 155 structured by the above described decoded LPC to thereby decode and output a voice.
Incidentally, a mobile phone is often used not only in a quiet place but also in a noisy environment surrounded by noise such as an airport or the platform of a railway station. In such a case the remote user is faced with a problem of difficulty in hearing the received voice impaired by the ambient noise. Not only that, in a video conference system for instance, which is usually used at home the user is surrounded by background noises such as those emitted by electric appliances such as air conditioners and the noise of the activity of people nearby.
As a countermeasure to such problems there are several known techniques to improve a received voice by improving clarity thereof by emphasizing the formants of the frequency spectrum of the receiving voice.
The following is a brief description of formants.
FIG. 20 exemplifies a frequency spectrum of a voice.
There is usually a plurality of peaks (showing relative maximum values) in the frequency spectrum of a voice, which are called formants. FIG. 20 exemplifies a spectrum with three formants (i.e., peaks), which are referred to as first, second and third formants from the lower frequency toward the higher frequency. The frequencies with relative maximum values, that is, the frequency of each of the formants, fp(1), fp(2) and fp(3), is called a formant frequency. Generally speaking, a frequency spectrum of a voice has the characteristic of the amplitude (i.e., power) decreasing with the frequency. Furthermore, it is known that the clarity of a voice is closely related with its formants, with an improved level of clarity possible by emphasizing the formants of higher levels (e.g., second and third formants).
FIG. 21 exemplifies formant emphasis on a voice spectrum.
The wave delineated by the solid line in FIG. 21 (a), and the wave delineated by the dotted line in FIG. 21 (b) are voice spectra before an emphasis. The wave delineated by the solid line in FIG. 21 (b) shows a voice spectrum after emphasis. The straight line in the figure indicates the inclination of the spectrum.
It is known that emphasizing the voice spectrum so as to increase the amplitude of higher level formants, flattening the inclination of whole spectrum as shown by FIG. 21 (b), improves the clarity of entire voice.
The, following techniques are known as such formant emphasis techniques.
The technique noted by the patent document 1 is an example of applying formant emphasis to a coded voice.
FIG. 22 shows a basic configuration of the invention noted in the patent document 1 which relates to a technique using a band division filter. As understood by FIG. 22, in the technique noted by the patent document 1, a spectrum estimation unit 160 figures out the spectrum of the input voice, and the convex/concave band decision unit 161 determines convex (i.e., peak) and concave (i.e., trough) bands based on the calculated spectrum and calculates an amplification ratio (or attenuation ratio) for the convex and concave bands.
Then, a filter configuration unit 162 provides a filter unit 163 with a coefficient for accomplishing the above described amplification ratio (or attenuation ratio) and inputs the input voice to the filter unit 163 for spectrum emphasis.
There has been a problem of emphasizing components other than the formants resulting in a degraded clarity, associated with the method using the band division filter because there has conventionally been no guarantee that a voice formant will be included in each frequency band.
Contrarily, the method noted by the patent document 1, being a method based on a band division filter, respectively amplifies and attenuates the peaks and troughs of the voice spectrum individually, thereby accomplishing emphasis of the voice.
Furthermore in the patent document 1, a voice decoding unit decodes an ABC vector, SCB vector and gains to generate a vocal source by using an ABC vector index, SCB vector index and gain index to generate a synthesis signal by filtering the voice source with a synthesis filter constituting an LPC decoded by the LPC index in the case of using the CELP method as presented by the seventh embodiment shown by FIG. 19 therein. Then the above described spectrum emphasis is accomplished by input of the synthesis signal and LPC to a spectrum emphasis unit.
Meanwhile, the invention proposed by patent document 2, being a voice signal processing apparatus applying to a post filter for a voice synthesis system comprised of a voice decoding apparatus for MBE (Multi-Band Excitation coding), is characterized by emphasizing the formants in the high frequencies of a frequency spectrum by maneuvering directly the amplitude value of each band as a parameter for frequency area. The formant emphasis method proposed in the patent document 2 is one estimating a band containing a formant based on the average amplitude of a plurality of frequency bands divided in accordance with a pitch frequency in the MBE method.
Meanwhile, the invention proposed by patent document 3, being an “analysis method by synthesis” with a reference signal which is a signal suppressing a noise gain, that is, a voice coding apparatus performing coding processing by using the A-b-S method, comprises a series of means for emphasizing the formant of the reference signal, dividing a signal into a voice component and a noise component and suppressing the level of the noise component. In the processing, an LPC is extracted from the input signal frame by frame and the above described formant emphasis is applied based on the LPC.
Meanwhile, the invention proposed by patent document 4 relates to a vocal source search (i.e., multi-pass search) for multi-pass voice coding, that is, aiming to improve the compression efficiency by searching a vocal source after emphasizing the voice in the linear spectrum, instead of searching the vocal source by using the input voice as is when searching the vocal source information through approximating by multi-pass.
[Patent document 1] Japanese unexamined patent application publication No. 2001-117573
[Patent document 2] Japanese unexamined patent application publication No. 6-202695
[Patent document 3] Japanese unexamined patent application publication No. 8-272394
[Patent document 4] Japanese registered patent No. 7-38118
[Non-patent document 1] “High efficiency coding of voice” authored by Kazuo Nakata pp. 69 through 71; published by Morikita Shuppan Co., Ltd.
The above noted conventional techniques are faced with problems respectively as described in the following.
First of all, the method noted in the patent document 1 is faced with the following problem.
As noted above, the patent document 1 shows an example method in the seventh embodiment shown by FIG. 7 therein to accomplish spectrum emphasis by the input of a synthesis signal and LPC to the spectrum emphasis unit, corresponding to the case of using the CELP method. A vocal source signal, however, is different from a vocal tract characteristic as understood by the above described voice generation model. The difference notwithstanding, the method noted by the patent document 1 makes it possible for a synthesized voice be emphasized by the emphasis filter obtained from the vocal tract characteristic, causing an enlarged distortion of the vocal source signal contained by the synthesized voice, sometimes resulting in side effects such as an increased sense of noise and a degraded clarity.
Meanwhile, the invention proposed by the patent document 2 aims at improving the quality of voice reproduced by an MBE vocoder (i.e., voice coder) as described above. Currently the mainstream technique of voice compression systems used for mobile phone systems, VoIP, video conference systems, et cetera, is based on the CELP algorithm using linear prediction. Therefore, an application of the technique noted by the patent document 2 is faced with the problem of further degradation of voice quality because the coding parameters for the MBE vocoder are extracted from a degraded quality of voice having been compressed and decompressed.
Meanwhile, the invention proposed by the patent document 3 makes it possible for a simple IIR filter using an LPC for emphasizing the formant, which is known as emphasizing the formant erroneously through a published research paper (e.g., Acoustical Society of Japan: Lecture Papers; published in March 2000; pp. 249 and 250), et cetera. In addition, the invention proposed by the patent document 3 basically relates to a voice coding apparatus instead of a voice decoding apparatus.
Meanwhile, the invention proposed by the patent document 4 aims at improving the efficiency of compression by searching a vocal source and specifically, when searching voice information through approximation by multi-pass, by searching the vocal source after emphasizing the voice in a linear spectrum instead of using the input voice as is, not aiming at clarity of voice.
The challenge of the present invention is to provide a speech decoder, a speech decoding method, the program thereof and a storage media for suppressing side effects of formant emphasis such as a degradation of voice quality and an increased sense of noisiness, and improving the clarity of reproduced voice and easy hearing of the receiving voice in equipment (e.g., mobile phone) using a speech coding method of an analysis-synthesis system.