The present invention generally relates to the field of coding and decoding synthesized speech and, more particularly, to an adaptive multi-rate wideband speech codec.
Many methods of coding speech today are based upon linear predictive (LP) coding, which extracts perceptually significant features of a speech signal directly from a time waveform rather than from a frequency spectrum of the speech signal (as does a channel vocoder or a formant vocoder). In LP coding, a speech waveform is first analyzed (LP analysis) to determine a time-varying model of the vocal tract excitation that caused the speech signal, as well as a transfer function. A decoder (in a receiving terminal, in case the coded speech signal is telecommunicated) then recreates the original speech using a synthesizer (for performing LP synthesis) that passes the excitation through a parameterized system that models the vocal tract. The parameters of the vocal tract model and the excitation of the model are both periodically updated to adapt to corresponding changes that occurred in the speaker as the speaker produced the speech signal. Between updates, i.e. during any specification interval, however, the excitation and parameters of the system are held constant, and so the process executed by the model is a linear time-invariant process. The overall coding and decoding (distributed) system is called a codec.
In a codec using LP coding to generate speech, the decoder needs the coder to provide three inputs: a pitch period if the excitation is voiced, a gain factor, and predictor coefficients. (In some codecs, the nature of the excitation, i.e. whether it is voiced or unvoiced, is also provided, but it is not normally needed in the case of an Algebraic Code Excited Linear Predictive (ACELP) codec, for example.) LP coding is predictive in that it uses prediction parameters based on the actual input segments of the speech waveform (during a specification interval) to which the parameters are applied, in a process of forward estimation.
Basic LP coding and decoding can be used to digitally communicate speech at a relatively low data rate, but it produces synthetic-sounding speech because of its very simple system of excitation. A so-called Code Excited Linear Predictive (CELP) codec is an enhanced excitation codec. It is based on "residual" encoding. The modeling of the vocal tract is in terms of digital filters whose parameters are encoded in the compressed speech. These filters are driven, i.e. "excited," by a signal that represents the vibration of the original speaker's vocal cords. A residual of an audio speech signal is the (original) audio speech signal less the digitally filtered audio speech signal. A CELP codec encodes the residual and uses it as a basis for excitation, in what is known as "residual pulse excitation." However, instead of encoding the residual waveforms on a sample-by-sample basis, CELP uses a waveform template selected from a predetermined set of waveform templates to represent a block of residual samples. A codeword is determined by the coder and provided to the decoder, which then uses the codeword to select a residual sequence to represent the original residual samples.
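The template-matching step described above can be sketched as follows. This is an illustrative example with a hypothetical four-entry codebook and a plain mean-squared-error search; a real CELP coder instead minimizes a perceptually weighted error and jointly optimizes the gain.

```python
import numpy as np

def select_codeword(residual, codebook):
    """Pick the codebook entry closest (in mean-squared error) to a
    block of residual samples; the entry's index is the codeword
    transmitted to the decoder. Illustrative sketch only."""
    errors = [np.mean((residual - entry) ** 2) for entry in codebook]
    return int(np.argmin(errors))

# Hypothetical 4-entry codebook of 5-sample residual templates.
codebook = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, -0.5, 0.2, 0.0, 0.0],
    [0.0, 1.0, -0.5, 0.2, 0.0],
    [-1.0, 0.5, -0.2, 0.0, 0.0],
])
residual = np.array([0.9, -0.4, 0.1, 0.0, 0.0])
codeword = select_codeword(residual, codebook)  # entry 1 is closest
```

The decoder holds the same codebook and simply indexes into it with the received codeword to recover an approximation of the residual block.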
According to the Nyquist theorem, a speech signal with a sampling rate Fs can represent a frequency band from 0 to 0.5 Fs. Today, most speech codecs (coders-decoders) use a sampling rate of 8 kHz, but mobile telephone stations are being developed that will use a sampling rate of 16 kHz. Increasing the sampling rate from 8 kHz improves the naturalness of speech because higher frequencies can be represented; according to the Nyquist theorem, a sampling rate of 16 kHz can represent speech in the frequency band of 0-8 kHz. The sampled speech is then coded for communication by a transmitter, and then decoded by a receiver. Speech coding of speech sampled using a sampling rate of 16 kHz is called wideband speech coding.
When the sampling rate of speech is increased, coding complexity also increases. With some algorithms, as the sampling rate increases, coding complexity can even increase exponentially. Therefore, coding complexity is often a limiting factor in determining an algorithm for wideband speech coding. This is especially true, for example, with mobile telephone stations where power consumption, available processing power, and memory requirements critically affect the applicability of algorithms.
In the prior-art wideband codec shown in FIG. 1, a pre-processing stage low-pass filters and down-samples the input speech signal from the original sampling frequency of 16 kHz to 12.8 kHz. The down-sampled signal is then decimated so that the 320 samples within each 20 ms frame are reduced to 256. The down-sampled and decimated signal, with an effective frequency bandwidth of 0 to 6.4 kHz, is encoded using an Analysis-by-Synthesis (A-b-S) loop to extract LPC, pitch and excitation parameters, which are quantized into an encoded bit stream to be transmitted to the receiving end for decoding. In the A-b-S loop, a locally synthesized signal is further up-sampled and interpolated to match the original sampling frequency. After the encoding process, the frequency band of 6.4 kHz to 8.0 kHz is empty. The wideband codec generates random noise on this empty frequency range and colors the random noise with LPC parameters by synthesis filtering, as described below.
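The rate conversion performed by the pre-processing stage (320 samples per 20 ms frame at 16 kHz down to 256 samples at 12.8 kHz, a 4/5 ratio) can be sketched with an FFT-based band-limiting resampler. This is an illustrative stand-in only: truncating the spectrum plays the role of the low-pass filter, whereas the actual codec uses a time-domain filter-and-decimate stage.

```python
import numpy as np

def resample_frame(frame_16k, n_out=256):
    """Resample one 20 ms frame from 16 kHz (320 samples) to 12.8 kHz
    (256 samples). Keeping only the lowest n_out//2 + 1 spectral bins
    discards content above 6.4 kHz, i.e. it band-limits the signal
    before the rate change. Illustrative sketch only."""
    n_in = len(frame_16k)
    spectrum = np.fft.rfft(frame_16k)           # bins up to 8 kHz
    spectrum = spectrum[: n_out // 2 + 1]       # keep 0 to 6.4 kHz
    return np.fft.irfft(spectrum, n=n_out) * (n_out / n_in)

# A 1 kHz test tone: exactly 20 cycles per 20 ms frame.
frame = np.cos(2 * np.pi * 1000 * np.arange(320) / 16000)
out = resample_frame(frame)   # 256 samples at 12.8 kHz
```

Because the tone falls on an exact FFT bin, the output is the same 1 kHz tone re-expressed at the 12.8 kHz rate, with amplitude preserved by the `n_out / n_in` scaling.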
The random noise is first scaled according to
e_scaled = sqrt[{exc^T(n)exc(n)}/{e^T(n)e(n)}] e(n)    (1)
where e(n) represents the random noise and exc(n) denotes the LPC excitation. The superscript T denotes the transpose of a vector. The scaled random noise is filtered using the coloring LPC synthesis filter and a 6.0-7.0 kHz band pass filter. This colored, high-frequency component is further scaled using the information about the spectral tilt of the synthesized signal. The spectral tilt is estimated by calculating the first autocorrelation coefficient, r, using the following equation:
r = {s^T(i)s(i-1)}/{s^T(i)s(i)}    (2)
where s(i) is the synthesized speech signal. Accordingly, the estimated gain f_est is determined from
f_est = 1.0 - r    (3)
with the limitation 0.2 ≤ f_est ≤ 1.0.
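Equations (1) through (3) can be expressed directly in code. The following sketch uses hypothetical signal lengths and random test vectors; the function names are not taken from any codec implementation.

```python
import numpy as np

def scale_noise(e, exc):
    """Eq. (1): scale the random noise e(n) so that its energy
    matches that of the LPC excitation exc(n)."""
    return np.sqrt(np.dot(exc, exc) / np.dot(e, e)) * e

def spectral_tilt(s):
    """Eq. (2): first normalized autocorrelation coefficient r of
    the synthesized signal s(i)."""
    return np.dot(s[1:], s[:-1]) / np.dot(s, s)

def estimated_gain(s):
    """Eq. (3): f_est = 1.0 - r, limited to 0.2 <= f_est <= 1.0."""
    return float(np.clip(1.0 - spectral_tilt(s), 0.2, 1.0))

rng = np.random.default_rng(0)
e = rng.standard_normal(256)           # random noise
exc = 2.0 * rng.standard_normal(256)   # hypothetical LPC excitation
scaled = scale_noise(e, exc)           # same energy as exc
```

A positive tilt r (energy concentrated at low frequencies) yields a small gain, so little high-band noise is injected; a flat or negative tilt yields a gain near 1.0.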
At the receiving end, after the core decoding process, the synthesized signal is further post-processed to generate the actual output by up-sampling the signal to meet the input signal sampling frequency. Because the high frequency noise level is estimated based on the LPC parameters obtained from the lower frequency band and the spectral tilt of the synthesized signal, the scaling and coloring of the random noise can be carried out in the encoder end or the decoder end.
In the prior-art codec, the high frequency noise level is estimated based on the base layer signal level and spectral tilt, while the high frequency components of the input signal are filtered away before encoding. Hence, the noise level does not correspond to the actual input signal characteristics in the 6.4-8.0 kHz frequency range, and the prior-art codec does not provide a high quality synthesized signal.
It is advantageous and desirable to provide a method and a system capable of providing a high quality synthesized signal taking into consideration the actual input signal characteristics in the high frequency range.
It is a primary objective of the present invention to improve the quality of synthesized speech in a distributed speech processing system. This objective can be achieved by using the input signal characteristics of the high frequency components in the original speech signal in the 6.0 to 7.0 kHz frequency range, for example, to determine the scaling factor of a colored, high-pass filtered artificial signal in synthesizing the higher frequency components of the synthesized speech during active speech periods. During non-active speech periods, the scaling factor can be determined by the lower frequency components of the synthesized speech signal.
Accordingly, the first aspect of the present invention is a method of speech coding for encoding and decoding an input signal having active speech periods and non-active speech periods, and for providing a synthesized speech signal having higher frequency components and lower frequency components, wherein the input signal is divided into a higher frequency band and lower frequency band in encoding and speech synthesizing processes and wherein speech related parameters characteristic of the lower frequency band are used to process an artificial signal for providing the higher frequency components of the synthesized speech signal. The method comprises the steps of:
scaling the processed artificial signal with a first scaling factor during the active speech periods, and
scaling the processed artificial signal with a second scaling factor during the non-active speech periods, wherein the first scaling factor is characteristic of the higher frequency band of the input signal, and the second scaling factor is characteristic of the lower frequency components of the synthesized speech.
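The two scaling steps above amount to selecting between two gain factors according to voice activity. A minimal sketch, with all names hypothetical:

```python
def scale_highband(processed_artificial, active, g_active, g_inactive):
    """Scale the processed artificial (high-band) signal. During
    active speech periods the first factor (derived from the input
    signal's higher frequency band) is used; during non-active
    periods the second factor (derived from the lower frequency
    components of the synthesized speech) is used. Sketch only."""
    g = g_active if active else g_inactive
    return [g * x for x in processed_artificial]
```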
Preferably, the input signal is high-pass filtered for providing a filtered signal in a frequency range characteristic of the higher frequency components of the synthesized speech, wherein the first scaling factor is estimated from the filtered signal, and wherein when the non-active speech periods include speech hangover periods and comfort noise periods, the second scaling factor for scaling the processed artificial signal in the speech hangover periods is estimated from the filtered signal.
Preferably, the second scaling factor for scaling the processed artificial signal during the speech hangover periods is also estimated from the lower frequency components of the synthesized speech, and the second scaling factor for scaling the processed artificial signal during the comfort noise periods is estimated from the lower frequency components of the synthesized speech signal.
Preferably, the first scaling factor is encoded and transmitted within the encoded bit stream to a receiving end and the second scaling factor for the speech hangover periods is also included in the encoded bit stream.
It is possible that the second scaling factor for speech hangover periods is determined in the receiving end.
Preferably, the second scaling factor is also estimated from a spectral tilt factor determined from the lower frequency components of the synthesized speech.
Preferably, the first scaling factor is further estimated from the processed artificial signal.
The second aspect of the present invention is a speech signal transmitter and receiver system for encoding and decoding an input signal having active speech periods and non-active speech periods and for providing a synthesized speech signal having higher frequency components and lower frequency components, wherein the input signal is divided into a higher frequency band and a lower frequency band in the encoding and speech synthesizing processes, wherein speech related parameters characteristic of the lower frequency band of the input signal are used to process an artificial signal in the receiver for providing the higher frequency components of the synthesized speech. The system comprises:
a decoder in the receiver for receiving an encoded bit stream from the transmitter, wherein the encoded bit stream contains the speech related parameters;
a first module in the transmitter, responsive to the input signal, for providing a first scaling factor for scaling the processed artificial signal during the active periods, and
a second module in the receiver, responsive to the encoded bit stream, for providing a second scaling factor for scaling the processed artificial signal during the non-active periods, wherein the first scaling factor is characteristic of the higher frequency band of the input signal and the second scaling factor is characteristic of the lower frequency components of the synthesized speech.
Preferably, the first module includes a filter for high pass filtering the input signal and providing a filtered input signal having a frequency range corresponding to the higher frequency components of the synthesized speech so as to allow the first scaling factor to be estimated from the filtered input signal.
Preferably, a third module in the transmitter is used for providing a colored, high-pass filtered random noise in the frequency range corresponding to the higher frequency components of the synthesized signal so that the first scaling factor can be modified based on the colored, high-pass filtered random noise.
The third aspect of the present invention is an encoder for encoding an input signal having active speech periods and non-active speech periods, wherein the input signal is divided into a higher frequency band and a lower frequency band, and for providing an encoded bit stream containing speech related parameters characteristic of the lower frequency band of the input signal so as to allow a decoder to reconstruct the lower frequency components of synthesized speech based on the speech related parameters and to process an artificial signal based on the speech related parameters for providing the higher frequency components of the synthesized speech, wherein a scaling factor based on the lower frequency components of the synthesized speech is used to scale the processed artificial signal during the non-active speech periods. The encoder comprises:
a filter, responsive to the input signal, for high-pass filtering the input signal in a frequency range corresponding to the higher frequency components of the synthesized speech, and providing a first signal indicative of the high-pass filtered input signal;
means, responsive to the first signal, for providing a further scaling factor based on the high-pass filtered input signal and the lower frequency components of the synthesized speech and for providing a second signal indicative of the further scaling factor; and
a quantization module, responsive to the second signal, for providing an encoded signal indicative of the further scaling factor in the encoded bit stream, so as to allow the decoder to scale the processed artificial signal during the active-speech periods based on the further scaling factor.
The fourth aspect of the present invention is a mobile station, which is arranged to transmit an encoded bit stream to a decoder for providing synthesized speech having higher frequency components and lower frequency components, wherein the encoded bit stream includes speech data indicative of an input signal having active speech periods and non-active periods, and the input signal is divided into a higher frequency band and lower frequency band, wherein the speech data includes speech related parameters characteristic of the lower frequency band of the input signal so as to allow the decoder to provide the lower frequency components of the synthesized speech based on the speech related parameters, and to color an artificial signal based on the speech related parameters and scale the colored artificial signal with a scaling factor based on the lower frequency components of the synthesized speech for providing the high frequency components of the synthesized speech during the non-active speech periods. The mobile station comprises:
a filter, responsive to the input signal, for high-pass filtering the input signal in a frequency range corresponding to the higher frequency components of the synthesized speech, and for providing a further scaling factor based on the high-pass filtered input signal; and
a quantization module, responsive to the scaling factor and the further scaling factor, for providing an encoded signal indicative of the further scaling factor in the encoded bit stream, so as to allow the decoder to scale the colored artificial signal during the active-speech period based on the further scaling factor.
The fifth aspect of the present invention is an element of a telecommunication network, which is arranged to receive an encoded bit stream containing speech data indicative of an input signal from a mobile station for providing synthesized speech having higher frequency components and lower frequency components, wherein the input signal, having active speech periods and non-active speech periods, is divided into a higher frequency band and a lower frequency band, and wherein the speech data includes speech related parameters characteristic of the lower frequency band of the input signal. The element comprises:
a first mechanism, responsive to the speech data, for providing the lower frequency components of the synthesized speech based on the speech related parameters, and for providing a first signal indicative of the lower frequency components of the synthesized speech;
a second mechanism, responsive to the speech data, for synthesis filtering and high-pass filtering an artificial signal, and for providing a second signal indicative of the synthesis-filtered and high-pass filtered artificial signal;
a third mechanism, responsive to the first signal, for providing a first scaling factor based on the lower frequency components of the synthesized speech;
a fourth mechanism, responsive to the encoded bit stream, for providing a second scaling factor based on gain parameters characteristic of the higher frequency band of the input signal, wherein the gain parameters are included in the encoded bit stream; and
a fifth mechanism, responsive to the second signal, for scaling the synthesis and high-pass filtered artificial signal with the first and second scaling factors during non-active speech periods and active speech periods, respectively.
The present invention will become apparent upon reading the description taken in conjunction with FIGS. 2 to 7.