The present invention generally relates to the field of coding and decoding synthesized speech and, more particularly, to such coding and decoding of wideband speech.
Many methods of coding speech today are based upon linear predictive (LP) coding, which extracts perceptually significant features of a speech signal directly from a time waveform rather than from a frequency spectrum of the speech signal (as does a channel vocoder or a formant vocoder). In LP coding, a speech waveform is first analyzed (LP analysis) to determine a time-varying model of the vocal tract excitation that caused the speech signal, and also a transfer function. A decoder (in a receiving terminal, in case the coded speech signal is telecommunicated) then recreates the original speech using a synthesizer (for performing LP synthesis) that passes the excitation through a parameterized system that models the vocal tract. The parameters of the vocal tract model and the excitation of the model are both periodically updated to adapt to corresponding changes that occurred in the speaker as the speaker produced the speech signal. Between updates, i.e. during any specification interval, the excitation and parameters of the system are held constant, and so the process executed by the model is a linear time-invariant process. The overall coding and decoding (distributed) system is called a codec.
In a codec using LP coding to generate speech, the decoder needs the coder to provide three inputs: a pitch period if the excitation is voiced, a gain factor, and predictor coefficients. (In some codecs, the nature of the excitation, i.e. whether it is voiced or unvoiced, is also provided, but it is not normally needed in the case of an Algebraic Code Excited Linear Predictive (ACELP) codec, for example.) LP coding is predictive in that it uses prediction parameters based on the actual input segments of the speech waveform (during a specification interval) to which the parameters are applied, in a process of forward estimation.
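The forward-estimation idea described above can be illustrated with a minimal sketch (the function names and the moving-sum formulation are illustrative assumptions, not taken from any particular codec): each sample is predicted as a weighted sum of the previous p samples, the coder transmits the prediction parameters and the residual, and the synthesizer inverts the process.

```python
# Minimal sketch of short-term linear prediction (illustrative only).
def lp_predict(samples, coeffs):
    """Return predicted samples and the residual for given LP coefficients."""
    p = len(coeffs)
    predicted, residual = [], []
    for n, x in enumerate(samples):
        # Weighted sum of up to p past samples (zeros before signal start).
        pred = sum(coeffs[k] * (samples[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                   for k in range(p))
        predicted.append(pred)
        residual.append(x - pred)
    return predicted, residual

def lp_synthesize(excitation, coeffs):
    """Invert the analysis: excitation plus weighted sum of past output."""
    out = []
    for n, e in enumerate(excitation):
        pred = sum(coeffs[k] * (out[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                   for k in range(len(coeffs)))
        out.append(e + pred)
    return out
```

Feeding the full residual back into the synthesis filter reconstructs the original samples exactly; real codecs achieve compression by quantizing or codebook-encoding the residual instead of transmitting it sample by sample.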
Basic LP coding and decoding can be used to digitally communicate speech at a relatively low data rate, but it produces synthetic-sounding speech because it uses a very simple excitation system. A so-called Code Excited Linear Predictive (CELP) codec is an enhanced excitation codec. It is based on “residual” encoding. The modeling of the vocal tract is in terms of digital filters whose parameters are encoded in the compressed speech. These filters are driven, i.e. “excited,” by a signal that represents the vibration of the original speaker's vocal cords. A residual of an audio speech signal is the (original) audio speech signal less the digitally filtered audio speech signal. A CELP codec encodes the residual and uses it as a basis for excitation, in what is known as “residual pulse excitation.” However, instead of encoding the residual waveforms on a sample-by-sample basis, CELP uses a waveform template selected from a predetermined set of waveform templates to represent a block of residual samples. A codeword is determined by the coder and provided to the decoder, which then uses the codeword to select a residual sequence to represent the original residual samples.
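The template selection described above amounts to a nearest-neighbor search over the codebook. A minimal sketch (the function name and the plain squared-error criterion are assumptions; real CELP searches use perceptually weighted error through the synthesis filter):

```python
# Illustrative codebook search: choose the template closest to the
# residual block and transmit only its index.
def best_codeword(residual_block, codebook):
    def sq_err(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)),
               key=lambda i: sq_err(residual_block, codebook[i]))
```

The decoder holds an identical codebook, so the transmitted index alone selects the residual sequence used for excitation.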
FIG. 1 shows elements of a transmitter/encoder system and elements of a receiver/decoder system. The overall system serves as an LP codec, and could be a CELP-type codec. The transmitter accepts a sampled speech signal s(n) and provides it to an analyzer that determines LP parameters (inverse filter and synthesis filter) for a codec. sq(n) is the inverse filtered signal used to determine the residual x(n). The excitation search module encodes for transmission both the residual x(n), as a quantified or quantized error xq(n), and the synthesizer parameters and applies them to a communication channel leading to the receiver. On the receiver (decoder system) side, a decoder module extracts the synthesizer parameters from the transmitted signal and provides them to a synthesizer. The decoder module also determines the quantified error xq(n) from the transmitted signal. The output from the synthesizer is combined with the quantified error xq(n) to produce a quantified value sq(n) representing the original speech signal s(n).
A transmitter and receiver using a CELP-type codec function in a similar way, except that the error xq(n) is transmitted as an index into a codebook representing various waveforms suitable for approximating the errors (residuals) x(n).
According to the Nyquist theorem, a signal sampled at a rate Fs can represent a frequency band from 0 to 0.5 Fs. Today, most speech codecs (coders-decoders) use a sampling rate of 8 kHz, but mobile telephone stations are being developed that will use a sampling rate of 16 kHz. Increasing the sampling rate above 8 kHz improves the naturalness of speech because higher frequencies can be represented; by the Nyquist theorem, a sampling rate of 16 kHz can represent speech in the frequency band 0-8 kHz. The sampled speech is then coded for communication by a transmitter, and decoded by a receiver. Speech coding of speech sampled at 16 kHz is called wideband speech coding.
When the sampling rate of speech is increased, coding complexity also increases. With some algorithms, as the sampling rate increases, coding complexity can even increase exponentially. Therefore, coding complexity is often a limiting factor in determining an algorithm for wideband speech coding. This is especially true, for example, with mobile telephone stations where power consumption, available processing power, and memory requirements critically affect the applicability of algorithms.
Sometimes in speech coding, a procedure known as decimation is used to reduce the complexity of the coding. Decimation reduces the original sampling rate for a sequence to a lower rate. It is the opposite of a procedure known as interpolation. The decimation process filters the input data with a low-pass filter and then re-samples the resulting smoothed signal at a lower rate. Interpolation increases the original sampling rate for a sequence to a higher rate. Interpolation inserts zeros into the original sequence and then applies a special low-pass filter to replace the zero values with interpolated values. The number of samples is thus increased.
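A minimal sketch of 2:1 decimation and interpolation, under stated assumptions: the 3-tap moving-average low-pass filter and the linear-interpolation filter below are deliberately simple stand-ins for the properly designed filters a real codec would use.

```python
# Illustrative 2:1 decimation: low-pass filter, then keep every second sample.
def decimate_by_2(x):
    smoothed = [(x[max(n - 1, 0)] + x[n] + x[min(n + 1, len(x) - 1)]) / 3.0
                for n in range(len(x))]
    return smoothed[::2]

# Illustrative 2:1 interpolation: insert zeros, then low-pass filter to
# replace the zeros with interpolated values.
def interpolate_by_2(x):
    zero_stuffed = []
    for s in x:
        zero_stuffed.extend([s, 0.0])
    out = []
    for n in range(len(zero_stuffed)):
        left = zero_stuffed[n - 1] if n - 1 >= 0 else 0.0
        right = zero_stuffed[n + 1] if n + 1 < len(zero_stuffed) else 0.0
        # Each inserted zero becomes the average of its two neighbors.
        out.append(zero_stuffed[n] + 0.5 * (left + right))
    return out
```

Decimation halves the number of samples; interpolation doubles it, as the text describes.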
Another prior-art wideband speech codec limits complexity by using sub-band coding. In such a sub-band coding approach, before a wideband signal is encoded, it is divided into two signals, a lower band signal and a higher band signal. Each signal is then coded independently of the other. In the decoder, in a synthesizing process, the two signals are recombined. Such an approach decreases coding complexity in those parts of the coding algorithm (such as the search for the innovative codebook) where complexity increases exponentially as a function of the sampling rate. However, in the parts where the complexity increases linearly, such an approach does not decrease the complexity.
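The split-and-recombine structure can be sketched with a Haar-style two-band analysis (an illustrative assumption; practical sub-band codecs use quadrature-mirror filter banks): the lower band carries local averages, the higher band carries local differences, and the decoder recombines them exactly.

```python
# Minimal two-band split/recombine sketch (Haar-style, illustrative only).
def split_bands(x):
    low = [(x[2 * n] + x[2 * n + 1]) / 2.0 for n in range(len(x) // 2)]
    high = [(x[2 * n] - x[2 * n + 1]) / 2.0 for n in range(len(x) // 2)]
    return low, high

def recombine(low, high):
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out
```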
The coding complexity of the above sub-band coding prior-art solution can be further decreased by omitting the analysis of the higher band in the encoder and by replacing it with filtered white noise, or filtered pseudo-random noise, in the decoder, as shown in FIG. 2. The analysis of the higher band can be omitted because human hearing is not sensitive to the phase response of the high frequency band but only to the amplitude response. The other reason is that only noise-like unvoiced phonemes contain energy in the higher band, whereas the voiced signal, for which phase is important, does not have significant energy in the higher band. In this approach, the spectrum of the higher band is estimated with an LP filter that has been generated from the lower band LP filter. Thus, no knowledge of the higher frequency band contents is sent over the transmission channel, and the generation of higher band LP synthesis filtering parameters is based on the lower frequency band. White noise, an artificial signal, is used as a source for the higher band filtering, with the energy of the noise being estimated from the characteristics of the lower band signal. Because both the encoder and the decoder know the excitation, and the Long Term Predictor (LTP) and fixed codebook gains for the lower band, it is possible to estimate the energy scaling factor and the LP synthesis filtering parameters for the higher band from these parameters. In the prior art approach, the energy of wideband white noise is equalized to the energy of the lower band excitation. Subsequently, the tilt of the lower band synthesis signal is computed. In the computation of the tilt factor, the lowest frequency band is cut off, and the equalized wideband white noise signal is multiplied by the tilt factor. The wideband noise is then filtered through the LP filter. Finally, the lower band is cut off from the signal.
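The prior-art higher-band generation steps above can be sketched as follows. This is a hedged illustration, not the codec's actual implementation: the function name and signature are assumptions, the band cut-off filtering steps are omitted, and a trivial all-pole filter stands in for the estimated higher-band LP synthesis filter.

```python
import math
import random

# Illustrative sketch: equalize white noise to the lower-band excitation
# energy, apply the spectral tilt factor, then LP-synthesis filter.
def generate_higher_band(lower_excitation, tilt, lp_coeffs, n_samples, seed=0):
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    # Equalize noise energy to the lower-band excitation energy.
    exc_energy = sum(e * e for e in lower_excitation)
    noise_energy = sum(s * s for s in noise) or 1.0
    scale = math.sqrt(exc_energy / noise_energy)
    # Multiply the equalized noise by the tilt factor.
    scaled = [tilt * scale * s for s in noise]
    # LP synthesis filtering (all-pole) with the estimated coefficients.
    out = []
    for n, e in enumerate(scaled):
        pred = sum(lp_coeffs[k] * (out[n - 1 - k] if n - 1 - k >= 0 else 0.0)
                   for k in range(len(lp_coeffs)))
        out.append(e + pred)
    return out
```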
As such, the scaling of higher band energy is based on the higher band energy scaling factor estimated from an energy scaler estimator, and the higher band LP synthesis filtering is based on the higher band LP synthesis filtering parameters provided by an LP filtering estimator, regardless of whether the input signal is speech or background noise. While this approach is suitable for processing signals containing only speech, it does not function properly when the input signal contains background noise, especially during non-speech periods.
What is needed is a method of wideband speech coding of input signals containing background noise, wherein the method reduces complexity compared to the complexity in coding the full wideband speech signal, regardless of the particular coding algorithm used, and yet offers substantially the same superior fidelity in representing the speech signal.
The present invention takes advantage of the voice activity information to distinguish speech and non-speech periods of an input signal so that the influence of background noise in the input signal is taken into account when estimating the energy scaling factor and the Linear Predictive (LP) synthesis filtering parameters for the higher frequency band of the input signal.
Accordingly, the first aspect of the present invention is a method of speech coding for encoding and decoding an input signal having speech periods and non-speech periods and providing synthesized speech having higher frequency components and lower frequency components, wherein the input signal is divided into a higher frequency band and a lower frequency band in encoding and decoding processes, and wherein speech related parameters characteristic of the lower frequency band are used to process an artificial signal for providing the higher frequency components of the synthesized speech, and wherein the input signal includes a first signal in the speech periods and a second signal in the non-speech periods, said method comprising the steps of:
scaling and synthesis filtering the artificial signal in the speech periods based on speech related parameters representative of the first signal; and
scaling and synthesis filtering the artificial signal in the non-speech periods based on speech related parameters representative of the second signal, wherein the first signal includes a speech signal and the second signal includes a noise signal.
Preferably, the scaling and synthesis filtering of the artificial signal in the speech periods is also based on a spectral tilt factor computed from the lower frequency components of the synthesized speech.
Preferably, when the input signal includes a background noise, the scaling and synthesis filtering of the artificial signal in the speech periods is further based on a correction factor characteristic of the background noise.
Preferably, the scaling and synthesis filtering of the artificial signal in the non-speech periods is further based on the correction factor characteristic of the background noise.
Preferably, voice activity information is used to indicate the first and second signal periods.
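The idea behind the first aspect can be sketched as follows (a minimal illustration with hypothetical names; the claimed method specifies no particular data structures or gain formula): voice activity information selects between speech-derived and noise-derived parameters when scaling the artificial signal for the higher band.

```python
# Illustrative only: a VAD flag selects which parameter set scales the
# artificial (noise) signal for the higher frequency band.
def scale_artificial_signal(noise, vad_is_speech, speech_params, noise_params):
    params = speech_params if vad_is_speech else noise_params
    # Energy scaling factor, optionally adjusted by a weighting
    # correction factor characteristic of the background noise.
    gain = params["energy_scale"] * params.get("correction", 1.0)
    return [gain * s for s in noise]
```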
The second aspect of the present invention is a speech signal transmitter and receiver system for encoding and decoding an input signal having speech periods and non-speech periods and providing synthesized speech having higher frequency components and lower frequency components, wherein the input signal is divided into a higher frequency band and a lower frequency band in the encoding and decoding processes, and wherein speech related parameters characteristic of the lower frequency band are used to process an artificial signal for providing the higher frequency components of the synthesized speech, and wherein the input signal includes a first signal in the speech periods and a second signal in the non-speech periods. The system comprises:
a decoder for receiving the encoded input signal and for providing the speech related parameters;
an energy scale estimator, responsive to the speech related parameters, for providing an energy scaling factor for scaling the artificial signal;
a linear predictive filtering estimator, responsive to the speech related parameters, for synthesis filtering the artificial signal; and
a mechanism for providing information regarding the speech and non-speech periods so that the energy scaling factors for the speech periods and the non-speech periods are estimated based on the first and second signals, respectively.
Preferably, the information providing mechanism is capable of providing a first weighting correction factor for the speech periods and a different second weighting correction factor for the non-speech periods so as to allow the energy scale estimator to provide the energy scaling factor based on the first and second weighting correction factors.
Preferably, the synthesis filtering of the artificial signal in the speech periods and the non-speech periods is also based on the first weighting correction factor and the second weighting correction factor, respectively.
Preferably, the speech related parameters include linear predictive coding coefficients representative of the first signal.
The third aspect of the present invention is a decoder for synthesizing speech having higher frequency components and lower frequency components from encoded data indicative of an input signal having speech periods and non-speech periods, wherein the input signal is divided into a higher frequency band and a lower frequency band in the encoding and decoding processes, and the encoding of the input signal is based on the lower frequency band, and wherein the encoded data includes speech parameters characteristic of the lower frequency band for processing an artificial signal and providing the higher frequency components of the synthesized speech. The decoder comprises:
an energy scale estimator, responsive to the speech parameters, for providing a first energy scaling factor for scaling the artificial signal in the speech periods and a second energy scaling factor for scaling the artificial signal in the non-speech periods; and
a synthesis filtering estimator, for providing a plurality of filtering parameters for synthesis filtering the artificial signal.
Preferably, the decoder also comprises a mechanism for monitoring the speech periods and the non-speech periods so as to allow the energy scale estimator to change the energy scaling factors accordingly.
The fourth aspect of the present invention is a mobile station, which is arranged to receive an encoded bit stream containing speech data indicative of an input signal, wherein the input signal is divided into a higher frequency band and a lower frequency band, and the input signal includes a first signal in speech periods and a second signal in non-speech periods, and wherein the speech data includes speech related parameters obtained from the lower frequency band. The mobile station comprises:
a first means for decoding the lower frequency band using the speech related parameters;
a second means for decoding the higher frequency band from an artificial signal;
a third means, responsive to the speech data, for providing information regarding the speech and non-speech periods;
an energy scale estimator, responsive to the speech period information, for providing a first energy scaling factor based on the first signal and a second energy scaling factor based on the second signal for scaling the artificial signal; and
a predictive filtering estimator, responsive to the speech related parameters and the speech period information, for providing a first plurality of linear predictive filtering parameters based on the first signal and a second plurality of linear predictive filtering parameters based on the second signal, for filtering the artificial signal.
The fifth aspect of the present invention is an element of a telecommunication network, which is arranged to receive an encoded bit stream containing speech data from a mobile station having means for encoding an input signal, wherein the input signal is divided into a higher frequency band and a lower frequency band and the input signal includes a first signal in speech periods and a second signal in non-speech periods, and wherein the speech data includes speech related parameters obtained from the lower frequency band. The element comprises:
a first means for decoding the lower frequency band using the speech related parameters;
a second means for decoding the higher frequency band from an artificial signal;
a third means, responsive to the speech data, for providing speech period information regarding the speech and non-speech periods;
an energy scale estimator, responsive to the speech period information, for providing a first energy scaling factor based on the first signal and a second energy scaling factor based on the second signal for scaling the artificial signal; and
a predictive filtering estimator, responsive to the speech related parameters and the speech period information, for providing a first plurality of linear predictive filtering parameters based on the first signal and a second plurality of linear predictive filtering parameters based on the second signal, for filtering the artificial signal.
The present invention will become apparent upon reading the description taken in conjunction with FIGS. 3-6.