The present invention relates to the field of coding and decoding synthesized speech. More particularly, the present invention relates to such coding and decoding of wideband speech.
wideband signal: Signal that has a sampling rate of Fswide, often having a value of 16 kHz.
lower band signal: Signal that contains frequencies from 0.0 Hz to 0.5Fslower from the corresponding wideband signal and has the sampling rate of Fslower, for example 12 kHz, which is smaller than Fswide.
higher band signal: Signal that contains frequencies from 0.5Fslower to 0.5Fswide from the corresponding wideband signal and has the sampling rate of Fshigher, for example 4 KHz, and usually Fswide=Fslower+Fshigher.
residual: The output signal resulting from an inverse filtering operation.
excitation search: A search of codebooks for an excitation signal or a set of excitation signals that substantially match a given residual. The output of an excitation search process, conducted by an analysis-by-synthesis module, are parameters (codewords) that describe the excitation signal or set of excitation signals that are found to match the residual. The parameters include two code vectors, one from an adaptive codebook, which includes excitations that are adapted for every subframe, and one from a fixed codebook, which includes a fixed set of excitations, i.e. non-adapted.
x(n) A residual signal (innovation), i.e. a target signal for adaptive codebook search.
exc(n) An excitation signal intended to match the residual x(n).
A(z) The inverse filter with unquantized coefficients. The inverse filter removes short-term correlation from a speech signal. It models an inverse frequency response of the vocal tract of a (real or imagined) speaker.
Â(z) The inverse filter with quantified (quantized) coefficients.
H(z)=1/Â(z) A speech synthesis filter with quantified coefficients.
frame: A time interval usually equal to 20 ms (corresponding to 160 samples at an 8 kHz sampling rate). LP analysis is performed frame by frame.
subframe: A time interval usually equal to 5 ms (corresponding to 40 samples at an 8 kHz sampling rate). Excitation searching is performed subframe by subframe.
s(n) An original speech signal (to be encoded).
sxe2x80x2(n) A windowed speech signal.
ŝ(n) A reconstructed (by a decoder) speech signal.
h(n) The impulse response of an LP synthesis filter.
LSP a line spectral pair, i.e. the transformation of LPC parameters. Line spectral pairs are obtained by decomposing the inverse filter transfer function A(z) into a set of two transfer functions, each a polynomial, one having even symmetry and the other having odd symmetry. The line spectral pairs are the roots of these polynomials on a z-unit circle. A set of LSP indices are used as one representation of an LP filter.
Tol Open-loop lag (associated with a pitch period, or a multiple or sub-multiple of a pitch period).
Rw[] Correlation coefficients that are used as a representation of an LP filter.
LP coefficients: Generic term for describing short-term synthesis filter coefficients.
short term synthesis filter: A filter that adds to an excitation signal a short-term correlation that models the impulse response of a vocal tract.
perceptual weighting filter: A filter used in an analysis by synthesis search of codebooks. It exploits the noise-masking properties of formants (vocal tract resonances) by weighting the error less near the formant frequencies.
zero-input response: The output of a synthesis filter due to past inputs but no present input, i.e. due solely to the present state of a filter resulting from past inputs.
Many methods of coding speech today are based upon linear predictive (LP) coding, which extracts perceptually significant features of a speech signal directly from a time waveform rather than from a frequency spectra of the speech signal (as does what is called a channel vocoder or what is called a formant vocoder). In LP coding, a speech waveform is first analyzed (LP analysis) to determine a time-varying model of the vocal tract excitation that caused the speech signal, and also a transfer function. A decoder (in a receiving terminal in case the coded speech signal is telecommunicated) then recreates the original speech using a synthesizer (for performing LP synthesis) that passes the excitation through a parameterized system that models the vocal tract. The parameters of the vocal tract model and the excitation of the model are both periodically updated to adapt to corresponding changes that occurred in the speaker as the speaker produced the speech signal. Between updates, i.e. during any specification interval, however, the excitation and parameters of the system are held constant, and so the process executed by the model is a linear time-invariant process. The overall coding and decoding (distributed) system is called a codec.
In a codec using LP coding, to generate speech, the decoder needs the coder to provide three inputs: a pitch period if the excitation is voiced; a gain factor; and predictor coefficients. (In some codecs, the nature of the excitation, i.e. whether it is voiced or unvoiced, is also provided, but is not normally needed in case of for example an ACELP codec.) LP coding is predictive in that it uses prediction parameters based on the actual input segments of the speech waveform (during a specification interval) to which the parameters are applied, in a process of forward estimation.
Basic LP coding and decoding can be used to digitally communicate speech with a relatively low data rate, but it produces synthetic sounding speech because of its using a very simple system of excitation. A so-called code excited linear predictive (CELP) codec is an enhanced excitation codec. It is based on xe2x80x9cresidualxe2x80x9d encoding. The modeling of the vocal tract is in terms of digital filters whose parameters are encoded in the compressed speech. These filters are driven, i.e. xe2x80x9cexcited,xe2x80x9d by a signal that represents the vibration of the original speaker""s vocal cords. A residual of an audio speech signal is the (original) audio speech signal less the digitally filtered audio speech signal. A CELP codec encodes the residual and uses it as a basis for excitation, in what is known as xe2x80x9cresidual pulse excitation.xe2x80x9d However, instead of encoding the residual waveforms on a sample-by-sample basis, CELP uses a waveform template selected from a predetermined set of waveform templates in order to represent a block of residual samples. A codeword is determined by the coder and provided to the decoder, which then uses the codeword to select a residual sequence to represent the original residual samples.
FIG. 1A shows elements of a transmitter/encoder system and elements of a receiver/decoder system, the overall system serving as a codec, and based on an LP codec, which could be a CELP-type codec. The transmitter accepts a sampled speech signal s(n) and provides it to an analyzer that determines LP parameters (inverse filter and synthesis filter) for a codec. s(n) is the inverse filtered signal used to determine the residual x(n). The excitation search module encodes for transmission both the residual x(n), as a quantified or quantized error xq(n), and the synthesizer parameters and applies them to a communication channel leading to the receiver. On the receiver (decoder system) side, a decoder module extracts the synthesizer parameters from the transmitted signal and provides them to a synthesizer. The decoder module also determines the quantified error xq(n) from the transmitted signal. The output from the synthesizer is combined with the quantified error xq(n) to produce a quantified value sq(n) representing the original speech signal s(n).
A transmitter and receiver using a CELP-type codec functions in a similar way, except that the error xq(n) is transmitted as an index into a codebook representing various waveforms suitable for approximating the errors (residuals) x(n). In the embodiment of a codec shown in FIG. 1A, in case of a CELP-type codec, the synthesis filter 1/Ã(z) can be expressed as:             1                        A          ~                ⁡                  (          z          )                      =          1      /              [                  1          +                                    a              1                        ⁢                          z                              -                1                                              +                                    a              2                        ⁢                          z                              -                2                                              +                                    a              3                        ⁢                          z                              -                3                                              +          …          +                                    a              10                        ⁢                          z                              -                10                                                    ]              ,
where the ai are the unquantized linear prediction parameters.
According to the Nyquist theorem, a speech signal with a sampling rate Fs can represent a frequency band from 0 to 0.5Fs. Nowadays, most speech codecs (coders-decoders) use a sampling rate of 8 kHz. If the sampling rate is increased from 8 kHz, naturalness of speech improves because higher frequencies can be represented. Today, the sampling rate of the speech signal is usually 8 kHz, but mobile telephone stations are being developed that will use a sampling rate of 16 kHz. According to the Nyquist theorem, a sampling rate of 16 kHz can represent speech in the frequency band 0-8 kHz. The sampled speech is then coded for communication by a transmitter, and then decoded by a receiver. Speech coding of speech sampled using a sampling rate of 16 kHz is called wideband speech coding.
When the sampling rate of speech is increased, coding complexity also increases. With some algorithms, as the sampling rate increases, coding complexity can even increase exponentially. Therefore, coding complexity is often a limiting factor in determining an algorithm for wideband speech coding. This is especially true, for example, with mobile telephone stations where power consumption, available processing power, and memory requirements critically affect the applicability of algorithms.
Sometimes in speech coding, a procedure known as decimation is used to reduce the complexity of the coding. Decimation reduces the original sampling rate for a sequence to a lower rate. It is the opposite of a procedure known as interpolation. The decimation process filters the input data with a low-pass filter and then resamples the resulting smoothed signal at a lower rate. Interpolation increases the original sampling rate for a sequence to a higher rate.
Interpolation inserts zeros into the original sequence and then applies a special low-pass filter to replace the zero values with interpolated values. The number of samples is thus increased.
A prior-art solution is to encode a wideband speech signal without decimation, but the complexity that results is too great for many applications. This approach is called full-band coding.
Another prior-art wideband speech codec limits complexity by using sub-band coding. In such a sub-band coding approach, before encoding a wideband signal, it is divided into two signals, a lower band signal and a higher band signal. Both signals are then coded, independently of the other. (FIG. 4 shows a simplified block diagram of an encoder according to such a prior-art solution.) In the decoder, in a synthesizing process, the two signals are recombined. Such an approach decreases coding complexity in those parts of the coding algorithm (such as the LP coding algorithm) where complexity increases exponentially as a function of the sampling rate. However, in the parts where the complexity increases linearly, such an approach does not decrease the complexity.
The problem with the prior art sub-band coding in which both bands are coded is that the energy of a speech signal is usually concentrated in either the lower band or the higher band. Thus, in coding both bands, using for example a linear predictive (LP) filter to yield quantizations of the signal in each band, the processing by one or the other of the two filters is usually of little value.
The coding complexity of the above sub-band coding prior-art solution can be further decreased by ignoring the analysis of the higher band in the encoder (blocks 42-46) and by replacing it with white noise in the decoder as shown in FIG. 5. The analysis of the higher band can be ignored because human hearing is not sensitive for the phase response of the high frequency band but only for the amplitude response. The other reason is that only noise-like unvoiced phonemes contain energy in the higher band, whereas the voiced signal, for which phase is important, does not have significant energy in the higher band. In this approach, as well as in the above sub-band coding that does not ignore analysis of the higher band in the encoder, the analysis filter models the lower band independently of the upper band. Because of this drastic simplification of the speech encoding and decoding problem, there is for some applications an unacceptable loss of fidelity in speech synthesis.
What is needed is a method of wideband speech coding that reduces-complexity compared to the complexity in coding the full wideband speech signal, regardless of the particular coding algorithm used, and yet offers substantially the same superior fidelity in representing the speech signal.
Accordingly, the present invention provides a system for encoding an nth frame in a succession of frames of a wideband (WB) speech signal and providing the encoded speech to a communication channel, as well as a corresponding decoder, a corresponding method, a corresponding mobile telephone, and a corresponding telecommunications system. The system for encoding the WB speech signal includes: a WB linear predictive (LP) analysis module responsive to the nth frame of the wideband speech signal, for providing LP analysis filter characteristics; a WB LP analysis filter also responsive to the nth frame of the WB speech signal, for providing a filtered WB speech input; a band-splitting module, responsive to the filtered WB speech input for the nth frame, for splitting the filtered WB speech input into k bands, the band-splitting module for providing a lower band (LB) target signal x(n); an excitation search module responsive to the LB target signal x(n), for providing an LB excitation exc(n); a band-combining module, responsive to the LB excitation exc(n), for providing a WB excitation excw(n); and a WB LP synthesis filter, responsive to the LP analysis filter characteristics and to the WB excitation excw(n), for providing WB synthesized speech. corresponding telecommunications system. The system for encoding the WB speech signal includes: a WB linear predictive (LP) analysis module (11) responsive to the nth frame of the wideband speech signal, for providing LP analysis filter characteristics; a WB LP analysis filter (12a), also responsive to the nth frame of the WB speech signal, for providing a filtered WB speech input; a band-splitting module (14), responsive to the filtered WB speech input for the nth frame, for splitting the filtered WB speech input into k bands, the band-splitting module for providing a lower band (LB) target signal x(n); an excitation search module (16), responsive to the LB target signal x(n), for providing an LB excitation exc(n); a band-combining module (17), responsive to the LB excitation exc(n), for providing a WB excitation excw(n); and a WB LP synthesis filter (18), responsive to the LP analysis filter characteristics and to the WB excitation excw(n), for providing WB synthesized speech.
In a further aspect of the system of encoding a WB speech signal, the band-splitting module further provides a higher-band (HB) target signal xh(n), and the system of encoding also includes: an excitation search module, responsive to the HB target signal xh(n), for providing an HB excitation exch(n); and, in addition, the band-combining module is further responsive to the HB excitation exch(n).
In a still further aspect of the encoding system, the band-splitting module determines the LB target signal x(n) by decimating the WB target signal xw(n), and the band-combining module includes a module for interpolating the LB excitation exc(n) to provide the WB excitation excw(n).
In one embodiment of this still further aspect of the encoding system, in decimating the WB target signal xw(n), a decimating delay is introduced that is compensated for by filtering a WB impulse response hw(n) from the end to the beginning of the frame using a decimating low-pass filter that limits the delay of the decimating to one sample per frame, and in interpolating the LB excitation exc(n), an interpolating delay is introduced that is compensated for by using an interpolating low-pass filter that limits the delay of the interpolating to one sample per frame.
The present invention is of use in particular in code excited linear predictive (CELP) type Analysis-by-Synthesis (A-b-S) coding of wideband speech. It can also be used in any other coding methodology that uses linear predictive (LP) filtering as a compression method.
Thus, in the present invention, LP analysis and LP synthesis of the full wideband speech signal is performed. In the excitation search part of the coder (the searching being for a codeword in case of CELP), the signal is divided into a lower band and a higher band. The lower band is searched using a decimated target signal, obtained by decimating the input speech signal after it is filtered through a wideband LP analysis filter as part of the LP analysis. In some embodiments, white noise is used for the higher band excitation because human hearing is not sensitive to the phase of the high frequency band; it is sensitive only to amplitude response. Another reason for using only white noise for the higher band excitation is that only noise-like unvoiced phonemes contain energy in the higher band, whereas the voiced signal, for which phase is important, does not have much energy in the higher band. In the decoder, the lower band excitation is first interpolated, and then the two excitations (the lower band excitation and either white noise or the higher band excitation) are added together and filtered through a wideband LP synthesis filter as part of the LP synthesis process. Such a method of coding keeps complexity low because of searching only the lower band for excitation, but keeps fidelity high because the speech signal is still reproduced over the whole wide frequency band.