1. Field of the Invention
The present invention relates to a method and device for recovering a high frequency content of a wideband signal previously down-sampled, and for injecting this high frequency content in an over-sampled synthesized version of the down-sampled wideband signal to produce a full-spectrum synthesized wideband signal.
2. Brief Description of the Prior Art
The demand for efficient digital wideband speech/audio encoding techniques with a good subjective quality/bit rate trade-off is increasing for numerous applications such as audio/video teleconferencing, multimedia, and wireless applications, as well as Internet and packet network applications. Until recently, telephone bandwidths filtered in the range 200–3400 Hz were mainly used in speech coding applications. However, there is an increasing demand for wideband speech applications in order to increase the intelligibility and naturalness of the speech signals. A bandwidth in the range 50–7000 Hz was found sufficient for delivering a face-to-face speech quality. For audio signals, this range gives an acceptable audio quality, but still lower than the CD quality which operates on the range 20–20000 Hz.
A speech encoder converts a speech signal into a digital bitstream which is transmitted over a communication channel (or stored in a storage medium). The speech signal is digitized (sampled and quantized with usually 16-bits per sample) and the speech encoder has the role of representing these digital samples with a smaller number of bits while maintaining a good subjective speech quality. The speech decoder or synthesizer operates on the transmitted or stored bit stream and converts it back to a sound signal.
One of the best prior art techniques capable of achieving a good quality/bit rate trade-off is the so-called Code Excited Linear Prediction (CELP) technique. According to this technique, the sampled speech signal is processed in successive blocks of L samples usually called frames where L is some predetermined number (corresponding to 10–30 ms of speech). In CELP, a linear prediction (LP) synthesis filter is computed and transmitted every frame. The L-sample frame is then divided into smaller blocks called subframes of size of N samples, where L=kN and k is the number of subframes in a frame (N usually corresponds to 4–10 ms of speech). An excitation signal is determined in each subframe, which usually consists of two components: one from the past excitation (also called pitch contribution or adaptive codebook) and the other from an innovative codebook (also called fixed codebook). This excitation signal is transmitted and used at the decoder as the input of the LP synthesis filter in order to obtain the synthesized speech.
An innovative codebook in the CELP context, is an indexed set of N-sample-long sequences which will be referred to as N-dimensional codevectors. Each codebook sequence is indexed by an integer k ranging from 1 to M where M represents the size of the codebook often expressed as a number of bits b, where M=2b.
To synthesize speech according to the CELP technique, each block of N samples is synthesized by filtering an appropriate codevector from a codebook through time varying filters modeling the spectral characteristics of the speech signal. At the encoder end, the synthesis output is computed for all, or a subset, of the codevectors from the codebook (codebook search). The retained codevector is the one producing the synthesis output closest to the original speech signal according to a perceptually weighted distortion measure. This perceptual weighting is performed using a so-called perceptual weighting filter, which is usually derived from the LP synthesis filter.
The CELP model has been very successful in encoding telephone band sound signals, and several CELP-based standards exist in a wide range of applications, especially in digital cellular applications. In the telephone band, the sound signal is band-limited to 200–3400 Hz and sampled at 8000 samples/sec. In wideband speech/audio applications, the sound signal is band-limited to 50–7000 Hz and sampled at 16000 samples/sec.
Some difficulties arise when applying the telephone-band optimized CELP model to wideband signals, and additional features need to be added to the model in order to obtain high quality wideband signals. Wideband signals exhibit a much wider dynamic range compared to telephone-band signals, which results in precision problems when a fixed-point implementation of the algorithm is required (which is essential in wireless applications). Further, the CELP model will often spend most of its encoding bits on the low-frequency region, which usually has higher energy contents, resulting in a low-pass output signal. To overcome this problem, the perceptual weighting filter has to be modified in order to suit wideband signals, and pre-emphasis techniques which boost the high frequency regions become important to reduce the dynamic range, yielding a simpler fixed-point implementation, and to ensure a better encoding of the higher frequency contents of the signal. Further, the pitch contents in the spectrum of voiced segments in wideband signals do not extend over the whole spectrum range, and the amount of voicing shows more variation compared to narrow-band signals. Thus, it is important to improve the closed-loop pitch analysis to better accommodate the variations in the voicing level.
Some difficulties arise when applying the telephone-band optimized CELP model to wideband signals, and additional features need to be added to the model in order to obtain high quality wideband signals.
As an example, in order to improve the coding efficiency and reduce the algorithmic complexity of the wideband encoding algorithm, the input wideband signal is down-sampled from 16 kHz to around 12.8 kHz. This reduces the number of samples in a frame, the processing time and the signal bandwidth below 7000 Hz to thereby enable reduction in bit rate down to 12 kbit/s while keeping very high quality decoded sound signal. The complexity is also reduced due to the lower number of samples per speech frame. At the decoder, the high frequency contents of the signal needs to be reintroduced to remove the low pass filtering effect from the decoded synthesized signal and retrieve the natural sounding quality of wideband signals. For that purpose, an efficient technique for recovering the high frequency content of the wideband signal is needed to thereby produce a full-spectrum wideband synthesized signal, while maintaining a quality close to the original signal.