1. Field of the Invention
The present invention relates to enhancing the crispness and clarity of narrowband speech and more specifically to an approach of extending the bandwidth of narrowband speech.
2. Discussion of Related Art
The use of electronic communication systems is widespread in most societies. One of the most common forms of communication between individuals is telephone communication, which may occur in a variety of ways. Examples of communication systems include telephones, cellular phones, Internet telephony and radio communication systems. Several of these examples, namely Internet telephony and cellular phones, provide wideband communication, but when the systems transmit voice they usually transmit at low bit rates because of limited bandwidth.
Limits on the capacity of existing telecommunications infrastructure have prompted huge investments in its expansion and in the adoption of newer, wider-bandwidth technologies. Demand for more mobile and convenient forms of communication is also reflected in the growing development and expansion of cellular and satellite telephones, both of which have capacity constraints. To address these constraints, bandwidth extension research is ongoing, with the aim of accommodating more users over such limited-capacity media by compressing speech before transmitting it across a network.
Wideband speech is typically defined as speech in the 7 to 8 kHz bandwidth, as opposed to narrowband speech, which is typically encountered in telephony with a bandwidth of less than 4 kHz. The advantage in using wideband speech is that it sounds more natural and offers higher intelligibility. Compared with normal speech, bandlimited speech has a muffled quality and reduced intelligibility, which is particularly noticeable in sounds such as /s/, /f/ and /sh/. In digital connections, both narrowband speech and wideband speech are coded to facilitate transmission of the speech signal. Coding a signal of a higher bandwidth requires an increase in the bit rate. Therefore, much research still focuses on reconstructing high-quality speech at low bit rates just for 4 kHz narrowband applications.
In order to improve the quality of narrowband speech without increasing the transmission bit rate, wideband enhancement involves synthesizing a highband signal from the narrowband speech and combining the highband signal with the narrowband signal to produce a higher quality wideband speech signal. The synthesized highband signal is based entirely on information contained in the narrowband speech. Thus, wideband enhancement can potentially increase the quality and intelligibility of the signal without increasing the coding bit rate. Wideband enhancement schemes typically include various components such as highband excitation synthesis and highband spectral envelope estimation. Recent improvements to these methods are known, such as an excitation synthesis method that combines sinusoidal transform coding-based excitation with random excitation, and new techniques for highband spectral envelope estimation. Other improvements related to bandwidth extension include very low bit rate wideband speech coding, in which the quality of the wideband enhancement scheme is improved further by allocating a very small bitstream for coding the highband envelope and the gain. These recent improvements are explained in further detail in the PhD Thesis “Wideband Extension of Narrowband Speech for Enhancement and Coding”, by Julien Epps, at the School of Electrical Engineering and Telecommunications, the University of New South Wales, and found on the Internet at: http://www.library.unsw.edu.au/~thesis/adt-NUN/public/adt-NUN20001018.155146/. Published papers related to the Thesis are J. Epps and W. H. Holmes, Speech Enhancement using STC-Based Bandwidth Extension, in Proc. Intl. Conf. Spoken Language Processing, ICSLP '98, 1998; and J. Epps and W. H. Holmes, A New Technique for Wideband Enhancement of Coded Narrowband Speech, in Proc. IEEE Speech Coding Workshop, SCW '99, 1999. The contents of this Thesis and published papers are incorporated herein for background material.
A direct way to obtain wideband speech at the receiving end is to either transmit it in analog form or use a wideband speech coder. However, existing analog systems, like the plain old telephone system (POTS), are not suited for wideband analog signal transmission, and wideband coding means relatively high bit rates, typically in the range of 16 to 32 kbps, as compared to narrowband speech coding at 1.2 to 8 kbps. In 1994, several publications showed that it is possible to extend the bandwidth of narrowband speech directly from the input narrowband speech. In ensuing works, bandwidth extension was applied either to the original or to the decoded narrowband speech, and a variety of techniques, discussed herein, were proposed.
Bandwidth extension methods rely on the apparent dependence of the highband signal on the given narrowband signal. These methods further utilize the reduced sensitivity of the human auditory system to spectral distortions in the upper or high band region, as compared to the lower band where on average most of the signal power exists.
Most known bandwidth extension methods are structured according to one of the two general schemes shown in FIGS. 1A and 1B. The two structures shown in these figures leave the original signal unaltered, except for interpolating it to the higher sampling frequency, for example, 16 kHz. This way, any processing artifacts due to re-synthesis of the lower-band signal are avoided. The main task is therefore the generation of the highband signal. However, because input speech that passes through the telephone channel is limited to the frequency band of 300–3400 Hz, there could also be interest in extending it down to the low band of 0 to 300 Hz. The two schemes shown in FIGS. 1A and 1B differ in their complexity: in FIG. 1B signal interpolation is done only once, whereas in FIG. 1A an additional interpolation operation is typically needed within the highband signal generation block.
In general, when used herein, “S” denotes signals, fs denotes sampling frequencies, “nb” denotes narrowband, “wb” denotes wideband, “hb” denotes highband, and “˜” stands for “interpolated narrowband.”
As shown in FIG. 1A, the system 10 includes a highband generation module 12 and a 1:2 interpolation module 14 that receive in parallel the signal Snb as input narrowband speech. The signal S̃nb is produced by interpolating the input signal by a factor of two, that is, by inserting a sample between each pair of narrowband samples and determining its amplitude from the amplitudes of the surrounding narrowband samples via lowpass filtering. A weakness of the interpolated speech, however, is that it does not contain any high frequencies: interpolation merely produces 4 kHz bandlimited speech with a sampling rate of 16 kHz rather than 8 kHz. To obtain a wideband signal, a highband signal Shb containing frequencies above 4 kHz needs to be added to the interpolated narrowband speech to form a wideband speech signal Ŝwb. The highband generation module 12 produces the signal Shb and the 1:2 interpolation module 14 produces the signal S̃nb. These signals are summed at 16 to produce the wideband signal Ŝwb.
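The 1:2 interpolation described above can be illustrated with a minimal sketch. It assumes zero insertion followed by a Hamming-windowed sinc lowpass filter; the function name `interpolate_1_to_2`, the 63-tap filter length and the 1 kHz test tone are illustrative assumptions, not details taken from the cited systems.

```python
import numpy as np

def interpolate_1_to_2(s_nb, num_taps=63):
    """Double the sampling rate: zero insertion followed by lowpass filtering."""
    up = np.zeros(2 * len(s_nb))
    up[::2] = s_nb                                   # new samples start as zeros
    n = np.arange(num_taps) - (num_taps - 1) / 2
    # Hamming-windowed sinc lowpass with cutoff at a quarter of the new rate
    # (4 kHz when the output rate is 16 kHz). Filter design is an assumption.
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    h *= 2.0 / h.sum()                               # gain of 2 offsets the inserted zeros
    return np.convolve(up, h, mode="same")

fs_nb = 8000
t = np.arange(800) / fs_nb
s_nb = np.sin(2 * np.pi * 1000 * t)                  # 1 kHz test tone at 8 kHz
s_interp = interpolate_1_to_2(s_nb)                  # same tone at 16 kHz

# Share of the output energy that lies above 4 kHz
spectrum = np.abs(np.fft.rfft(s_interp)) ** 2
freqs = np.fft.rfftfreq(len(s_interp), d=1 / (2 * fs_nb))
highband_share = spectrum[freqs > 4000].sum() / spectrum.sum()
```

In this sketch `highband_share` stays small, which reflects the point made above: interpolation raises the sampling rate but adds essentially no content above 4 kHz.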
FIG. 1B illustrates another system 20 for bandwidth extension of narrowband speech. In this figure, the narrowband speech Snb, sampled at 8 kHz, is input to an interpolation module 24. The output from interpolation module 24 is at a sampling frequency of 16 kHz. The signal is input to both a highband generation module 22 and a delay module 26. The output Shb from the highband generation module 22 and the delayed signal S̃nb output from the delay module 26 are summed at 28 to produce a wideband speech signal Ŝwb at 16 kHz.
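The signal flow of FIG. 1B can be sketched as follows. The function name `extend_fig_1b` and its parameters are hypothetical; the interpolator and highband generator are passed in as stand-ins, and the delay length is a free parameter because the description above does not specify it. The delay is assumed here to keep the interpolated signal aligned with the highband path.

```python
import numpy as np

def extend_fig_1b(s_nb, interpolate, generate_highband, delay):
    """Interpolate once, then sum a delayed copy with the generated highband."""
    s_interp = interpolate(s_nb)                       # module 24: 8 kHz -> 16 kHz
    s_hb = generate_highband(s_interp)                 # module 22
    # Module 26: shift the interpolated signal by `delay` samples (assumed length)
    s_delayed = np.concatenate((np.zeros(delay), s_interp))[:len(s_interp)]
    return s_delayed + s_hb                            # summation 28

# Crude stand-ins for illustration only: sample repetition and a silent highband
demo = extend_fig_1b(
    np.array([1.0, 2.0, 3.0]),
    interpolate=lambda s: np.repeat(s, 2),
    generate_highband=lambda s: np.zeros_like(s),
    delay=1,
)
```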
Reported bandwidth extension methods can be classified into two types: parametric and non-parametric. Non-parametric methods usually convert the received narrowband speech signal directly into a wideband signal, using simple techniques such as spectral folding, shown in FIG. 2A, and non-linear processing, shown in FIG. 2B.
These non-parametric methods extend the bandwidth of the input narrowband speech signal directly, i.e., without any signal analysis, since a parametric representation is not needed. The mechanism of spectral folding to generate the highband signal, as shown in FIG. 2A, involves upsampling 36 by a factor of 2 by inserting a zero sample following each input sample, highpass filtering with additional spectral shaping 38, and gain adjustment 40. Since the spectral folding operation reflects formants from the lower band into the upper band, i.e., highband, the purpose of the spectral shaping filter is to attenuate these signals in the highband. To reduce the spectral gap around 4 kHz, which appears in spectrally folded telephone-bandwidth speech, a multirate technique has been suggested, as is known in the art. See, e.g., H. Yasukawa, Quality Enhancement of Band Limited Speech by Filtering and Multirate Techniques, in Proc. Intl. Conf. Spoken Language Processing, ICSLP '94, pp. 1607–1610, 1994; and H. Yasukawa, Enhancement of Telephone Speech Quality by Simple Spectrum Extrapolation Method, in Proc. European Conf. Speech Comm. and Technology, Eurospeech '95, 1995.
The wideband signal is obtained by adding the generated highband signal to the interpolated (1:2) input signal, as shown in FIG. 1A. A drawback of this method is that spectral folding fails to maintain the harmonic structure of voiced speech. The method is also limited by its fixed spectral shaping and gain adjustment, which may be only partially corrected by an adaptive gain adjustment.
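The spectral folding path of FIG. 2A can be sketched as below, under simplifying assumptions: the spectral shaping filter is reduced to a plain highpass filter obtained by spectral inversion of a windowed-sinc lowpass, and the gain is a fixed constant. The function name `spectral_fold_highband` and all parameter values are illustrative.

```python
import numpy as np

def spectral_fold_highband(s_nb, gain=0.5, num_taps=63):
    """Zero insertion (36), highpass filtering (38) and fixed gain (40)."""
    up = np.zeros(2 * len(s_nb))
    up[::2] = s_nb            # zero insertion mirrors the 0-4 kHz band into 4-8 kHz
    n = np.arange(num_taps) - (num_taps - 1) / 2
    lowpass = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    highpass = -lowpass
    highpass[(num_taps - 1) // 2] += 1.0     # spectral inversion: delta minus lowpass
    return gain * np.convolve(up, highpass, mode="same")

fs_nb = 8000
t = np.arange(800) / fs_nb
s_nb = np.sin(2 * np.pi * 1000 * t)          # 1 kHz test tone at 8 kHz
s_hb = spectral_fold_highband(s_nb)

# Locate the strongest component of the generated highband signal
freqs = np.fft.rfftfreq(len(s_hb), d=1 / (2 * fs_nb))
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(s_hb)))]
```

The 1 kHz tone reappears mirrored about 4 kHz, at 7 kHz. For voiced speech the same mirroring places harmonics at frequencies that are generally not multiples of the pitch, which is the loss of harmonic structure noted above.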
The second method, shown in FIG. 2B, generates a highband signal by applying nonlinear processing 46 (e.g., waveform rectification) after interpolation (1:2) 44 of the narrowband input signal. Preferably, fullwave rectification is used for this purpose. Again, highpass and spectral shaping filters 48 with a gain adjustment 50 are applied to the rectified signal to generate the highband signal. Although a memoryless nonlinear operator maintains the harmonic structure of voiced speech, the portion of energy ‘spilled over’ to the highband and its spectral shape depend on the spectral characteristics of the input narrowband signal, making it difficult to properly shape the highband spectrum and adjust the gain.
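A minimal sketch of the nonlinear path of FIG. 2B follows. It assumes the input has already been interpolated to 16 kHz, uses fullwave rectification as the nonlinear operator, and again replaces the spectral shaping filter with a plain highpass filter. The name `rectify_highband` and the test tone are illustrative assumptions.

```python
import numpy as np

def rectify_highband(s_interp, gain=1.0, num_taps=63):
    """Fullwave rectification (46), highpass filtering (48) and fixed gain (50)."""
    rectified = np.abs(s_interp)             # memoryless nonlinearity
    n = np.arange(num_taps) - (num_taps - 1) / 2
    lowpass = 0.5 * np.sinc(0.5 * n) * np.hamming(num_taps)
    highpass = -lowpass
    highpass[(num_taps - 1) // 2] += 1.0     # spectral inversion: delta minus lowpass
    return gain * np.convolve(rectified, highpass, mode="same")

fs_wb = 16000
t = np.arange(1600) / fs_wb
s_interp = np.sin(2 * np.pi * 1500 * t)      # stands in for interpolated narrowband speech
s_hb = rectify_highband(s_interp)

# Locate the strongest component of the generated highband signal
freqs = np.fft.rfftfreq(len(s_hb), d=1 / fs_wb)
peak_hz = freqs[np.argmax(np.abs(np.fft.rfft(s_hb)))]
```

Rectifying the 1.5 kHz tone creates components at multiples of 3 kHz, so the strongest highband component lies at 6 kHz. How much energy reaches the highband depends on the input, which is why a fixed gain is hard to choose.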
The main advantages of the non-parametric approach are its relatively low complexity and its robustness, stemming from the fact that no model needs to be defined and, consequently, no parameters need to be extracted and no training is needed. These characteristics, however, typically result in lower quality when compared with parametric methods.
Parametric methods separate the processing into two parts as shown in FIG. 3. A first part 54 generates the spectral envelope of a wideband signal from the spectral envelope of the input signal, while a second part 56 generates a wideband excitation signal, to be shaped by the generated wideband spectral envelope 58. Highpass filtering and gain 60 extract the highband signal for combining with the original narrowband signal to produce the output wideband signal. A parametric model is usually used to represent the spectral envelope and, typically, the same or a related model is used in 58 for synthesizing the intermediate wideband signal that is input to block 60.
Common models for spectral envelope representation are based on linear prediction (LP), such as linear prediction coefficients (LPC) and line spectral frequencies (LSF); cepstral representations, such as cepstral coefficients and mel-frequency cepstral coefficients (MFCC); or spectral envelope samples, usually logarithmic, typically extracted from an LP model. Almost all parametric techniques use an LPC synthesis filter for wideband signal generation (typically an intermediate wideband signal which is further highpass filtered), by exciting it with an appropriate wideband excitation signal.
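The LP analysis and LPC synthesis filtering mentioned above can be sketched as follows, assuming the autocorrelation method with the Levinson-Durbin recursion and the convention A(z) = 1 + a1 z^-1 + ... + ap z^-p. The function names `lpc` and `synthesize` and the second-order test signal are illustrative, not taken from any cited method.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LP analysis via the Levinson-Durbin recursion."""
    n = len(frame)
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err   # reflection coefficient
        a[1:i + 1] += k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= 1.0 - k * k                                  # prediction error energy
    return a, err

def synthesize(excitation, a):
    """All-pole synthesis filter 1/A(z) driven by an excitation signal."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        past = y[max(0, n - len(a) + 1):n][::-1]            # y[n-1], y[n-2], ...
        y[n] = excitation[n] - np.dot(a[1:1 + len(past)], past)
    return y

rng = np.random.default_rng(0)
a_true = np.array([1.0, -1.5, 0.9])                  # assumed test envelope
x = synthesize(rng.standard_normal(4000), a_true)    # noise shaped by 1/A(z)
a_est, _ = lpc(x, order=2)                           # recover the envelope
residual = np.convolve(x, a_est)[:len(x)]            # inverse filtering with A(z)
x_rebuilt = synthesize(residual, a_est)              # re-excite the synthesis filter
```

In a parametric bandwidth extension scheme, the coefficients would come from an estimated wideband envelope and the excitation would be a generated wideband signal, rather than the residual of the same signal as in this round trip.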
Parametric methods can be further classified into those that require training, and those that do not and hence are simpler and more robust. Most reported parametric methods require training, like those that are based on vector quantization (VQ), using codebook mapping of the parameter vectors or linear, as well as piecewise linear, mapping of these vectors. Neural-net-based methods and statistical methods also use parametric models and require training.
In the training phase, the relationship or dependence between the original narrowband and highband (or wideband) signal parameters is extracted. This relationship is then used to obtain an estimated spectral envelope shape of the highband signal from the input narrowband signal on a frame-by-frame basis.
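Codebook mapping of parameter vectors can be illustrated with a small sketch. It assumes that the narrowband codebook is trained by a few k-means (Lloyd) iterations and that each highband codeword is the mean of the highband vectors whose narrowband partners fall in the same cell. The training procedure, function names and toy data are assumptions for illustration and are not drawn from any specific reported method.

```python
import numpy as np

def nearest(vectors, codebook):
    """Index of the closest codeword (squared Euclidean distance) per vector."""
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def train_codebooks(nb_vecs, hb_vecs, size, iters=20, seed=0):
    """Training phase: learn paired narrowband and highband codebooks."""
    rng = np.random.default_rng(seed)
    nb_code = nb_vecs[rng.choice(len(nb_vecs), size, replace=False)].copy()
    for _ in range(iters):                           # Lloyd (k-means) iterations
        labels = nearest(nb_vecs, nb_code)
        for c in range(size):
            if np.any(labels == c):
                nb_code[c] = nb_vecs[labels == c].mean(axis=0)
    labels = nearest(nb_vecs, nb_code)
    hb_code = np.stack([
        hb_vecs[labels == c].mean(axis=0) if np.any(labels == c)
        else hb_vecs.mean(axis=0)                    # fallback for an empty cell
        for c in range(size)
    ])
    return nb_code, hb_code

def estimate_highband(nb_vec, nb_code, hb_code):
    """Per-frame mapping: nearest narrowband codeword selects the highband one."""
    return hb_code[nearest(nb_vec[None, :], nb_code)[0]]

# Toy training data: two narrowband clusters, each paired with one highband value
rng = np.random.default_rng(1)
nb_train = np.vstack((rng.normal(0.0, 0.1, (50, 2)), rng.normal(10.0, 0.1, (50, 2))))
hb_train = np.vstack((np.full((50, 1), 1.0), np.full((50, 1), 5.0)))
nb_code, hb_code = train_codebooks(nb_train, hb_train, size=2)
```

At run time, `estimate_highband` would be called once per frame with that frame's narrowband envelope parameters.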
Not all parametric methods require training. A method that does not require training is reported in H. Yasukawa, Restoration of Wide Band Signal from Telephone Speech Using Linear Prediction Error Processing, in Proc. Intl. Conf. Spoken Language Processing, ICSLP 1996, pp. 901–904 (the “Yasukawa Approach”). The contents of this article are incorporated herein by reference for background material. The Yasukawa Approach is based on the linear extrapolation of the spectral tilt of the input speech spectral envelope into the upper band. The extended envelope is converted into a signal by inverse DFT, from which LP coefficients are extracted and used for synthesizing the highband signal. The synthesis is carried out by exciting the LPC synthesis filter with a wideband excitation signal. The excitation signal is obtained by inverse filtering the input narrowband signal and spectrally folding the resulting residual signal. The main disadvantage of this technique is its rather simplistic approach to generating the highband spectral envelope, which relies only on the spectral tilt in the lower band.
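The envelope extension step of such a training-free approach can be sketched roughly as follows. This is not the published algorithm: it assumes a natural-log magnitude envelope sampled uniformly from 0 to 4 kHz, a least-squares line fit as the tilt estimate, and treatment of the extended envelope as a power spectrum so that the inverse DFT yields an autocorrelation sequence from which LP coefficients follow. All function names are hypothetical.

```python
import numpy as np

def extend_envelope_by_tilt(log_env_nb):
    """Extend a log-magnitude envelope to twice the bandwidth along its tilt."""
    n = len(log_env_nb)
    slope, intercept = np.polyfit(np.arange(n), log_env_nb, 1)  # spectral tilt
    upper = slope * np.arange(n, 2 * n) + intercept             # extrapolated band
    return np.concatenate((log_env_nb, upper))

def lp_from_envelope(log_env_wb, order):
    """LP coefficients [1, a1, ..., ap] from a log-magnitude envelope."""
    power = np.exp(2.0 * log_env_wb)            # log magnitude -> power spectrum
    r = np.fft.irfft(power)[:order + 1]         # inverse DFT -> autocorrelation
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    R = r[idx]                                  # Toeplitz autocorrelation matrix
    return np.concatenate(([1.0], np.linalg.solve(R, -r[1:])))  # normal equations

log_env_nb = 1.0 - 0.01 * np.arange(64)         # toy envelope with a constant tilt
log_env_wb = extend_envelope_by_tilt(log_env_nb)
a_wb = lp_from_envelope(log_env_wb, order=10)
```

The resulting coefficients would define the LPC synthesis filter that is then driven by the spectrally folded residual, as described above.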