The public switched telephone network (PSTN) and most of today's cellular networks use narrowband (0.3-3.4 kHz) speech coders. This bandwidth limits the naturalness and intelligibility of speech [1] and is most problematic for sounds whose energy is spread over the entire audible spectrum. For example, unvoiced sounds such as ‘s’ and ‘f’ are often difficult to distinguish in a narrowband representation. FIGS. 1A-1F provide spectral plots for different phonemes. For the fricatives (‘s’, ‘sh’, ‘z’) of FIGS. 1A-1C, respectively, the energy is spread throughout the spectrum; however, most of the energy of the vowels (‘ae’, ‘aa’, ‘ay’) of FIGS. 1D-1F, respectively, lies within the low frequency range [2]. Split-band compression algorithms recover the narrowband spectrum (0.3-3.4 kHz) and the high band spectrum (3.4-7 kHz) separately. The main goal of these algorithms is to encode wideband (0.3-7 kHz) speech at the minimum possible bit rate. A number of these techniques make use of the correlation between the low band and the high band to predict the wideband speech from extracted narrowband features [3, 4, 5, 6, 7]. Some of these algorithms attempt to cleverly embed the high band parameters in the low frequency band [8, 9]. Others generate coarse representations of the high band at the encoder and transmit them as side information to the decoder [10, 11, 12, 13, 3, 14, 15].
A set of popular bandwidth extension algorithms attempt to recover wideband speech from narrowband content using predictive models. However, recent studies show that the mutual information between the narrowband and the high frequency bands is often insufficient for prediction-based wideband synthesis [16, 17, 18]. In the tables of FIGS. 2A and 2B, a predictability metric developed by Nilsson et al. [16] is shown for the high band for two different scenarios. This predictability metric is the ratio of the mutual information between a set of low-band and high band features to the uncertainty (entropy) of the high band features. FIG. 2A provides the ratio of the mutual information I(f; y) between the narrowband cepstral coefficients (f) and the high band to low-band energy ratio (y) to the entropy H(y) of that energy ratio, for different sounds. FIG. 2B provides the ratio of the mutual information I(f; y) between the narrowband cepstral coefficients (f) and the high band cepstral coefficients (y) to the entropy H(y) of the high band cepstral coefficients, for different sounds. As the tables show, the available narrowband information reduces the uncertainty in the high band energy by only about 13% and in the high band cepstrum by only about 9%. These results imply that algorithms based on predicting the high band often generate erroneous estimates [10]. Therefore, for improved robustness, the high band spectrum should be quantized and transmitted as side information.
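The ratio I(f; y)/H(y) described above can be illustrated with a minimal sketch. The sketch assumes that both feature streams have already been quantized to discrete symbols (for example, codebook indices), so that probabilities can be estimated by counting; the function names are hypothetical, and this is not the estimation procedure used by Nilsson et al. for continuous features.

```python
import math
from collections import Counter


def entropy(symbols):
    """Empirical entropy H(y) in bits of a sequence of discrete symbols."""
    n = len(symbols)
    counts = Counter(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """Empirical mutual information I(x; y) in bits of two aligned sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) / (p(x) p(y)) expressed with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def predictability(low_band, high_band):
    """Ratio I(f; y) / H(y): fraction of high band uncertainty removed by f."""
    h = entropy(high_band)
    if h == 0.0:
        return 0.0  # assumption: a constant high band is reported as 0
    return mutual_information(low_band, high_band) / h
```

Under these assumptions, a ratio near 1 means the low-band symbols almost fully determine the high band symbols, and a ratio near 0 means they carry almost no information about them. The values of about 0.13 and 0.09 reported in FIGS. 2A and 2B fall close to the latter case.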
A few split-band coders based on coarse high band representations have been recently proposed [3, 12, 13, 19]. Although these techniques provide improved speech quality relative to prediction-based algorithms, most do not exploit opportunities to further reduce bit rates through perceptual modeling. In fact, the bit rates associated with the high band representation are often unnecessarily high because these coders allocate the same number of bits for high band generation to each frame [3, 13]. It is apparent from FIGS. 1A-1F that a wideband representation is more beneficial for certain frame types (e.g., unvoiced fricatives). In an effort to further study which frames benefit from full-bandwidth representations, the partial loudness (PL) of the high band in the presence of the low band is analyzed [20, 21, 22]. The PL is a metric for estimating the contribution of the high band to the overall loudness of a speech segment. In FIG. 3, the PL for different phonemes is plotted. As shown in FIG. 3, for most phonemes the partial loudness of the high band is under 0.25 sones. The sone is a unit of loudness: one sone is defined as the loudness of a 1000 Hz tone at 40 dB SPL, presented binaurally from a frontal direction in free field. With the exception of a few fricatives, the high band contribution to the overall loudness of the frame is relatively small. As such, algorithms that perform bandwidth extension by encoding the high band of every frame often operate at unnecessarily high bit rates.
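The bit rate argument above can be made concrete with a short sketch. It assumes that a per-frame high band partial loudness value in sones is already available from a loudness model, and it uses the 0.25 sone level mentioned above as a hypothetical decision threshold; the function name, the threshold role, and the bit counts are illustrative assumptions only and do not describe any particular coder.

```python
def average_high_band_bits(partial_loudness_sones, bits_per_coded_frame,
                           threshold_sones=0.25):
    """Average high band bits per frame when only perceptually significant
    frames carry side information.

    partial_loudness_sones: per-frame high band partial loudness (assumed given).
    bits_per_coded_frame:   hypothetical bit cost of one coded high band frame.
    threshold_sones:        hypothetical threshold; frames below it send nothing.
    """
    if not partial_loudness_sones:
        return 0.0
    coded_frames = sum(1 for pl in partial_loudness_sones
                       if pl >= threshold_sones)
    return coded_frames * bits_per_coded_frame / len(partial_loudness_sones)


# Illustrative values only: one of four frames exceeds the threshold.
example_pl = [0.10, 0.30, 0.05, 0.20]
fixed_rate = 16.0  # every frame coded with 16 bits (hypothetical)
selective_rate = average_high_band_bits(example_pl, 16)
```

In this illustrative case the fixed allocation spends 16 bits on every frame, while the selective allocation averages 4 bits per frame. A practical scheme would also need some means of signaling which frames carry high band data, which this sketch omits.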
FIGS. 2A and 2B show that some side information should be transmitted to the decoder in order to accurately characterize certain wideband speech; the plot of FIG. 3, however, indicates that side information is not necessary for every frame. Accordingly, there is a need for an encoding technique that reduces the amount of side information used for the high band without affecting speech quality.