By far the most common way to receive speech signals is directly, face to face, with only the ear setting a lower frequency limit of around 20 Hz and an upper frequency limit of around 20 kHz. The common telephone narrowband speech signal bandwidth of 0.3-3.4 kHz is considerably narrower than what one would experience in a face-to-face encounter with a sound source, but it is sufficient to facilitate the reliable communication of speech. Extending this narrowband speech signal to a wider bandwidth would nevertheless be beneficial, because it would increase the perceived naturalness of the speech signal.
Bandwidth extension methods previously suggested include codebook approaches (see, e.g., Y. Yoshida, M. Abe, An algorithm to reconstruct wide-band speech from narrowband speech based on codebook mapping, Conf. Proc. ICSLP 94, pp. 1591-1594, Yokohama, 1994; and J. Epps, W. H. Holmes, Speech enhancement using STC-based bandwidth extension, Conf. Proc. ICSLP, 1998) and aliasing/folding approaches (see, e.g., J. Makhoul, M. Berouti, High frequency regeneration in speech coding systems, Conf. Proc. ICASSP, pp. 428-431, Washington, USA, 1979; and H. Yasukawa, Quality enhancement of band limited speech by filtering and multirate techniques, Conf. Proc. ICSLP 94, pp. 1607-1610, Yokohama, 1994). The aliasing approach is generally simple in structure. In this approach, the narrowband signal is up-sampled by inserting zeros between the narrowband signal samples. Such up-sampling is conventionally followed by a reconstruction lowpass filter having a cut-off frequency at half the new sampling rate. When a shaping filter is substituted for this filter, the aliased/folded frequency content in the upper-frequency region extends the speech content. The drawbacks of this technique are that the harmonic structure of the speech is not continued in the upper-frequency region, and that a suitable amplitude level of the upper frequency-band is generally not achieved for all speech sounds.
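The folding effect of zero insertion can be illustrated with a minimal sketch. It is not taken from the cited references; the function names, the 16-sample test tone, and the detection threshold are assumptions chosen for illustration. A single tone is up-sampled by a factor of two, and the discrete Fourier transform of the result shows both the original component and its image mirrored about the original Nyquist frequency.

```python
import cmath
import math


def zero_insert(samples, factor=2):
    """Up-sample by placing (factor - 1) zeros after every input sample."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (factor - 1))
    return out


def dft_magnitude(samples):
    """Magnitude of the plain O(N^2) discrete Fourier transform."""
    n = len(samples)
    return [
        abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def peak_bins(samples, threshold=1.0):
    """Bins from 0 up to the Nyquist bin whose magnitude exceeds the threshold."""
    mags = dft_magnitude(samples)
    return [k for k in range(len(samples) // 2 + 1) if mags[k] > threshold]


# Illustrative narrowband input: one tone at bin 3 of a 16-point frame.
N, K0 = 16, 3
narrow = [math.cos(2 * math.pi * K0 * t / N) for t in range(N)]
wide = zero_insert(narrow, 2)

print(peak_bins(narrow))  # tone bin of the narrowband frame
print(peak_bins(wide))    # tone bin plus its folded image in the 32-point frame
```

In the 32-point frame, bin 16 is the new Nyquist frequency and bin 8 is the old one, so the image appears as far above bin 8 as the tone lies below it. A shaping filter applied at this point would act on that image instead of removing it.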
The codebook approach is a more advanced solution, in which the narrow frequency-band is analyzed with a codebook look-up method. The codebook index is matched one-to-one with a filter that is suitable for shaping an excitation signal. The excitation signal can, for example, be created with an aliasing/folding method. The codebook approach has also been tested for the lower frequency-band (see, e.g., the Y. Yoshida and M. Abe reference cited above).
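The look-up step can be sketched as a nearest-centroid search. Everything in this example is hypothetical: the two-dimensional feature vectors, the codebook entries, and the filter identifiers are placeholders, and the cited references may use different features and distance measures.

```python
# Hypothetical codebook: each entry pairs a narrowband feature centroid with
# the identifier of one shaping filter (one-to-one mapping by index).
CODEBOOK = [
    ([0.9, 0.1], "shaping_filter_0"),
    ([0.5, 0.5], "shaping_filter_1"),
    ([0.1, 0.9], "shaping_filter_2"),
]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def lookup(features, codebook=CODEBOOK):
    """Return (index, filter identifier) of the centroid closest to features."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(features, codebook[i][0]),
    )
    return index, codebook[index][1]


index, filter_id = lookup([0.2, 0.8])
print(index, filter_id)
```

The selected filter would then shape an excitation signal, for example one produced by the aliasing/folding method.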
Speech signals are generally described by a short-time-segment model comprising a filter and a signal excitation. The filter describes the human vocal tract and the coupling between the excitation source and the vocal tract. The sound radiation characteristics from the mouth may also be included in this filter. Generally, it is sufficient to use an all-pole filter to estimate the vocal tract, coupling, and radiation characteristics. This filter will then only vaguely approximate zeros introduced by, for example, the nasal tract or lateral consonants. This estimation problem can be reduced by increasing the filter order.
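One common way to estimate an all-pole filter is linear prediction using the autocorrelation method and the Levinson-Durbin recursion. The sketch below assumes that technique; the text above does not name a specific estimation method, and the function names and the two-pole test filter are illustrative. The predictor convention is x[n] approximated by a[1]*x[n-1] + ... + a[p]*x[n-p].

```python
import math


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite sequence."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations; return (predictor coefficients, residual energy)."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err  # reflection coefficient for stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err


def all_pole_impulse_response(coeffs, length):
    """Impulse response of h[n] = sum_i coeffs[i-1] * h[n-i] + delta[n]."""
    h = []
    for n in range(length):
        value = 1.0 if n == 0 else 0.0
        for i, c in enumerate(coeffs, start=1):
            if n - i >= 0:
                value += c * h[n - i]
        h.append(value)
    return h


# Illustrative two-pole resonator: pole radius 0.9, pole angle pi/4.
true_coeffs = [2 * 0.9 * math.cos(math.pi / 4), -0.81]
segment = all_pole_impulse_response(true_coeffs, 400)
estimated, residual = levinson_durbin(autocorrelation(segment, 2), 2)
print(estimated)
```

Because the test signal is itself the output of an all-pole filter, a second-order predictor recovers its coefficients closely. A signal with spectral zeros would not be matched this well, which is why a higher filter order is needed in that case.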
Speech signals are considered to be stationary during segments of 10-30 ms. This segment duration is determined by the fact that it takes approximately 70 ms for tissue in the vocal tract to change from one end-position to another. Hence, the vocal tract and the speech sounds can be completely different after this interval, but rarely after shorter durations of time.
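Short-time analysis based on this assumption typically splits the signal into overlapping frames. The following sketch assumes a 20 ms frame and a 10 ms hop at an 8 kHz sampling rate; these values and the function name are illustrative choices within the 10-30 ms range, not requirements.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=20, hop_ms=10):
    """Split samples into full-length frames of frame_ms, advancing by hop_ms."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000))
    hop_len = int(round(sample_rate_hz * hop_ms / 1000))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# 100 ms of placeholder samples at 8 kHz.
signal = list(range(800))
frames = frame_signal(signal, 8000)
print(len(frames), len(frames[0]))
```

Each frame can then be treated as approximately stationary and analyzed with its own filter and excitation parameters.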
During voiced speech segments, the poles of the filter can be described as estimates of the formants of speech, and also of the coupling between the formants and the excitation source. The formants are the resonance frequencies of the vocal tract, either the whole or parts of it. Hence, the amplitude level at these formant frequencies is higher than at adjacent frequencies, assuming the vocal fold source is present.
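The relation between a complex-conjugate pole pair and a resonance can be made concrete with a short sketch. It assumes a second-order predictor x[n] approximated by a1*x[n-1] + a2*x[n-2], whose poles are the roots of z^2 - a1*z - a2; the function name and the numeric example are illustrative. The bandwidth expression is the usual approximation based on the pole radius.

```python
import math


def resonance_from_pole_pair(a1, a2, sample_rate_hz):
    """Return (center frequency, approximate bandwidth) in Hz for a complex pole pair."""
    discriminant = a1 * a1 + 4.0 * a2
    if discriminant >= 0.0:
        raise ValueError("poles are real, so they do not form a resonance")
    radius = math.sqrt(-a2)                            # |z| of each pole
    angle = math.atan2(math.sqrt(-discriminant), a1)   # pole angle in radians
    frequency_hz = angle * sample_rate_hz / (2.0 * math.pi)
    bandwidth_hz = -math.log(radius) * sample_rate_hz / math.pi
    return frequency_hz, bandwidth_hz


# Illustrative pole pair: radius 0.9 at angle pi/4, sampled at 8 kHz.
a1 = 2 * 0.9 * math.cos(math.pi / 4)
a2 = -0.81
print(resonance_from_pole_pair(a1, a2, 8000))
```

A pole closer to the unit circle gives a narrower bandwidth and therefore a more pronounced peak at the resonance frequency.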
During unvoiced speech segments, the poles of the filter do not describe the formants, although the poles of the filter describe the resonance frequencies of the vocal tract, or more correctly the oral tract. The unvoiced speech is generated with almost no use of the lower part of the vocal tract. The number of noticeable resonances is often limited to one or two in the oral tract because of the short length of the cavity. Another aspect of the short resonators common for unvoiced speech segments is that the speech content is high in frequency, generally having prominent and perceptually important content above 3.4 kHz.
The sources that excite the filter can be divided into two types: the quasi-periodic source and the turbulent noise source. The vocal folds in the larynx are the main source during voiced speech segments. This source is of a quasi-periodic type, normally having a fundamental frequency in the range of 70-400 Hz. This fundamental frequency is also called the pitch frequency, and a person can, during speech, increase the pitch frequency by about 100% compared to a relaxed state. The signal generated by the vocal folds looks like a skewed half-wave rectified sine wave, and thereby also contains harmonics. The harmonics are perceptually important because formants are grouped according to the fundamental frequency of their excitation; that is, formants having the same fundamental frequency will form a speech sound. It has been shown that in concurrent speech environments the fundamental frequency is even more important than the direction of the sound.
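That a half-wave rectified sine wave contains harmonics can be checked numerically. The sketch below is a simplification: it ignores the skew mentioned above and uses an ideal, symmetric half-wave rectified sine with 64 samples per period, and the function names are illustrative. It computes the magnitudes of the first few Fourier series coefficients of one period.

```python
import cmath
import math


def half_wave_rectified_sine(samples_per_period):
    """One period of max(0, sin), uniformly sampled."""
    return [
        max(0.0, math.sin(2 * math.pi * n / samples_per_period))
        for n in range(samples_per_period)
    ]


def harmonic_magnitudes(period, count):
    """Magnitudes of Fourier coefficients 0..count of one sampled period."""
    n = len(period)
    mags = []
    for k in range(count + 1):
        coeff = sum(
            period[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        ) / n
        mags.append(abs(coeff))
    return mags


F0_HZ = 100.0  # assumed fundamental frequency, within the 70-400 Hz range
mags = harmonic_magnitudes(half_wave_rectified_sine(64), 4)
for k, m in enumerate(mags):
    print(f"{k * F0_HZ:6.0f} Hz  magnitude {m:.4f}")
```

For this idealized waveform, energy appears at the fundamental and at its even multiples, while the third harmonic vanishes. A real glottal pulse has a different shape, so its harmonic amplitudes differ from these values.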
The turbulent noise source is generated either by using a constriction to steer an air stream against an obstacle, or by the constriction alone causing a turbulent air volume velocity. When an obstacle is used, the resulting noise amplitude level is higher. Noise sources can be generated at many locations in the vocal tract, but the most prominent ones are generated in the oral cavity.
The perception of speech by the human hearing mechanism has some important properties. Human hearing is commonly described as having a logarithmic sensitivity with respect to both frequency and amplitude level. As a result, low frequencies carry more information in smaller frequency-bands. One way of describing this is the Bark scale, which has frequency bands of 100 Hz in the lower frequency region and approximately 1 kHz in the upper frequency region. The amplitude level is often presented in decibels, since this logarithmic scale is quite consistent with the amplitude level sensitivity of human hearing, that is, with loudness perception.
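These two logarithmic descriptions can be sketched numerically. The example assumes the widely used Zwicker and Terhardt analytic approximations for the Bark scale and the critical bandwidth; the text above does not specify a formula, so these are one possible choice, and the function names are illustrative.

```python
import math


def hz_to_bark(frequency_hz):
    """Approximate critical-band rate in Bark (Zwicker and Terhardt approximation)."""
    return 13.0 * math.atan(0.00076 * frequency_hz) + 3.5 * math.atan(
        (frequency_hz / 7500.0) ** 2
    )


def critical_bandwidth_hz(frequency_hz):
    """Approximate critical bandwidth in Hz around the given center frequency."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (frequency_hz / 1000.0) ** 2) ** 0.69


def amplitude_ratio_to_db(ratio):
    """Convert an amplitude ratio to decibels."""
    return 20.0 * math.log10(ratio)


for f in (100.0, 1000.0, 5000.0):
    print(f, hz_to_bark(f), critical_bandwidth_hz(f))
print(amplitude_ratio_to_db(10.0))
```

Under these approximations the critical bandwidth is close to 100 Hz at low frequencies and approaches 1 kHz around 5 kHz, consistent with the band widths given above.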