The present invention relates to methods and systems for improving the quality and intelligibility of speech signals in communications systems. All communications systems, especially wireless communications systems, suffer from bandwidth limitations. The quality and intelligibility of speech signals transmitted in such systems must be balanced against the limited bandwidth available to the system. In wireless telephone networks, for example, the bandwidth is typically set according to the minimum bandwidth necessary for successful communication. The lowest frequency important to understanding a vowel is about 200 Hz and the highest frequency vowel formant is about 3000 Hz. Most consonants, however, are broadband, usually having energy in frequencies below about 3400 Hz. Accordingly, most wireless speech communication systems are optimized to pass frequencies between 300 and 3400 Hz.
A typical passband 10 for a speech communication system is shown in FIG. 1. In general, passband 10 is adequate for delivering speech signals that are both intelligible and a reasonable facsimile of a person's speaking voice. Nonetheless, much of the speech information contained in higher frequencies outside the passband 10, mainly that related to the sounding of consonants, is lost due to bandpass filtering. This can have a detrimental impact on intelligibility in environments where a significant amount of noise is present.
The passband standards that gave rise to the typical passband 10 shown in FIG. 1 are based on near field measurements where the microphone picking up a speaker's voice is located within 10 cm of the speaker's mouth. In such cases the signal-to-noise ratio is high and sufficient high frequency information is retained to make most consonants intelligible. In far field arrangements, such as hands-free telephone systems, the microphone is located 20 cm or more from the speaker's mouth. Under these conditions the signal-to-noise ratio is much lower than when using a traditional handset. The noise problem is exacerbated by road, wind and engine noise when a hands-free telephone is employed in a moving automobile. In fact, the noise level in a car with a hands-free telephone can be so high that many broadband low energy consonants are completely masked.
As an example, FIG. 2 shows two spectrographs of the spoken word “seven”. The first spectrograph 12 is taken under quiet, near field conditions. The second spectrograph 14 is taken under the noisy, far field conditions typical of a hands-free phone in a moving automobile. Referring first to the “quiet” seven 12, we can see evidence of each of the sounds that make up the spoken word seven. First we see the sound of the “S” 16. This is a broadband sound having most of its energy in the higher frequencies. We see the first and second Es and all their harmonics 18, 22, and the broadband sound of the “V” 20 sandwiched therebetween. The sound of the “N” at the end of the word is merged with the second E 22 until the tongue is released from the roof of the mouth, giving rise to the short broadband energies 24 at the end of the word.
The ability to hear consonants is the single most important factor governing the intelligibility of speech signals. Comparing the “quiet” seven 12 to the “noisy” seven 14, we see that the “S” sound 16 is completely masked in the second spectrograph 14. The only sounds that can be seen with any clarity in the spectrograph 14 of the “noisy” seven are the sounds of the first and second Es 18, 22. Thus, under the noisy conditions, the intelligibility of the spoken word “seven” is significantly reduced. If the noise energy is significantly higher than the consonants' energies (e.g., by 3 dB), no amount of noise removal or filtering within the passband will improve intelligibility.
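To put the 3 dB figure in linear terms, the following minimal sketch converts between decibels and an energy ratio. It assumes the usual convention for energy or power quantities, 10·log10(ratio); the function names are illustrative and do not come from the disclosure.

```python
import math


def db_to_energy_ratio(level_db):
    """Convert a level difference in dB to a linear energy (power) ratio.

    Assumes the 10*log10 convention used for energy and power quantities.
    """
    return 10.0 ** (level_db / 10.0)


def energy_ratio_to_db(ratio):
    """Convert a linear energy (power) ratio to a level difference in dB."""
    if ratio <= 0.0:
        raise ValueError("ratio must be positive")
    return 10.0 * math.log10(ratio)


if __name__ == "__main__":
    # Noise that is 3 dB above a consonant carries roughly twice its energy.
    print(round(db_to_energy_ratio(3.0), 3))
```

Under this convention, noise that exceeds a consonant by 3 dB has about twice the consonant's energy within the same band.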
Car noise tends to fall off with frequency. Many consonants (e.g., F, T, S), on the other hand, tend to possess significant energy at much higher frequencies. For example, often the only information in a speech signal above 10 kHz is related to consonants. FIG. 3 repeats the spectrograph of the word “seven” recorded in a noisy environment, but extended over a wider frequency range. The sound of the “S” 16 is clearly visible, even in the presence of a significant amount of noise, but only at frequencies above about 6000 Hz. Since cell phone passbands exclude frequencies greater than 3400 Hz, this high frequency information is lost in traditional cell phone communications. Due to the high demand for bandwidth capacity, expanding the passband to preserve this high frequency information is not a practical solution for improving the intelligibility of speech communications.
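The effect of the passband on such high frequency content can be illustrated with a minimal sketch that treats the 300 to 3400 Hz passband as an ideal (brick-wall) filter. The idealized filter shape, the function name, and the component values in the example are assumptions made for illustration only; they are not measurements taken from the figures.

```python
PASSBAND_HZ = (300.0, 3400.0)  # passband limits stated in the text


def split_by_passband(components, passband=PASSBAND_HZ):
    """Split (frequency_hz, energy) pairs into those kept and those lost.

    Models the passband as an ideal brick-wall filter, which is a
    simplifying assumption; a real channel filter rolls off gradually.
    """
    low, high = passband
    kept = [(f, e) for f, e in components if low <= f <= high]
    lost = [(f, e) for f, e in components if not (low <= f <= high)]
    return kept, lost


if __name__ == "__main__":
    # Hypothetical components: two vowel-range components and one
    # "S"-like component above 6000 Hz.
    example = [(500.0, 1.0), (2500.0, 0.5), (6500.0, 0.8)]
    kept, lost = split_by_passband(example)
    print("kept:", kept)
    print("lost:", lost)
```

In this idealized model, any component above 3400 Hz, such as the part of the “S” that remains visible above the noise, never reaches the receiver.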
Attempts have been made to compress speech signals so that their entire spectrum (or at least a significant portion of the high frequency content that is normally lost) falls within the passband. FIG. 4 shows a 5500 Hz speech signal 26 that is to be compressed in this manner. Signal 28 in FIG. 5 is the 5500 Hz signal 26 of FIG. 4 linearly compressed into the narrower 3000 Hz range. Although the compressed signal 28 only extends to 3000 Hz, all of the high frequency content of the original signal 26 contained in the frequency range from 3000 to 5500 Hz is preserved in the compressed signal 28, but at the cost of significantly altering the fundamental pitch and tonal qualities of the original signal. All frequencies of the original signal 26, including the lower frequencies relating to vowels, which control pitch, are compressed into lower frequency ranges. If the compressed signal 28 is reproduced without subsequent re-expansion, the speech will have an unnaturally low pitch that is unacceptable for speech communication. Expanding the compressed signal at the receiver will solve this problem, but this requires knowledge at the receiver of the compression applied by the transmitter. Such a solution is not practical for most telephone applications, where there are no provisions for sending coding information along with the speech signal.
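The pitch problem with linear compression can be seen from the frequency mapping alone. The sketch below assumes that "linearly compressed" means every frequency is scaled by the same factor, 3000/5500; the function names and the 220 Hz example component are illustrative assumptions rather than values from the figures.

```python
SRC_MAX_HZ = 5500.0  # upper limit of the original signal 26
DST_MAX_HZ = 3000.0  # upper limit of the compressed signal 28


def compress_frequency(f_hz, src_max=SRC_MAX_HZ, dst_max=DST_MAX_HZ):
    """Map a frequency in [0, src_max] linearly into [0, dst_max]."""
    if not 0.0 <= f_hz <= src_max:
        raise ValueError("frequency outside the source range")
    return f_hz * dst_max / src_max


def expand_frequency(f_hz, src_max=SRC_MAX_HZ, dst_max=DST_MAX_HZ):
    """Invert compress_frequency; requires knowing both range limits."""
    if not 0.0 <= f_hz <= dst_max:
        raise ValueError("frequency outside the compressed range")
    return f_hz * src_max / dst_max


if __name__ == "__main__":
    # A hypothetical 220 Hz voiced component is pulled down along with
    # the high frequency content that the compression is meant to save.
    for f in (220.0, 3000.0, 5500.0):
        print(f, "->", compress_frequency(f))
```

Because the same factor applies across the whole spectrum, low frequency components that carry pitch are lowered in the same proportion as the consonant energy, and undoing the mapping requires the receiver to know both range limits.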
In order to preserve higher frequency speech information, an encoding system or compression technique intended for telephone or other open network applications, where speech signal transmitters and receivers have no knowledge of the capabilities of their opposite members, must be sufficiently flexible that the quality of the speech signal reproduced at the receiver is acceptable regardless of whether a compressed signal is re-expanded at the receiver, or whether a non-compressed signal is subsequently expanded. According to an improved encoding system or technique, a transmitter may encode a speech signal without regard to whether the receiver at the opposite end of the communication has the capability of decoding the signal. Similarly, a receiver may decode a received signal without regard to whether the signal was first encoded at the transmitter. In other words, an improved encoding system or compression technique should compress speech signals in a manner such that the quality of the reproduced speech signal is satisfactory even if the signal is reproduced without re-expansion at the receiver. The speech quality should also be satisfactory in cases where a receiver expands a speech signal even though the received signal was not first encoded by the transmitter. Further, such an improved system should show marked improvement in the intelligibility of transmitted speech signals when the transmitted speech signal is compressed according to the improved technique at the transmitter.