The art of processing audio signals spans a wide range of technologies and efforts. Despite the plethora of signal processing advancements related to audio signals, the processing of audio signals including or created as part of oral communications and, particularly, human speech remains a substantial challenge. For example, despite substantial investments in research and resources, speech processing and, particularly, speech recognition systems are still quite limited. These limits are due, at least in part, to the complexities of human speech and a limited understanding of natural auditory and cognitive processing capabilities. For example, the ability to recover speech information, despite dramatic articulatory and acoustic assimilation and coarticulation of speech sounds, poses substantial hurdles to enhancement of speech signals and automated processing of the underlying information communicated in speech. These hurdles are further compounded when, for example, the individual receiving the speech signals has a hearing impairment.
Reports indicate that only about 20 percent of the more than 30 million adults with hearing loss in this country currently use hearing aids, and by 2030 there could be over 40 million adults and over 2 million children with hearing loss in the United States. The National Council on Aging indicates that untreated hearing loss of any degree has significant consequences for people's social lives, emotional health, mental health, and physical well-being. Furthermore, the Better Hearing Institute estimates that earning potential for individuals with untreated hearing loss is reduced by an average of $23,000 per year, which is twice as much as for individuals with hearing aids. When multiplied by the number of American workers with hearing loss, the magnitude of total annual lost income is staggering. While many factors are related to these numbers, hearing aid performance is an important variable, as indicated by the finding that only about half of all users are satisfied with how their hearing aids perform in noise. Advancements in hearing aid performance have the potential to improve quality of life for more than 10 percent of the American population as well as the productivity of the average hearing-impaired worker. U.S. Pat. No. 6,732,073 to Kluender et al. provides a substantial summary of some of the difficulties and impediments to speech signal processing and enhancement and is incorporated herein by reference.
For some time, it has been understood that at least two components of sensorineural hearing loss (SNHL) reduce listeners' access to speech information. The first is a loss of sensitivity, which results in an attenuation of speech. To overcome a loss of sensitivity, the signal simply needs to be made louder and noise reduced. Accordingly, many hearing aids focus on using wide dynamic range compression and various processing strategies to boost the signal-to-noise ratio, such as noise reduction and directional microphones. The second component of SNHL is a loss of selectivity, which results in a blurring of spectral detail, or distortion. Unfortunately, due to this second component of SNHL, simple amplification of speech does not necessarily improve the listeners' ability to discern the information conveyed in the speech.
Due to substantial research, it is now established that listeners with SNHL often have compromised access to frequency-specific information because spectral detail is often smeared, or blurred, by broadened auditory filters. Loss of sharp tuning in auditory filters generally increases with degree of sensitivity loss and is due, in part, to a loss or absence of peripheral mechanisms responsible for suppression. It has been learned that in the non-impaired cochlea different frequency components of a signal serve to suppress one another, and two-tone suppression has been cast as an instance of lateral inhibition. Consequently, spectral peaks in the internal representation for hearing-impaired (HI) listeners, as opposed to normal-hearing (NH) listeners, are less intense relative to neighboring spectral valleys, so that spectral contrast is reduced and more susceptible to noise. Not only are spectral peaks harder to resolve in noise due to reduced amplitude differences between peaks and valleys, but their internal representation is spread out over wider frequency regions (smeared), resulting in less precise frequency analysis, blurring between frequency-varying formant patterns, and ultimately in greater confusion between sounds with similar spectral shapes.
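By way of non-limiting illustration, the smearing effect described above can be sketched numerically. The following is a minimal sketch only: the rectangular averaging window is an assumed, crude stand-in for an auditory filter (it is not a model taken from the source), and the two-peak spectrum, the bin indices, and the function names `smear` and `two_peak_spectrum` are hypothetical. Under these assumptions, widening the window lowers the level difference between a peak and the valley between the peaks.

```python
import math


def smear(levels_db, half_width):
    """Smooth a dB spectrum by averaging power over +/- half_width bins.

    The rectangular window is an assumed stand-in for an auditory filter;
    a larger half_width plays the role of a broadened filter.
    """
    powers = [10 ** (level / 10) for level in levels_db]
    smeared = []
    for i in range(len(powers)):
        lo = max(0, i - half_width)
        hi = min(len(powers), i + half_width + 1)
        mean_power = sum(powers[lo:hi]) / (hi - lo)
        smeared.append(10 * math.log10(mean_power))
    return smeared


def two_peak_spectrum(n_bins=21, centers=(5, 15), floor_db=30.0,
                      peak_db=60.0, skirt=3):
    """Hypothetical spectrum: two triangular peaks (in dB) on a flat floor."""
    levels = []
    for i in range(n_bins):
        shape = max(max(0.0, 1 - abs(i - c) / skirt) for c in centers)
        levels.append(floor_db + (peak_db - floor_db) * shape)
    return levels


spectrum = two_peak_spectrum()
narrow = smear(spectrum, half_width=1)   # sharply tuned (assumed)
broad = smear(spectrum, half_width=4)    # broadened (assumed)

# Peak-to-valley contrast: level at the first peak minus level midway between peaks.
contrast_narrow = narrow[5] - narrow[10]
contrast_broad = broad[5] - broad[10]
print(round(contrast_narrow, 1), round(contrast_broad, 1))
```

The averaging is performed on power rather than on decibel values so that energy from a nearby peak spills into the valley, which is the mechanism by which the contrast shrinks in this sketch.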
Simultaneous spectral contrast is the intensity difference between peaks and valleys in the spectral shape of different speech sounds. Spectral peaks (formants) reflecting vocal tract resonances are important acoustic features that help define the identity of many speech sounds. A number of experimental techniques confirm that the internal representation of spectral contrast for steady-state speech sounds, like vowels, is reduced in HI compared to NH listeners. For example, it has been found that peaks in vowel masking patterns for HI listeners were not resolved as well as for NH listeners, and that peak frequencies in the internal representations were often shifted away from their corresponding formant frequencies.
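As one possible way to make the definition above concrete, spectral contrast may be computed from a sampled spectrum as the mean level of the local peaks minus the mean level of the local valleys. This particular averaging rule, the function name `peak_valley_contrast_db`, and the example levels are assumptions chosen for illustration and are not taken from the source.

```python
def peak_valley_contrast_db(levels_db):
    """Mean level of interior local peaks minus mean level of interior local valleys (dB).

    Returns 0.0 when the spectrum has no interior peak or no interior valley.
    """
    peaks, valleys = [], []
    for i in range(1, len(levels_db) - 1):
        left, mid, right = levels_db[i - 1], levels_db[i], levels_db[i + 1]
        if mid > left and mid > right:
            peaks.append(mid)
        elif mid < left and mid < right:
            valleys.append(mid)
    if not peaks or not valleys:
        return 0.0
    return sum(peaks) / len(peaks) - sum(valleys) / len(valleys)


# Hypothetical vowel-like spectra (levels in dB across five frequency bins).
well_resolved = [40, 60, 45, 58, 42]
flattened = [40, 52, 48, 51, 42]

print(peak_valley_contrast_db(well_resolved))  # 14.0
print(peak_valley_contrast_db(flattened))      # 3.5
```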
Decreased signal-to-noise ratios in the internal spectrum also result from auditory filters broadened by SNHL. Others found a relationship between HI listeners' estimated auditory filter bandwidths in the region of the second formant (F2) and the amount of spectral contrast needed to identify vowels in noise. These findings indicate that noise effectively reduces internal spectral contrast and that deleterious effects of noise can be offset to some extent by an increase in spectral contrast. Similarly, it has been indicated that there is a general trading relationship between spectral resolution and the amount of spectral contrast needed for vowel identification.
As stated, the primary function of hearing aids has historically been to make speech in regions of hearing loss comfortably audible. Unfortunately, in this effort, hearing aids can increase the blurring of detailed frequency information by reducing internal representations of spectral contrast in at least three ways: 1) high output levels; 2) positive spectral tilt; and 3) compression (decreased dynamic range).
First, it is well known that auditory filter tuning is level dependent. Even NH listeners experience decreased frequency selectivity at high levels needed to overcome sensitivity loss for HI listeners. In ears with SNHL, high presentation levels contribute to further reductions in frequency tuning and greater smearing of spectral detail already associated with the loss of nonlinear mechanisms.
Second, hearing aids typically provide high-frequency emphasis, or a positive spectral tilt, to compensate for increases in hearing loss with frequency. However, it has been indicated that positive spectral tilt for NH listeners actually reduces the internal representation of higher frequency formants and increases the need for greater spectral contrast. It has been hypothesized that this might occur because internal representations of some formants are characterized by ‘shoulders’ rather than peaks, that is, as a spectral ‘irregularity’ on the skirt of a more intense formant. Using an auditory filter model, it has been demonstrated that increases in spectral tilt raise the probability that a formant will be represented as a shoulder rather than a peak (similar to increases in filter bandwidth), but suppression can serve to convert (enhance) some of these shoulders into peaks. It is likely that negative effects of increased spectral tilt in NH listeners are exacerbated in HI listeners with already poor auditory filter tuning and reduced/absent mechanisms for suppression.
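The distinction between a peak and a shoulder can be illustrated with a purely geometric sketch. The example below operates on a hypothetical stimulus spectrum and is not the auditory filter model referred to above; the levels, the tilt of 3 dB per bin, and the function names `apply_tilt` and `peak_bins` are assumptions. It shows only that adding a linear tilt can turn a local maximum into a point on the rising skirt of a more intense neighbor.

```python
def apply_tilt(levels_db, db_per_bin):
    """Add a linear spectral tilt (dB per frequency bin) to a dB spectrum."""
    return [level + db_per_bin * i for i, level in enumerate(levels_db)]


def peak_bins(levels_db):
    """Return indices of interior bins that are strict local maxima."""
    return [
        i
        for i in range(1, len(levels_db) - 1)
        if levels_db[i] > levels_db[i - 1] and levels_db[i] > levels_db[i + 1]
    ]


# Hypothetical spectrum with two formant-like peaks, at bins 2 and 5.
untilted = [40, 50, 56, 54, 57, 62, 55, 45, 40]
tilted = apply_tilt(untilted, db_per_bin=3)

print(peak_bins(untilted))  # [2, 5]
print(peak_bins(tilted))    # [5]
```

After tilting, bin 2 is no longer higher than bin 3, so the weaker prominence survives only as a shoulder on the skirt of the peak at bin 5.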
Third, it has long been suspected that multichannel compression in hearing aids, which is designed to accommodate different dynamic ranges of audible speech with frequency, has the potential to reduce spectral contrast and flatten the spectrum, especially when there are many independent channels and/or high compression ratios. Notably, several studies have found that compression across many independent channels increases errors for consonants differing in place of articulation, which can be highly influenced by subtle changes in spectral shape. Some have not only reported a significant decrease in vowel identification with an increase in independent compression channels, but also found that identification and number of channels were each directly related to acoustic measures of spectral contrast.
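The flattening effect of independent channel compression can be sketched with a static input/output rule. The following is a simplified illustration under stated assumptions: the compression threshold of 40 dB, the ratio of 3:1, the channel levels, and the function names are hypothetical, and time constants (attack and release) that real devices apply are ignored.

```python
def compress_channel(level_db, threshold_db=40.0, ratio=3.0):
    """Static compression: above threshold, each `ratio` dB of input yields 1 dB of output."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio


def level_range_db(levels_db):
    """Difference between the highest and lowest channel level (dB)."""
    return max(levels_db) - min(levels_db)


# Hypothetical channel levels alternating between formant peaks and valleys.
channels_in = [65.0, 45.0, 62.0, 44.0]
channels_out = [compress_channel(level) for level in channels_in]

print(level_range_db(channels_in))             # 21.0
print(round(level_range_db(channels_out), 2))  # 7.0
```

Because every channel in this example is above threshold and is compressed independently, the peak-to-valley difference across channels is divided by the compression ratio.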
Spectral contrast is not only important for detecting differences between static spectral shapes, but also for detecting changes, which are made more subtle by coarticulation in connected speech. For example, consider the case of a formant that ends with closure silence and begins again (after closure) at a slightly higher or lower frequency. For the HI listener, there would be no perceived difference in the offset and onset frequencies, as both would be processed by the same broadened auditory filter (i.e., the change in frequency across time would be blurred). Such would not be the case for the NH listener. Instead, contrastive processes operating across time would serve to “repel” these spectral prominences, making them more distinct.
Most conventional hearing aid processing strategies are designed to increase audibility of speech information and to improve signal-to-noise ratio by manipulating relative intensities of speech and noise. Unfortunately, these processing strategies do not adequately address the challenges of listeners with mild SNHL who experience reductions in spectral contrast as a consequence of the intensity manipulations of the processing, nor the challenges of listeners with moderate to severe hearing loss who suffer from additional reductions in spectral contrast and increased distortion arising from cochlear damage and broadened auditory filters.
As with hearing aid users, the spectral blurring experienced by cochlear implant (CI) listeners is attributable to impaired cochlear/neural functioning and to device processing that is necessary to accommodate the impairment. Severe amplitude compression is needed to fit the relatively large dynamic range of speech (about 50 dB, including the effects of vocal effort) into a restricted dynamic range of electrical stimulation (often, 5-15 dB). Furthermore, a limited number of useable electrodes (typically, between 6 and 22) are available to CI listeners, who most often cannot take full advantage of even this limited spectral information provided by their electrode arrays. This is demonstrated by speech tests in quiet and in noise and by tests measuring discrimination of spectral ripples, where performance as a function of number of active electrodes asymptoted at 4-7, even though the CI listeners could use a greater number in isolation for simple pitch and level discriminations. Thus, the effective number of channels for spectrally rich sounds like speech is less than the number of active electrodes.
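The severity of the compression implied by the figures above can be estimated with simple arithmetic. The sketch below assumes a mapping that is linear in decibels, which is an assumption made for illustration only (actual devices may use other mapping functions), and the function names and the 6 dB example contrast are hypothetical.

```python
def implied_compression_ratio(acoustic_range_db, electrical_range_db):
    """Ratio needed to fit the acoustic range into the electrical range (linear-in-dB assumption)."""
    return acoustic_range_db / electrical_range_db


def contrast_after_mapping(contrast_db, acoustic_range_db, electrical_range_db):
    """Spectral contrast remaining after the assumed linear-in-dB mapping."""
    return contrast_db * electrical_range_db / acoustic_range_db


# About 50 dB of speech mapped into 5-15 dB of electrical stimulation.
print(implied_compression_ratio(50.0, 5.0))             # 10.0
print(round(implied_compression_ratio(50.0, 15.0), 2))  # 3.33

# A hypothetical 6 dB peak-to-valley contrast mapped into a 10 dB electrical range.
print(round(contrast_after_mapping(6.0, 50.0, 10.0), 2))  # 1.2
```

Under this assumption, the stated ranges correspond to compression ratios between roughly 3.3:1 and 10:1.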
Limited use of available spectral detail in patterns of stimulation from the CI processor is likely due to the reduced specificity of stimulation attributable to current spread, and to decreased survival and function of spiral ganglion cells. Consequently, compared to NH listeners, CI listeners need, for example, at least 4-6 dB greater spectral contrast for vowel identification in quiet and need even greater signal-to-noise ratios (SNRs) for speech in noise. Tests using NH listeners with simulated CI processing (vocoded speech) indicate that, while as few as 8-12 channels might be sufficient for very good speech understanding in quiet, as many as 20 might be needed to adequately understand speech in contexts known to be exceptionally challenging for CI listeners, particularly competing background noise, multiple talkers, and low linguistic redundancy. As with hearing aid users, transient burst onsets and rapid formant frequency changes that distinguish consonants differing in place of articulation are most troublesome for CI listeners.
To aid speech understanding in noise, some devices include noise reduction schemes and directional microphones. CI coding strategies, such as the spectral peak (SPEAK) coding strategy, analyze incoming speech with a bank of filters (e.g., 20) and use the outputs from a small number of them (e.g., 6) to stimulate corresponding places on the electrode array. CI listeners largely rely on relative differences in across-channel amplitudes to detect formant frequency information, and this is especially problematic when there is competing noise or a small number of effective channels. Furthermore, because nonlinear processes are abolished either by the impairment itself or by placement of the electrode array, natural spectral enhancement is also lost.
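The channel selection step described above can be sketched as follows. The sketch assumes that the channels chosen are those with the largest filter outputs in a given analysis frame, which the text above does not state explicitly; the function name `select_channels` and the envelope values are hypothetical, and no claim is made that this reproduces any particular device.

```python
def select_channels(envelopes, n_selected):
    """Return the indices of the n_selected largest filter outputs, in channel order."""
    ranked = sorted(range(len(envelopes)), key=lambda i: envelopes[i], reverse=True)
    return sorted(ranked[:n_selected])


# Hypothetical envelope magnitudes from a 20-filter bank for one analysis frame.
frame = [3, 12, 25, 40, 22, 9, 5, 7, 18, 33,
         37, 15, 6, 4, 8, 20, 28, 11, 2, 1]

selected = select_channels(frame, n_selected=6)
print(selected)  # [2, 3, 4, 9, 10, 16]
```

Only the selected channels would drive their corresponding electrode places for that frame, so formant information must be inferred from the relative amplitudes of a handful of channels.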
Thus, systems and methods for speech processing and recognition, and systems and methods for manipulating audio signals including speech to improve the understanding of HI and CI listeners, must balance a wide variety of variables and unknowns, and there remains a long-standing need for improvement.