This invention pertains generally to the field of audio signal processing and particularly to hearing aids and speech recognition.
Individuals with normal hearing are able to perceive speech in the face of extreme context-sensitivity resulting from coarticulation. The ability of listeners to recover speech information, despite dramatic articulatory and acoustic assimilation, is remarkable and central to understanding speech perception. The degree to which listeners perceptually accommodate articulatory constraints has often encouraged perceptual theories that assume relatively detailed reference to articulatory acts themselves, either with respect to general theoretical commitments or with appeal to specialized speech perception processes unique to humans and to vocal tracts. In each case, correspondences between perception and production are typically taken as evidence of perception of articulatory acts per se. Some approaches appeal to tacit knowledge of coarticulatory acts or their acoustic consequences, and such knowledge-based processes can be viewed as more (e.g., Repp, B. H., "Phonetic Trading Relations and Context Effects: New Evidence for a Speech Mode of Perception," Psychological Bulletin, Vol. 92, 1982, pp. 81-110) or less (e.g., Diehl, R. L. and Kluender, K. R., "On Categorization of Speech Sounds," Stevan Harnad (Ed.), Categorical Perception, Oxford University Press, 1987, pp. 226-253) specific to speech.
Lack of invariance in the relation between fundamental linguistic units (phonemes) and attributes of the acoustic signal poses a central problem in understanding the nature of speech perception. The basic problem is that there seem to exist few or no unitary attributes in the acoustic signal that uniquely specify particular phonemes. The prime culprit for this state of affairs is coarticulation of speech sounds. Coarticulation refers to the spatial and temporal overlap of adjacent articulatory activities. This is reflected in the acoustic signal by severe context-dependence; acoustic information specifying one phoneme varies substantially depending on surrounding phonemes. One of the more widely described cases of such context dependence concerns the realization of the phonemes /d/ and /g/ as a function of a preceding liquid (Mann, V. A., "Influence of Preceding Liquid in Stop-Consonant Perception," Perception and Psychophysics, Vol. 28, 1980, pp. 407-412) or fricative (Mann, V. A. and Repp, B. H., "Influence of Preceding Fricative on Stop Consonant Perception," Journal of the Acoustical Society of America, Vol. 69, 1981, pp. 548-558). Perception of /d/, as contrasted with perception of /g/, is largely signaled by the onset frequency and frequency trajectory of the third formant (F3). In the context of a following /a/, a higher F3 onset encourages perception of /da/ while a lower onset results in perception of /ga/. The onset frequency of the F3 transition also can vary as a function of the preceding consonant. For example, F3-onset frequency for /da/ is higher following /al/ in /alda/ than following /ar/ in /arda/. The offset frequency of F3 is higher for /al/, owing to a more forward place of articulation, and lower for /ar/. Perception of /da/ and /ga/ has been shown to be affected by the composition of preceding acoustic information in a fashion that accommodates these patterns in production.
For a series of synthesized consonant-vowel syllables (CVs) varying in onset characteristics of the third formant (F3) and varying perceptually from /da/ to /ga/, subjects are more likely to perceive /da/ when preceded by the syllable /ar/, and to perceive /ga/ when preceded by /al/ (Mann, V. A., "Influence of Preceding Liquid in Stop-Consonant Perception," Perception and Psychophysics, Vol. 28, 1980, pp. 407-412). In subsequent studies, the effect has been found for speakers of Japanese who cannot distinguish between /l/ and /r/ (Mann, V. A., "Distinguishing Universal and Language-Dependent Levels of Speech Perception: Evidence from Japanese Listeners' Perception of English 'l' and 'r'," Cognition, Vol. 24, 1986, pp. 169-196) and for prelinguistic infants (Fowler, C. A., Best, C. T. and McRoberts, G. W., "Young Infants' Perception of Liquid Coarticulatory Influences on Following Stop Consonants," Perception and Psychophysics, Vol. 48, 1990, pp. 559-570). The important point is that, for the very same stimulus with F3 onset intermediate between /da/ and /ga/, the percept is altered as a function of preceding context. Listeners perceive speech in a manner that suggests sensitivity to the compromise between production of neighboring phonetic units.
Different theoretical perspectives provide alternative accounts for how acoustic effects of coarticulation are disambiguated in perception. One approach has been to search harder for invariant attributes in the signal that correspond to phonetic features, and hence phonemes (e.g., Stevens, K. N. and Blumstein, S. E., "The Search for Invariant Acoustic Correlates of Phonetic Features," P. D. Eimas and J. L. Miller (Eds.), Perspectives in the Study of Speech, Hillsdale, N.J.: Erlbaum, 1981). To date, this approach has yielded mixed results, with more recent efforts being directed to relatively modest features of the acoustic signal that seem to have slim prospects for survival under the noisy conditions typical of speech communication. Further, it is unlikely that invariants exist to explain the aforementioned perceptual phenomenon, given that the exact same acoustic information is perceived differently within different contexts. Another tack can be found in Motor Theory (e.g., Liberman, A. M. and Mattingly, I. G., "The Motor Theory of Speech Perception Revisited," Cognition, Vol. 21, 1985, pp. 1-36), which holds that phonetic perception is the perception of speech gestures and that processes specific to humans recover gestural invariants not apparent in the acoustic signal. Because the lack of invariance in the acoustic signal is the consequence of variability in articulator movements, later versions of this theory suggest that it is intended gestures which are detected.
A third approach is that of Direct Realism (e.g., Fowler, C. A., "An Event Approach to the Study of Speech Perception from a Direct-Realist Perspective," Journal of Phonetics, Vol. 14, 1986, pp. 3-28). Direct Realism is a general theory for all senses holding that perception is an act by which properties of the physical world that are significant to a perceiver, "distal events," are directly recovered without intermediate construction. For speech perception, distal events are held to be linguistically relevant articulations of the vocal tract. In terms of what one desires in a broad theoretical framework, Direct Realism may be the most general, elegant, and internally consistent theory. Perhaps the most critical concern with regard to this approach, however, is that one must be able to solve the "inverse problem." In order to recover a unique distal event in any modality, the perceiver has only the physical energy available to sensory receptors. Independent of classic concerns regarding the extent to which one should view this source of information as rich or impoverished, what must be true is that there is sufficient information to successfully make the inverse transformation to a unique distal event. This requires the existence of some sort of invariant in the signal, perhaps an invariant specified as a function of time. In the absence of an invariant, the best one can do is define some set of possible distal events. Physical acoustic invariants signaling phonemes have not been easy to come by, and Fowler, C. A., "Invariants, Specifiers, Cues: An Investigation of Locus Equations as Information for Place of Articulation," Perception and Psychophysics, Vol. 55, 1994, pp. 597-610, has provided evidence that one recent candidate, locus equations (e.g., Sussman, H., "Neural Coding of Relational Invariance in Speech: Human Language Analogs to the Barn Owl," Psychological Review, Vol. 96, 1989, pp. 631-642 and Sussman, H., "The Representation of Stop Consonants in Three-Dimensional Acoustic Space," Phonetica, Vol. 48, 1991, pp. 18-31), does not provide an invariant for place of articulation. Recovery of articulatory movement from speech acoustics has proven quite difficult.
There has been a good deal of effort made to recover articulatory gestures from the physical acoustic waveform. Often undertaken as part of an effort to build speech-recognition machines, these efforts are founded on the hope that greater success at overcoming the problem of lack of invariance may be found through specification of articulatory sources. In general, the history of these efforts can be summarized in the following manner (for review see McGowan, R. S., "Recovering Articulatory Movement from Formant Frequency Trajectories Using Task Dynamics and a Genetic Algorithm: Preliminary Model Tests," Speech Communication, Vol. 14, 1994, pp. 19-48; Schroeter, J. and Sondhi, M. M., "Speech Coding Based on Physiological Models of Speech Production," S. Furui and M. M. Sondhi (Eds.), Advances in Speech Signal Processing, New York: Marcel Dekker, Inc., 1992). Early efforts attempting to use limited acoustic information, such as the first three formant frequencies, to derive the area function of the vocal tract were not successful because multiple area functions could be specified by the same waveform. More recent efforts have been more successful to the extent that they incorporated more specific constraints on the nature of the vocal tract together with dynamic and kinematic information. The marriage of these two sources of information is critical. Kinematics alone do not help to recover articulatory acts, i.e., solve for the inverse. This is because, if one begins with a large or infinite set of potential sound sources at time t1, introducing a second large set of potential sources at t2 does little or nothing in the way of narrowing the set of possible sources, let alone permitting specification of a single distal event.
To the extent that more recent efforts to recover articulatory movement from acoustics have been successful, they have succeeded by virtue of introducing detailed speech-specific constraints on the nature of transformations that can be made as a function of time.
McGowan, R. S., "Recovering Articulatory Movement from Formant Frequency Trajectories Using Task Dynamics and a Genetic Algorithm: Preliminary Model Tests," Speech Communication, Vol. 14, 1994, pp. 19-48, used a task dynamic model (Saltzman, E., "Task-Dynamic Coordination of the Speech Articulators: A Preliminary Model," Experimental Brain Research, Vol. 15, 1986, pp. 129-144; Saltzman, E. L. and Kelso, J. A. S., "Skilled Actions: A Task Dynamic Approach," Psychological Review, Vol. 94, 1987, pp. 84-106) driving six vocal tract variables with transformations between tract variables and articulators derived from an articulatory model (Mermelstein, P., "Articulatory Model for the Study of Speech Production," Journal of the Acoustical Society of America, Vol. 53, 1973, pp. 1070-1082). McGowan, R. S. and Rubin, P. E., "Perceptual Evaluation of Articulatory Movement Recovered from Acoustic Data," Journal of the Acoustical Society of America, Vol. 96 (5 pt. 2), 1994, p. 3328, exploited a genetic learning algorithm to discover relations between task-dynamic parameters and speech acoustics for six utterances by a single talker. Results were somewhat mixed in that, while the model got many things right, errors persisted, and McGowan, R. S., "Recovering Articulatory Movement from Formant Frequency Trajectories Using Task Dynamics and a Genetic Algorithm: Preliminary Model Tests," Speech Communication, Vol. 14, 1994, pp. 19-48, notes that future applications likely require customization of the model to individual talkers. Related efforts continue to be productive (see, e.g., Schroeter, J. and Sondhi, M. M., "Speech Coding Based on Physiological Models of Speech Production," S. Furui and M. M. Sondhi (Eds.), Advances in Speech Signal Processing, New York: Marcel Dekker, Inc., 1992), but one point is becoming increasingly clear.
The extent to which these attempts to solve the inverse problem are successful seems to depend critically upon models embodying highly realistic details of sound production specific to human vocal tracts, and often to a single human vocal tract. Although some of the efforts to recover vocal tract movements from the acoustic signal have been motivated by the desire for effective machine speech recognition, thus far these attempts have been less successful than straightforward engineering approaches that exploit powerful computers and algorithms to search through hundreds of thousands of templates. Notably, successful template approaches require practice for adjustment to individual talkers.
As noted above, perception of syllable-initial /d/ and /g/ can be influenced by the composition of preceding acoustic information such that, for a series of synthesized consonant-vowel syllables (CVs) varying in onset characteristics of F3 and varying perceptually from /da/ to /ga/, subjects are more likely to perceive /da/ when preceded by the syllable /ar/, and to perceive /ga/ when preceded by /al/. The received interpretation of findings that perceptual performance corresponds with acoustic consequences of producing /da/ and /ga/ following /ar/ and /al/ has been that listeners are somehow sensitive to articulatory implementation. Several experiments have been conducted to assess the degree to which these perceptual effects are specific to qualities of articulatory sources, and whether a simple general process such as perceptual contrast may play a significant role.
Mann, V. A., "Influence of Preceding Liquid in Stop-Consonant Perception," Perception and Psychophysics, Vol. 28, 1980, pp. 407-412, concluded that the perceptual effect results from a mechanism specialized to compensate for vocal tract constraints through the use of "tacit reference to the dynamics of speech production." Four experiments were conducted to test the plausibility of general auditory processes in accounting for these effects, and each is described in greater detail in Lotto, A. J. and Kluender, K. R., "General Contrast Effects in Speech Perception: Effect of Preceding Liquid on Stop Consonant Identification," Perception and Psychophysics, Vol. 60, 1998, pp. 602-619. In three experiments, series of CV stimuli varying in F3-onset frequency (/da-ga/) were preceded by speech versions or nonspeech analogues of /al/ and /ar/. The effect of liquid identity on stop-consonant labeling was maintained when the preceding VC was produced by a female speaker and the CV was modeled after a male speaker's production. Labeling boundaries also shifted when the CV was preceded by a sine-wave glide modeled after F3 characteristics of /al/ and /ar/. This effect was maintained even when the preceding sine wave was of constant frequency equal to the offset frequency of F3 from a natural production. Finally, four Japanese quail (Coturnix coturnix japonica) were used to test further the generality of this effect (Lotto, A. J., Kluender, K. R. and Holt, L. L., "Perceptual Compensation for Coarticulation by Japanese Quail (Coturnix coturnix japonica)," Journal of the Acoustical Society of America, Vol. 102, 1997, pp. 1134-1140). Birds were trained by operant procedures to peck a lighted key when presented with either the syllable /da/ or /ga/ and to refrain from pecking it when presented with the alternative syllable (/ga/ or /da/).
The birds were then presented with test disyllables consisting of the synthesized /al/ or /ar/ followed by one of the ambiguous intermediate members of the /da-ga/ series. Avian responses to these novel intermediate test stimuli indicate an effect of the preceding syllable like that for human listeners, such that 'labeling' shifted to more /ga/ responses following /al/ and more /da/ responses following /ar/. In all of these findings, when the energy preceding the consonant is of higher frequency (/al/, FM glide, pure tone), the percept more often corresponds to the consonant with the lower frequency F3 (/ga/). This suggests that spectral contrast plays an important role.
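The direction of the contrast effect summarized above can be illustrated with a toy labeling model. This is only a sketch under stated assumptions: the function name `p_ga`, the logistic form, and every numeric value (category boundary, slope, contrast weight, precursor frequencies) are hypothetical choices made for illustration and are not taken from the cited experiments.

```python
import math

def p_ga(f3_onset_hz, precursor_hz=None, boundary_hz=2400.0,
         slope_hz=100.0, contrast=0.2):
    """Toy probability of a /ga/ label for a CV with the given F3 onset.

    All defaults are illustrative assumptions, not measured values.
    """
    effective = f3_onset_hz
    if precursor_hz is not None:
        # Spectral contrast: the effective onset is pushed away from the
        # frequency of the preceding energy.
        effective += contrast * (f3_onset_hz - precursor_hz)
    # A lower effective F3 onset yields more /ga/ responses.
    return 1.0 / (1.0 + math.exp((effective - boundary_hz) / slope_hz))

ambiguous = 2400.0                                # hypothetical mid-series onset
isolated = p_ga(ambiguous)                        # no precursor
after_al = p_ga(ambiguous, precursor_hz=2800.0)   # higher F3 offset, as for /al/
after_ar = p_ga(ambiguous, precursor_hz=1900.0)   # lower F3 offset, as for /ar/
print(round(isolated, 2), round(after_al, 2), round(after_ar, 2))
```

With these assumed values the same ambiguous stimulus is labeled /ga/ more often after the higher-frequency precursor and less often after the lower-frequency one, which is the qualitative pattern reported for speech, sine-wave, and avian data alike.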
Coarticulation with consonants also can exert a powerful influence on the acoustic realization of vowels. Isolated vowels are extremely rare or nonexistent in fluent speech, and many studies have addressed the far more typical cases of vowels produced within consonantal contexts. Lindblom, B. E. F., "Spectrographic Study of Vowel Reduction," Journal of the Acoustical Society of America, Vol. 35, 1963, pp. 1773-1781, conducted spectrographic measurements of naturally produced CVCs and found that, relative to formant-frequency values for vowels produced in isolation, formant values toward the centers of the CVCs were lower when consonants were labial and higher when consonants were palato-alveolar. Lindblom, B. E. F. and Studdert-Kennedy, M., "On the Role of Formant Transitions in Vowel Recognition," Journal of the Acoustical Society of America, Vol. 42, 1967, pp. 830-843, investigated the role of consonant-vowel transitions for perception of vowels in CVCs. They synthesized three series of 240-ms duration CVC stimuli with vowels varying from /U/ to /I/. One series consisted of steady-state vowels. The other two series had continuously varying formant frequencies appropriate for /wVw/ and for /jVj/. The /wUw-wIw/ series of syllables began and ended with lower F2 and F3 frequencies, and the /jVj/ series began and ended with higher F2 and F3 frequencies. More vowels were perceived as /I/ in the /wVw/ context and fewer as /I/ in the /jVj/ context, as would be predicted if perception of vowels were complementary to observed regularities in production. Much later, Nearey, T. M., "Static, Dynamic, and Relational Properties in Vowel Perception," Journal of the Acoustical Society of America, Vol. 85, 1989, pp. 2088-2113, extended the positive findings to /dVd/ and /bVb/ syllables with vowel sounds ranging from /o/-/Λ/ and /Λ/-/ε/. Again, contrast plays a role.
When preceding energy is of higher F2 frequency (/d/), the following vowel is more likely to be perceived as a lower frequency vowel.
There are a large number of experimental precedents in the psychoacoustics literature for spectral contrast effects such as those found for coarticulated speech sounds. Most often, these effects have been described as "auditory enhancement." Summerfield and his colleagues (Summerfield, Q., Haggard, M. P., Foster, J., and Gray, S., "Perceiving vowels from uniform spectra: Phonetic exploration of an auditory aftereffect," Perception and Psychophysics, Vol. 35, 1984, pp. 203-213) showed that, when a uniform harmonic spectrum is preceded by a spectrum that is complementary to a particular vowel, with troughs replacing peaks and vice versa, listeners reported hearing a vowel during presentation of the uniform spectrum. In addition, a precursor uniform harmonic spectrum enhances vowel percepts when the vowel is defined by a harmonic spectrum with only very modest spectral peaks (2-5 dB) (Summerfield, Q., Sidwell, A., and Nelson, T., "Auditory enhancement of changes in spectral amplitude," Journal of the Acoustical Society of America, Vol. 81, 1987, pp. 700-707). One can describe all of these effects in terms of perception being predicated on spectral contrast between two complex sounds.
Perceiving vowel sounds in uniform spectra (following appropriate complementary spectral patterns) has a well-known precedent in psychoacoustics. If just one member of a set of harmonics of equal amplitude is omitted from a harmonic series and is then reintroduced, it stands out perceptually against the background of the pre-existing harmonics (Green, D. M., McKey, M. J., and Licklider, J. C. R., "Detection of a pulsed sinusoid in noise as a function of frequency," Journal of the Acoustical Society of America, Vol. 31, 1959, pp. 1146-1152; Cardozo, B. L., "Ohm's Law and masking," Institute for Perception Research Annual Progress Report, Vol. 2, 1967, pp. 59-64; Viemeister, N. F., "Adaptation of masking," G. van den Brink and F. A. Bilsen (Eds.), Psychophysical, Physiological, and Behavioral Studies in Hearing, Delft University Press, 1980, pp. 190-197; Houtgast, T., "Psychophysical evidence for lateral inhibition in hearing," Journal of the Acoustical Society of America, Vol. 51, 1972, pp. 1885-1894). Viemeister (Viemeister, N. F., "Adaptation of masking," G. van den Brink and F. A. Bilsen (Eds.), Psychophysical, Physiological, and Behavioral Studies in Hearing, Delft University Press, 1980, pp. 190-197) demonstrated that the threshold for detecting a tone in a harmonic complex is 10-12 dB lower when the incomplete harmonic complex (missing the target tone) is continuous than when the onset of the incomplete complex is the same as that of the target tone. This was referred to as an "enhancement effect." McFadden and Wright (McFadden, D., and Wright, B. A., "Temporal decline of masking and comodulation detection differences," Journal of the Acoustical Society of America, Vol. 88, 1990, pp. 711-724) investigated comodulation detection differences using flanking bands that were gated either simultaneously with the signal band or at varying times prior to signal onset. They found that signal detectability improved by as much as 25 dB when flanking 100-Hz bands of noise preceded the signal by durations of 5 to 700 ms. All these results are consonant with the findings described above concerning effects of preceding formants on perception of vowels in CVC syllables. Enhancement effects also operate across silent intervals like those commonly observed corresponding to vocal-tract closure in the cases of /alda-alga/ and /arda-arga/. McFadden and Wright (McFadden, D., and Wright, B. A., "Temporal decline of masking and comodulation detection differences," Journal of the Acoustical Society of America, Vol. 88, 1990, pp. 711-724) found that, for flanking bands preceding signal presentation, a silent interval as long as 355 ms between the flanking bands and the flanking bands plus signal was insufficient to fully attenuate the enhancing effects of spectral energy away from the signal to be detected. Enhancement effects are thus maintained across silent intervals at least as long as those encountered in connected speech.
There are several potential explanations for these effects. Summerfield (Summerfield, Q., Haggard, M. P., Foster, J., and Gray, S., "Perceiving vowels from uniform spectra: Phonetic exploration of an auditory aftereffect," Perception and Psychophysics, Vol. 35, 1984, pp. 203-213; Summerfield, Q., Sidwell, A. and Nelson, T., "Auditory enhancement of changes in spectral amplitude," Journal of the Acoustical Society of America, Vol. 81, 1987, pp. 700-707) suggested that the effect may be rooted in peripheral sensory adaptation. However, Viemeister and Bacon (Viemeister, N. F., and Bacon, S. P., Journal of the Acoustical Society of America, Vol. 71, 1982, pp. 1502-1507) showed that not only was an "enhanced" target tone more detectable, but the tone also served as a more effective masker of a following tone. They suggested that suppression must be included in an adaptation scenario to bring it into closer accord with this finding. Different frequency components of a signal serve to suppress one another, and Viemeister and Bacon suggested that non-signal channels are adapted such that their ability to suppress the signal is attenuated. This explanation is consistent with studies of two-tone suppression, which has been cast as an instance of lateral inhibition in hearing (Houtgast, T., "Psychophysical evidence for lateral inhibition in hearing," Journal of the Acoustical Society of America, Vol. 51, 1972, pp. 1885-1894). Investigators have argued that suppression helps to provide sharp tuning (e.g., Wightman, F., McKee, T., and Kramer, M., "Factors influencing frequency selectivity in normal and hearing-impaired listeners," E. F. Evans and J. P. Wilson (Eds.), Psychophysics and Physiology of Hearing, Academic Press, 1977, pp. 295-310; Festen, J. M., and Plomp, R., "Relations between auditory functions in normal hearing," Journal of the Acoustical Society of America, Vol. 70, 1981, pp. 356-369), and with respect to speech perception, Houtgast (Houtgast, T., "Auditory analysis of vowel-like sounds," Acustica, Vol. 31, 1974, pp. 320-324) has argued that this process serves to sharpen the neural projections of a vowel spectrum in a fashion that effectively provides formant extraction. Summerfield (Summerfield, Q., Haggard, M. P., Foster, J., and Gray, S., "Perceiving vowels from uniform spectra: Phonetic exploration of an auditory aftereffect," Perception and Psychophysics, Vol. 35, 1984, pp. 203-213; Summerfield, Q., Sidwell, A., and Nelson, T., "Auditory enhancement of changes in spectral amplitude," Journal of the Acoustical Society of America, Vol. 81, 1987, pp. 700-707) suggests that either simple adaptation or adaptation of suppression could serve to enhance changes in spectral regions where previously there has been relatively little energy.
There also exist several neurophysiological observations that bear upon enhancement effects. In particular, a number of neurophysiological studies of auditory nerve (AN) recordings (e.g., Smith, R. L. and Zwislocki, J. J., "Responses of Some Neurons of the Cochlear Nucleus to Tone-Intensity Increments," Journal of the Acoustical Society of America, Vol. 50, 1971, pp. 1520-1525; Smith, R. L., "Adaptation, Saturation, and Physiological Masking in Single Auditory-Nerve Fibers," Journal of the Acoustical Society of America, Vol. 65, 1979, pp. 166-178; Smith, R. L., et al., "Sensitivity of Auditory-Nerve Fibers to Changes in Intensity: A Dichotomy Between Decrements and Increments," Journal of the Acoustical Society of America, Vol. 78, 1985, pp. 1310-1316) strongly imply a role for peripheral adaptation. More recently, Delgutte, B., et al., "Neural Encoding of Temporal Envelope and Temporal Interactions in Speech," W. Ainsworth and S. Greenberg (Eds.), Auditory Basis of Speech Perception, pp. 1-9, European Speech Communication Association, 1996 (see also Delgutte, B., "Representation of Speech-Like Sounds in the Discharge Patterns of Auditory Nerve Fibers," Journal of the Acoustical Society of America, Vol. 68, 1980, pp. 843-857; Delgutte, B., "Analysis of French Stop Consonants with a Model of the Peripheral Auditory System," J. S. Perkell and D. H. Klatt (Eds.), Invariance and Variability of Speech Processes, pp. 131-177, Erlbaum: Hillsdale, N.J., 1986; Delgutte, B., "Auditory Neural Processing of Speech," W. J. Hardcastle and J. Laver (Eds.), The Handbook of Phonetic Sciences, Oxford: Blackwell, 1996, pp. 507-538; and Delgutte, B. and Kiang, N. Y. S., "Speech Coding in the Auditory Nerve IV: Sounds with Consonant-Like Dynamic Characteristics," Journal of the Acoustical Society of America, Vol. 75, 1984, pp. 897-907), have established the case for a much broader role of peripheral adaptation in perception of speech. Delgutte notes that peaks in AN discharge rate correspond to spectro-temporal regions that are rich in phonetic information, and that adaptation increases the resolution with which onsets are represented. This role of adaptation in encoding onset information is consistent with earlier observations noted above. Perhaps most important to questions addressed in this application, Delgutte notes neurophysiological evidence that "adaptation enhances spectral contrast between successive speech segments." This enhancement arises because a fiber adapted by stimulus components close to its characteristic frequency (CF) is relatively less responsive to subsequent energy at that frequency, while stimulus components not present immediately prior are encoded by fibers that are unadapted. This is essentially the same process offered by psychoacousticians, but now grounded in physiology. Delgutte also notes that adaptation takes place on many timescales. In general, adaptation effects are sustained longer at progressively higher levels of the auditory system. Some of the temporally extended psychoacoustic effects described above may therefore be less likely to have a very peripheral (auditory nerve) origin. Most recently, Scutt, M. J., et al., "Psychophysical and Physiological Responses to Signals Which are Enhanced by Temporal Context," Abstracts of the 20th Midwinter Meeting of the Association for Research in Otolaryngology, 1997, p. 188, report evidence of enhancement in the cochlear nucleus consistent with adaptation of inhibition (suppression); however, the time course at that level appears too short to accommodate the full range of psychophysical findings.
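The adaptation account described above can be sketched as a bank of independent frequency channels, each with its own adaptation state. The following is a deliberately simplified illustration, not a physiological model: the function name `adapt_response`, the multiplicative form, and the `rate`, `recovery`, and ceiling values are all assumptions chosen only to show the qualitative behavior.

```python
def adapt_response(frames, rate=0.5, recovery=0.1, ceiling=0.9):
    """Return per-channel responses for a sequence of spectral frames.

    frames: list of equal-length lists of channel levels in [0, 1].
    Each channel carries an adaptation state in [0, ceiling]; the
    parameter values are illustrative assumptions.
    """
    state = [0.0] * len(frames[0])
    responses = []
    for frame in frames:
        # An adapted channel responds less to energy at its own frequency.
        resp = [x * (1.0 - s) for x, s in zip(frame, state)]
        responses.append(resp)
        # Adaptation builds with the driven response and otherwise recovers.
        state = [min(ceiling, s * (1.0 - recovery) + rate * r)
                 for s, r in zip(state, resp)]
    return responses

# Precursor with a trough in channel 2, followed by a uniform spectrum.
precursor = [[1.0, 1.0, 0.0, 1.0, 1.0]] * 3
uniform = [[1.0, 1.0, 1.0, 1.0, 1.0]]
responses = adapt_response(precursor + uniform)
print([round(r, 2) for r in responses[-1]])
```

In this sketch the response to the physically uniform final frame has a peak in the channel that was silent during the precursor, because only that channel is unadapted. This mirrors, in a very reduced form, the reports of vowel percepts in uniform spectra following complementary precursors.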
Taken together, these precedents suggest that simple adaptation and/or adaptation of suppression provide an appealing explanation for results from the experiments described above. With respect to peripheral sensory adaptation/suppression being a potential candidate for explaining the perceptual contrast effects found thus far, there is one piece of potentially contradictory data. Mann, V. A. and Liberman, A. M., "Some Differences Between Phonetic and Auditory Modes of Perception," Cognition, Vol. 14, 1983, pp. 211-235, found that, when only F3 transitions from a series of stimuli ranging from /da/ to /ga/ were presented to one ear with the rest of the stimulus complex presented to the other ear in a discrimination task, discrimination peaks shifted depending upon whether /al/ or /ar/ was presented as the first syllable. Based on this effect of information from the contralateral ear, Mann and Liberman argued that peripheral auditory explanations must be ruled out. One problem with this interpretation is that F2 offsets for /ar/ syllables were of higher frequency than F2 offsets for /al/ syllables. It is known that identification of /da-ga/ syllables is affected by the onset frequency of F2, with higher F2 favoring /ga/ percepts (Delattre, P. C., et al., "Acoustic Loci and Transitional Cues for Consonants," Journal of the Acoustical Society of America, Vol. 27, 1955, pp. 769-773). This being the case, monaural frequency contrast of F2 would predict exactly the pattern of response observed: more /ga/ (high F2) responses following /al/ (low F2). Because energy for F2 for both syllables was delivered to the same ear, these results cannot rule out a monaural peripheral explanation. In addition, it was already noted that temporally extended enhancement effects are likely to have a neurophysiological origin beyond the AN.
Because substantial contralateral connections converge at the inferior colliculus (and superior olive), only two synapses away from the hair cell, one must be cautious in drawing conclusions from dichotic studies about the level of the auditory system at which some process occurs.
What can be concluded is that there is substantial evidence from many sources suggesting how adaptation and suppression can support perceptual contrast (enhancement). Beyond the efforts of Summerfield and his colleagues, however, very little has been made of this ubiquitous effect as reflected in perception of speech. Furthermore, if that understanding can be exploited by devices that improve communication for persons with hearing impairment, perceptual contrast need not provide a complete account in order to provide a very useful component. The approach is to exploit simple contrastive processes through signal processing in a fashion that expands the perceptual space, making adjacent speech sounds more perceptually distinctive. Because coarticulation is always assimilatory, no matter what the phonetic distinction, contrast will always serve to undo such assimilation. Concretely, if the coarticulatory (assimilative) effect of a preceding vowel /u/ is to make a /g/ more /b/-like (lower F2, less distinct from /b/), then contrast will serve to make /g/ perceptually less /b/-like and more like a modal /g/. One also can consider the converse case for /b/ following /i/. Overall, contrast always serves to perceptually “drive sounds away” from their neighbors (in this case along the F2 dimension) following assimilative effects of preceding speech sounds. If this process can be enhanced through hearing aids, perception may be improved in ways not possible with typical amplification strategies.
Cochlear hearing impairment is associated with reduced frequency selectivity and with loudness recruitment. These two factors are not independent. Elevated thresholds for hearing-impaired listeners result in limited dynamic range. Once amplification has been introduced to make the signal suprathreshold, the system is in a compressive state, leading to “spectral smearing” (Moore, B. C. J., et al., “Simulations of the Effect of Hearing Impairment on Speech Perception,” W. Ainsworth and S. Greenburg, Auditory Basis of Speech Perception, European Speech Communication Association, pp. 1-9, 1996). The consequences of this deficiency in spectral definition seem to be more severe for some aspects of the speech signal than for others. As might be expected, for example, amplitude envelope shapes suffer least when audibility is improved with amplification, probably owing to the ability to encode such information in temporal firing patterns irrespective of spectral detail. By contrast, most types of spectral information are perceived poorly even when audibility is provided (e.g., Revoile, S. G., et al., “Spectral Cues to Perception of /d,n,l/ by Normal and Impaired-Hearing Listeners,” Journal of the Acoustical Society of America, Vol. 90, pp. 787-793, 1991; Summers, V. and Leek, M. R., “Frequency Glide Discrimination in the F2 Region by Normal and Hearing-Impaired Listeners,” Journal of the Acoustical Society of America, Vol. 97, pp. 3825-3832, 1995; Turner, C. W., et al., “Formant Transition Duration and Speech Recognition in Normal and Hearing-Impaired Listeners,” Journal of the Acoustical Society of America, Vol. 101, pp. 2822-2838, 1997). Additional amplification not only fails to help in these cases; further increments in amplification can even lead to decreased speech recognition (Hogan, C. and Turner, C. W., “High-Frequency Amplification: Benefits for Hearing-Impaired Listeners,” Journal of the Acoustical Society of America, Vol. 104, pp. 411-432, 1998; Rankovic, C. M., “An Application of the Articulation Index to Hearing Aid Fitting,” Journal of Speech and Hearing Research, Vol. 34, pp. 391-402, 1991).
The present invention provides a method and apparatus for enhancing an auditory signal. The present invention employs a process which enhances spectral differences between sounds in a fashion mimicking that of human auditory systems. Implementation imitates neuroprocesses of adaptation, suppression, adaptation of suppression, and descending inhibitory pathways. Thus, the present invention serves to make sounds, particularly speech sounds, more distinguishable.
In accordance with the present invention, an input auditory signal is divided into a plurality of spectral channels. This may be accomplished, for example, by applying the input auditory signal to a bank of gammatone or Quadrature Mirror Filters. An output gain for each channel is derived based on the time varying history of energy in the channel. The magnitude of the output gain thus derived is preferably inversely related to the history of energy in the channel. For example, the output gain may be derived by determining a weighted energy history of the channel, converting the weighted energy history into an RMS history weighting value, and subtracting the RMS history weighting value from unity to determine the output gain for the channel. The output gain for each channel preferably also takes into consideration the time varying history of energy in neighboring spectral channels. Thus, the output gain for each channel may preferably be derived by subtracting the ratio of the RMS history weighting value for the channel to a sum of RMS history weighting values for neighboring channels from unity to determine the output gain for the channel. The output gain thus derived is applied to the channel to form a plurality of modified spectral channel signals. The plurality of modified spectral channel signals are combined to form an enhanced output auditory signal.
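The gain derivation described above can be illustrated with a minimal Python sketch. It is not the implementation of the invention, only one reading of the steps under stated assumptions: the energy history is weighted with an exponential decay (the decay value is hypothetical), the RMS history weighting value is normalized by an assumed full-scale RMS so that it lies between 0 and 1, and the "neighboring channels" sum is taken to include the channel itself so that the ratio cannot exceed unity. The function names, the `radius` parameter, and the per-frame energy representation are all illustrative; the filter bank itself (gammatone or Quadrature Mirror Filters) is assumed to have already produced the channel signals.

```python
import math


def rms_history_weight(frame_energies, decay=0.9, full_scale_rms=1.0):
    """RMS history weighting value for one channel, clamped to [0, 1].

    frame_energies: mean-square energy per frame, oldest first.
    decay and full_scale_rms are assumptions, not values from the text.
    """
    acc = 0.0   # exponentially weighted sum of energies
    norm = 0.0  # sum of the weights, used to normalize
    for energy in frame_energies:
        acc = decay * acc + energy
        norm = decay * norm + 1.0
    if norm == 0.0:
        return 0.0
    rms = math.sqrt(acc / norm)
    return min(rms / full_scale_rms, 1.0)


def simple_gain(frame_energies, decay=0.9):
    """Gain = 1 - RMS history weighting value (own channel only)."""
    return 1.0 - rms_history_weight(frame_energies, decay)


def neighbor_gains(histories, radius=1, decay=0.9):
    """Gain per channel = 1 - w_k / sum of w_j over neighboring channels.

    Assumption: the neighborhood spans k-radius..k+radius and includes
    channel k itself, which keeps every gain within [0, 1].
    """
    weights = [rms_history_weight(h, decay) for h in histories]
    gains = []
    for k, w in enumerate(weights):
        lo = max(0, k - radius)
        hi = min(len(weights), k + radius + 1)
        total = sum(weights[lo:hi])
        # With no recent energy nearby, leave the channel unattenuated.
        gains.append(1.0 if total == 0.0 else 1.0 - w / total)
    return gains


def enhance(channel_signals, gains):
    """Apply each gain to its channel and sum channels sample by sample."""
    n_samples = len(channel_signals[0])
    return [
        sum(g * ch[i] for ch, g in zip(channel_signals, gains))
        for i in range(n_samples)
    ]
```

In this sketch a channel that has recently carried more energy than its neighbors receives a smaller gain, so energy arriving in a previously quiet channel is emphasized relative to energy that persists in the same channel. Under the self-inclusive neighborhood assumption, a channel whose neighbors have been silent is attenuated fully; a practical design would likely bound the attenuation, but the text does not specify how.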
The present invention is particularly applicable to use in electronic hearing aid devices for use by the hearing impaired, particularly for purposes of enhancing the spectrum such that impaired biological signal processing in the auditory brain stem is restored. An electronic hearing aid device incorporating the present invention may include a microphone for receiving sound and converting it into electrical signals, appropriate amplification and filtering, an analog to digital converter, a signal processor, such as a digital signal processor, implementing signal processing for enhancing the auditory signal in accordance with the present invention, a digital to analog converter, output side filters and amplifiers, and a speaker for providing the enhanced auditory signal to a wearer of the hearing aid device.
The present invention may be employed in any system wherein it is desired to make sounds, particularly speech sounds, more distinguishable. For example, the present invention may be incorporated into a computer speech recognition system. Such a system may include a microphone that converts a sound to an analog signal presented to an amplifier and filter, the output of which is provided to an analog to digital converter, which provides digital data to a signal processor, wherein processing in accordance with the present invention to enhance the auditory signal is provided. Alternatively, recorded signal data may be provided from a recording system directly to the signal processor. The output of the signal processor is provided to a speech recognition system, which itself may be implemented in a general purpose computer, with the output of the speech recognition system provided to output devices or to digital storage media.