First, the acoustic characteristics of speech are described.
FIG. 1A is a diagram showing a frequency spectrum of speech. In FIG. 1A, the horizontal axis represents frequency and the vertical axis represents amplitude. A solid line 501 in FIG. 1A shows an example of speech represented as a frequency spectrum. A speech frequency spectrum has several peaks on the frequency axis. The peak at the lowest frequency indicates the fundamental frequency of the speech, called the pitch, which differs depending on the tone of voice. In general, the peak at the lowest frequency lies between 125 Hz and 300 Hz. Voice results from a sound wave that is generated by vocal cord vibration and resonates in the vocal tract, which is the path from the pharynx to the lips. Each resonance frequency is called a formant. The formant with the lowest frequency is called the first formant, the formant with the second lowest frequency is called the second formant, and so on. To be more specific, in FIG. 1A, the peak at the lowest frequency indicates the pitch (i.e., the pitch frequency), the second peak indicates the first formant (i.e., the first formant frequency), and the third peak indicates the second formant (i.e., the second formant frequency). Generally speaking, although it depends on the gender of the speaker and on the speech uttered, the first formant frequency is in a range from 200 Hz to 1200 Hz and the second formant frequency is in a range from 800 Hz to 3000 Hz.
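As an illustration of how the peaks are read off a spectrum such as the solid line 501, the following is a minimal sketch. It assumes a toy spectrum built from three Gaussian bumps; the peak centers (150 Hz, 700 Hz, 1800 Hz), amplitudes, widths, and the function names are hypothetical values chosen to fall within the ranges stated above, not data taken from FIG. 1A.

```python
import math


def synth_spectrum(freqs_hz, peaks):
    """Toy spectrum: a sum of Gaussian bumps given as (center_hz, amplitude, width_hz)."""
    return [
        sum(a * math.exp(-0.5 * ((f - c) / w) ** 2) for c, a, w in peaks)
        for f in freqs_hz
    ]


def local_peaks(freqs_hz, amps):
    """Return the frequencies of strict local maxima, lowest frequency first."""
    return [
        freqs_hz[i]
        for i in range(1, len(amps) - 1)
        if amps[i - 1] < amps[i] > amps[i + 1]
    ]


# Hypothetical peak parameters (assumptions for illustration only).
PEAKS = [(150, 1.0, 30), (700, 0.8, 80), (1800, 0.4, 120)]

freqs = list(range(0, 4000, 10))  # 10 Hz grid from 0 Hz up to 3990 Hz
amps = synth_spectrum(freqs, PEAKS)

# Lowest peak -> pitch, second peak -> first formant, third peak -> second formant.
pitch_hz, f1_hz, f2_hz = local_peaks(freqs, amps)[:3]
print(pitch_hz, f1_hz, f2_hz)
```

A real speech spectrum is far less smooth than this toy curve, so a practical peak picker would need smoothing or an envelope estimate before the peaks could be labeled in this way.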
It is said that humans distinguish between vowels by the combination of the first and second formant frequencies. Although a consonant is identified mainly from the pattern in which the first and second formant frequencies change over time at the onset of speech, it is said that some consonants are identified from the spectral shape at frequencies higher than the second formant frequency.
In the field of auditory psychology, a phenomenon called auditory masking is known, in which a sound becomes hard to hear under the influence of another specific sound. Auditory masking includes frequency masking and temporal masking. The frequency masking occurs when a loud sound with a specific frequency component masks a sound at a nearby frequency and thus makes the sound at the nearby frequency difficult to perceive. The temporal masking occurs when a preceding sound masks a subsequent sound and thus makes the subsequent sound difficult to perceive.
The frequency masking is explained with reference to FIG. 1A. A dashed line 502 in FIG. 1A indicates a masking curve of the first formant component of speech. A listener cannot perceive a sound whose amplitude is lower than the dashed line 502. The masking curve varies from individual to individual, and the frequency width affected by the masking also varies among individuals. In the example shown in FIG. 1A, the first formant component masks the second formant component. In the case of a typical sound, the pitch component and the first formant component tend to be greater in power while the other components tend to be relatively smaller in power. On this account, when the first formant component masks the sounds in the nearby frequency bands as in the example shown in FIG. 1A, vowels may be misheard.
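The comparison against the masking curve can be sketched as follows. This is only an illustrative model: the triangular curve shape, the 10 dB offset, the slope values, and all levels and function names are assumptions made for the example, since the actual shape of the dashed line 502 is not specified here and differs among individuals.

```python
import math


def masking_threshold_db(freq_hz, masker_hz, masker_db,
                         offset_db=10.0, slope_db_per_octave=20.0):
    """Hypothetical triangular masking curve (a stand-in for the dashed line 502).

    The threshold is highest at the masker frequency and falls off by
    slope_db_per_octave for each octave of distance from the masker.
    """
    octaves = abs(math.log2(freq_hz / masker_hz))
    return masker_db - offset_db - slope_db_per_octave * octaves


def is_masked(freq_hz, level_db, masker_hz, masker_db, **curve):
    """True if the component lies below the masking curve and so is not perceived."""
    return level_db < masking_threshold_db(freq_hz, masker_hz, masker_db, **curve)


# Hypothetical values: first formant at 700 Hz and 70 dB (the masker),
# second formant one octave higher at 1400 Hz and 35 dB.
f1_hz, f1_db = 700.0, 70.0
f2_hz, f2_db = 1400.0, 35.0

# A steep curve affects a narrow frequency width; a shallow curve affects a wide one.
narrow = is_masked(f2_hz, f2_db, f1_hz, f1_db, slope_db_per_octave=30.0)
wide = is_masked(f2_hz, f2_db, f1_hz, f1_db, slope_db_per_octave=10.0)
print(narrow, wide)
```

Under these assumed numbers, the second formant stays above the steep curve but falls below the shallow one, which corresponds to the case in FIG. 1A where the first formant component masks the second formant component.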
Next, the temporal masking is explained with reference to FIG. 1B.
FIG. 1B is a diagram showing a temporal waveform of speech. In FIG. 1B, the horizontal axis represents time and the vertical axis represents amplitude. A solid line indicates a temporal waveform of speech uttered as “usa”. From the left side of FIG. 1B, the parts corresponding to a vowel “u”, a consonant “s”, and a vowel “a” (i.e., partial speech) appear in this temporal order. In the example shown in FIG. 1B, a dashed line indicates the time domain of the temporal masking in which the preceding vowel “u” masks the subsequent consonant “s”. The temporal masking varies from individual to individual, and the width of the time domain affected by the temporal masking also varies among individuals. In the case of a typical sound, a vowel tends to be greater in power while a consonant tends to be relatively smaller in power. On this account, when the preceding vowel masks the subsequent consonant as in the example shown in FIG. 1B, the consonant may be misheard or inaudible.
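The temporal masking described above can be sketched in the same spirit. The model below assumes that the masking threshold decays linearly in decibels once the preceding sound ends; this decay law, the decay rates, the levels, and the 20 ms delay are all hypothetical and are used only to show how the width of the masked time domain determines whether the consonant is perceived.

```python
def forward_masking_threshold_db(delay_ms, masker_db, decay_db_per_ms):
    """Hypothetical threshold that decays linearly (in dB) after the masker ends."""
    return max(0.0, masker_db - decay_db_per_ms * delay_ms)


def is_temporally_masked(delay_ms, level_db, masker_db, decay_db_per_ms):
    """True if a sound starting delay_ms after the masker lies below the threshold."""
    threshold_db = forward_masking_threshold_db(delay_ms, masker_db, decay_db_per_ms)
    return level_db < threshold_db


# Hypothetical values: the vowel "u" at 70 dB, followed 20 ms later
# by the consonant "s" at 40 dB.
VOWEL_DB = 70.0
CONSONANT_DB = 40.0
DELAY_MS = 20.0

# Fast decay: short masked time domain. Slow decay: long masked time domain.
fast = is_temporally_masked(DELAY_MS, CONSONANT_DB, VOWEL_DB, decay_db_per_ms=2.0)
slow = is_temporally_masked(DELAY_MS, CONSONANT_DB, VOWEL_DB, decay_db_per_ms=0.5)
print(fast, slow)
```

With the fast decay the threshold has dropped to 30 dB by the time the consonant starts, so the consonant is perceived; with the slow decay the threshold is still 60 dB, so the consonant is masked.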
With the emergence of an aging society, the number of people with hearing loss is growing. Known symptoms of hearing loss are decreases in hearing sensitivity, in frequency resolution (frequency selectivity), and in temporal resolution. Due to a decrease in hearing sensitivity, a person with hearing loss finds it harder to perceive a soft sound than a person with normal hearing does. Due to a decrease in frequency resolution, the frequency band affected by the frequency masking is wider than in the case of a person with normal hearing. Thus, a person with hearing loss is likely to misidentify a vowel. Due to a decrease in temporal resolution, the length of time affected by the temporal masking is longer than in the case of a person with normal hearing. Thus, it is harder for a person with hearing loss to perceive a subsequent consonant.
Conventionally, hearing aid processing that simply amplifies the sound has been performed to improve hearing. In order to improve the frequency resolution and the temporal resolution, hearing aid processing called “dichotic-listening binaural hearing aid” has been proposed to reduce the influence of the auditory masking (see Non Patent Literatures 1 and 2, for example). By this processing, an acoustic signal (a signal indicating a sound including speech) is divided on the frequency axis, and the divided acoustic signals, which have different signal characteristics, are presented to the right and left ears, respectively, so that these signals are perceived as one sound in the brain. The dichotic-listening binaural hearing aid processing has been reported to increase the clarity of speech.
It is thought that the dichotic-listening binaural hearing aid processing increases the clarity of speech by presenting an acoustic signal in the masking frequency band (or an acoustic signal in the masking time domain) and an acoustic signal in the masked frequency band (or an acoustic signal in the masked time domain) to different ears, respectively, to make the masked speech perceivable.
FIGS. 2A and 2B are diagrams each showing a frequency spectrum of speech on which the dichotic-listening binaural hearing aid processing has been performed. In FIGS. 2A and 2B, the horizontal axis represents frequency and the vertical axis represents amplitude as in FIG. 1A.
As shown in FIG. 2A, as a result of the dichotic-listening binaural hearing aid processing, only the speech in a low frequency band is heard by one ear. Also, as shown in FIG. 2B, only the speech in a high frequency band is heard by the other ear. Therefore, the speech at the second formant frequency can be prevented from being masked (by the frequency masking) by the speech at the first formant frequency.
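The division on the frequency axis shown in FIGS. 2A and 2B can be sketched as below. This is not the method of the cited literature; it is a minimal illustration that assumes a single crossover frequency placed between the first and second formants and uses a plain discrete Fourier transform. The sampling rate, the crossover at 1000 Hz, the two test tones, and the function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for a short example frame)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [
        (sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def dichotic_split(x, sample_rate_hz, crossover_hz):
    """Split x into a low-band signal (one ear) and a high-band signal (other ear)."""
    n = len(x)
    low, high = [0j] * n, [0j] * n
    for k, value in enumerate(dft(x)):
        # Bins k and n - k carry the same physical frequency for a real signal.
        freq_hz = min(k, n - k) * sample_rate_hz / n
        if freq_hz < crossover_hz:
            low[k] = value
        else:
            high[k] = value
    return idft(low), idft(high)


SAMPLE_RATE_HZ = 8000
N = 160  # one 20 ms frame, giving a bin spacing of 50 Hz

# Hypothetical speech-like frame: a strong 700 Hz component (first formant)
# plus a weaker 1800 Hz component (second formant).
signal = [
    math.sin(2 * math.pi * 700 * t / SAMPLE_RATE_HZ)
    + 0.3 * math.sin(2 * math.pi * 1800 * t / SAMPLE_RATE_HZ)
    for t in range(N)
]

one_ear, other_ear = dichotic_split(signal, SAMPLE_RATE_HZ, crossover_hz=1000)
```

In this sketch `one_ear` contains only the 700 Hz component and `other_ear` only the 1800 Hz component, and their sum reproduces the original frame, so no information is discarded by the division itself.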
FIGS. 3A and 3B are diagrams each showing a temporal waveform of speech on which the dichotic-listening binaural hearing aid processing has been performed. In FIGS. 3A and 3B, the horizontal axis represents time and the vertical axis represents amplitude as in FIG. 1B.
As shown in FIG. 3A, as a result of the dichotic-listening binaural hearing aid processing, only the speech in a low frequency band, that is, only the vowels “u” and “a”, is heard by one ear. Also, as shown in FIG. 3B, only the speech in a high frequency band, that is, only the consonant “s”, is heard by the other ear. Therefore, the consonant “s” can be prevented from being masked (by the temporal masking) by the vowel “u”.