The background description provided herein is for the purpose of generally presenting the context of the invention. The subject matter discussed in the background of the invention section should not be assumed to be prior art merely as a result of its mention in the background of the invention section. Similarly, a problem mentioned in the background of the invention section or associated with the subject matter of the background of the invention section should not be assumed to have been previously recognized in the prior art. The subject matter in the background of the invention section merely represents different approaches, which in and of themselves may also be inventions. Work of the presently named inventors, to the extent it is described in the background of the invention section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the invention.
Cochlear implants are neural prostheses that help severely to profoundly deaf people regain some hearing. Physically, three components can be identified: the speech processor with its transmission coil, the receiver and stimulator, and the cochlear implant electrode array. The speech processor receives sound from one or more microphones and converts the sound into a corresponding electrical signal. While the hearing range of a young healthy human is typically between 0.02 and 20 kHz, it has been assumed for coding of acoustic information in cochlear implants that most of the information used for communication is in the frequency range between 0.1 and 8 kHz. The frequency band from 0.1 to 8 kHz is divided into many smaller frequency bands of about 0.5 octaves width. The number of small frequency bands is determined by the number of electrodes along the electrode array, which is inserted into the cochlea. Each frequency band is then treated by a mathematical algorithm, such as a Hilbert transform, that extracts the envelope of the filtered waveform. The envelope is then transmitted via an ultrahigh frequency (UHF) connection across the skin to a receiver coil, which is surgically implanted behind the ear. The envelope is used to modulate a train of pulses with a fixed pulse repetition rate. For each of the electrodes, a train of pulses with fixed frequency and fixed phase is used to stimulate the cochlear nerve. Multiple algorithms have been implemented to select a group of 4-8 electrode contacts for simultaneous stimulation.
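The band division described above can be illustrated with a short Python sketch. It assumes logarithmically spaced bands, one per electrode contact; the contact count of 12 and the function names `band_edges` and `octaves_per_band` are hypothetical and chosen only for illustration. With 12 contacts, each band spans about 0.53 octaves, close to the 0.5 octaves mentioned above.

```python
import math

def band_edges(f_low_hz, f_high_hz, n_bands):
    """Return n_bands + 1 logarithmically spaced band edges in Hz."""
    ratio = f_high_hz / f_low_hz
    return [f_low_hz * ratio ** (i / n_bands) for i in range(n_bands + 1)]

def octaves_per_band(f_low_hz, f_high_hz, n_bands):
    """Width of each band in octaves when the bands are log-spaced."""
    return math.log2(f_high_hz / f_low_hz) / n_bands

# ASSUMPTION: a hypothetical 12-contact array covering 0.1 to 8 kHz.
edges = band_edges(100.0, 8000.0, 12)
width = octaves_per_band(100.0, 8000.0, 12)

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.1f} Hz to {hi:8.1f} Hz")
print(f"band width: {width:.2f} octaves")
```

A device with more contacts would simply use narrower bands over the same range.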
Biological Constraints that Affect Performance of Cochlear Implants:
Damage to cochlear neural structures can result in severe deafness. Depending on the extent of neural degeneration in the cochlea, the performance of a cochlear implant user may vary. Changes that occur include the demyelination and degeneration of dendrites and neuronal death [1]. The neuronal loss can be non-uniform and results in “holes” of neurons along the cochlea. Holes lead to distortion of the frequency maps, which affects speech recognition [2]. Changes in the firing properties of the nerve, such as prolonged delay times and changed refractory periods, have been described and are attributed to changes in myelination and synapse size [3-6]. In the brainstem and midbrain the neuronal connections appear to remain intact. However, a decrease in neuron size, afferent input, synapse size and density can be detected [7-13]. Neural recordings reveal changes in response properties that adversely affect temporal resolution, such as elevated thresholds, de-synchronization, increased levels of neural adaptation, and increased response latencies. A loss of inhibitory influences has been described. At the cortex, spatially larger cortical activation was seen with positron emission tomography (PET) [1, 14-16]. The findings support a plastic reorganization and more intense use of present auditory networks [15, 17-25].
Technical Constraints that Affect Performance of Cochlear Implants:
A normally functioning cochlea has 3000-3500 hair cells to encode the acoustic information along the cochlea. More than 30 perceptual channels can be processed in parallel [26]. The entire frequency range is between 0.02 and 20 kHz, where high frequencies are encoded in the cochlear base and low frequencies in the cochlear apex. In addition to a spatially selective processing of information along the cochlea, frequency information is conveyed by rate and phase lock. Phase lock in the pristine cochlea is up to 5 kHz. The loudness range is 120 dB with 60-100 discernible steps. In contrast to the ear of a normal hearing listener, a cochlear implant device divides the acoustical signal into 22 or fewer frequency bands and can process only 4 to 8 channels in parallel [27-30]. The limitation comes from the current spread in the cochlea, which leads to large sections of stimulation along the cochlea [31-37]. The delivery of the information is not precise, which is caused by the uncertain placement of the electrode [29, 38]. Shift of the frequency place, which can be caused by a, result in poor speech recognition [39]. The frequency range of a CI is typically limited to 0.4 to 8 kHz. In particular, the inability to process information below 0.4 kHz is a critical limitation for speech recognition. Providing frequency information through phase lock is limited since phase lock in a CI is typically not more than 0.3 kHz. The loudness range is reduced to 6 to 30 dB with about 20 discernible steps [40].
Temporal Envelope (TE) and Temporal Fine Structure (TFS) are Both Important to Recognize Speech, Distinguish Speech in Noise, and for Music Appreciation.
Sinusoidal signals are used as standard stimuli for auditory research. The stimuli are well defined regarding their frequency, amplitude and timing. This makes them convenient for data analysis for two reasons: (1) the cochlea is treated as a frequency analyzer and (2) linear system analysis can be applied.
However, the cochlea is a highly nonlinear system, which processes complex natural sounds for communication. In general, acoustical signals can be decomposed into a slowly varying temporal envelope (TE, FIG. 1) and a rapidly varying temporal fine structure (TFS, FIG. 1) close to the center frequency of a given frequency band. The inner ear of a normal hearing subject or animal makes this decomposition. The cochlea then encodes TE by a rate-place code and TFS by a temporal code [45-57]. While psychophysical experiments demonstrated that both TE and TFS are important for performance in normal hearing subjects, it is still debated whether for hearing restoration with cochlear implants both components must be equally considered [41, 58-62]. This is of particular interest because, from a theoretical point of view, the TE can be recovered from TFS and vice versa [42-44].
Temporal Envelope (TE):
Similar to a frequency analyzer, the basilar membrane of the cochlea “separates” the frequencies contained in an acoustical signal into small frequency bands and maps the energy of each band to a fixed site along the cochlea. The acoustic energy of each frequency band is converted into a corresponding rate of action potentials of an auditory nerve fiber. The TE of a few spectral bands provides sufficient information for speech intelligibility in quiet [59, 63, 64].
Temporal Fine Structure (TFS):
In contrast to TE, the TFS of the acoustical signal plays an important role when speech is presented against a complex background noise [65-69] and for music perception. In the auditory nerve, TFS is encoded by phase-locked responses. At present, TFS is greatly neglected in CI coding strategies, which rely mostly on the envelope of the acoustical signal.
Evaluation of the Role of TE and TFS:
An elegant method to study the relative roles of TE and TFS is the use of auditory chimaeras, which are acoustical constructs that have the TE of one sentence (or melody) and the TFS of another sentence (or melody). In quiet listening environments, speech constructed from two sentences is perceived as the sentence that provided the TE. This was different for music and for speech in noise, where the stimulus that provided the TFS was recognized instead. Those experiments underline the importance of both TE and TFS for hearing [41]. Carefully designed experiments can be used to further our understanding of the role of TE and TFS in coding of acoustic parameters to enhance speech, speech in noise, and music recognition. One should also be aware that the approach with chimaeras has limitations because TE and TFS are not independent. TE can be recovered from TFS and vice versa. While in human testing it is difficult to tease apart true TFS and recovered TFS, this can be achieved in animal experiments where TFS is shown as a phase-locked response of the nerve fiber.
At present, the temporal fine structure is greatly neglected and coding strategies rely solely on the temporal envelope of the acoustic signal. To better understand the arguments about the temporal envelope and the temporal fine structure, the terms are reviewed and techniques to implement them into coding strategies of CIs are discussed below.
Most of the information on which speech recognition is based is contained in the frequency band between 0 and 4000 Hz. For example, high-grade telephone channels typically have a channel capacity of 20,000 bits/s. If English phonemes are tabulated together with their probability of occurrence, the average information per phoneme is 4.9 bits. In conversation about 10 phonemes are uttered per second. Consequently, about 50 bits/s of channel capacity would be sufficient to convey the information at the written equivalent of speech.
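The estimate above follows from a one-line calculation, reproduced here as a small Python check. All three numbers are taken from the paragraph above; the variable names are chosen for illustration only.

```python
# Values quoted in the text above.
BITS_PER_PHONEME = 4.9          # average information per English phoneme
PHONEMES_PER_SECOND = 10        # approximate rate in conversation
TELEPHONE_CAPACITY = 20_000     # bits/s, high-grade telephone channel

# Information rate of speech at the level of its written equivalent.
speech_rate = BITS_PER_PHONEME * PHONEMES_PER_SECOND    # bits/s

# How much larger the telephone channel capacity is than this rate.
headroom = TELEPHONE_CAPACITY / speech_rate

print(f"speech rate: about {speech_rate:.0f} bits/s")
print(f"telephone channel capacity is about {headroom:.0f} times larger")
```

The product is 49 bits/s, which the text rounds to about 50 bits/s.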
Basic Consideration for Coding Strategies:
Different strategies for coding acoustical information are shown in FIG. 2. The solutions depicted in panels A and B of FIG. 2 are not practical for cochlear implants and are presented for completeness. A train of biphasic pulses with constant amplitude (ca) and constant frequency (cf) has limited use because only a single amplitude can be encoded by the amplitude of the pulses and only a single frequency by the pulse repetition rate (panel A of FIG. 2) or the stimulation site along the cochlea. Additional timing information can be added if the times are considered at which the carrier of the acoustical signal has zero crossings and the slope of the carrier is positive (panel B of FIG. 2). The TE of the acoustical signal can be used to modulate a constant carrier (panel D of FIG. 2). To avoid simultaneous stimulation at neighboring electrodes, the carrier pulses are presented at adjacent electrodes in a continuous interleaved pattern (CIS) [71]. The latter strategy is commonly used in coding strategies of contemporary CIs. Some codes adopt the approach shown in panel E of FIG. 2. A carrier is amplitude modulated with the TE of the acoustical signal. The zero crossings of the TFS are then used to provide additional timing information. Coding strategies depicted in panels C and F of FIG. 2 have not been implemented in CIs to date. TFS is also included in the n-of-m coding strategy (n frequencies are selected out of m possible frequencies) and by implementing virtual channels (stimulation between two electrode contacts by electrical field superposition). Current steering increases the number of possible frequencies that may be selected [72-75]. The n-of-m coding strategy can be seen as an alternative to CIS [76-78].
Nie and coworkers have proposed a coding strategy depicted in panel F of FIG. 2 [69, 79, 80], which encodes both amplitude and frequency of the signal. They suggest dividing the acoustical signal into frequency bands and extracting the temporal envelope, as has been done before. They also suggest encoding the frequency fine structure in each frequency band. Their tests with CI users demonstrated, as others have before, that TE is sufficient to encode speech in quiet listening environments. They have also demonstrated that adding fine structure to the code can significantly improve speech recognition scores in normal hearing subjects when background noise is present [69, 79, 80].
While the temporal envelope is used in contemporary CI coding strategies, the temporal fine structure receives little attention.
In general, CI coding strategies accomplish three tasks: the extraction of the relevant information from the acoustical stimulus, the conversion of that information into an electrical stimulus, and the delivery of the stimuli via the cochlear implant electrode to the ear. Step one includes filtering the acoustical signal with a pre-emphasis filter and dividing the acoustical signal into individual frequency components using filters, gammatone-like filters, or spectrograms. From each frequency band the TE and TFS are extracted. In step two, different strategies are employed to select a small number of frequency bands. Typical strategies for this task are F0F1F2, multi peak (MPEAK), spectral peak (SPEAK), advanced combination encoder (ACE), n-of-m, and fine structure processing (FSP). The TE of the selected frequency band is then used to modulate the amplitude of a carrier. The carriers are trains of biphasic electrical pulses delivered at a fixed rate. Different carriers have been tested: sinusoids, broad-band noise, narrow-band noise and pulse trains with fixed and variable pulse rates. Typically, a nonlinear mapping function is applied. TFS is largely ignored in this process. In the third step, pulses are delivered via the CI electrode. If pulses occur at the same time at neighboring electrodes, interactions are possible. To avoid deleterious effects, two strategies are applied: current steering and CIS [62, 81]. Current steering uses the interaction between two neighboring channels to evoke a percept that lies between the sites of the electrodes used to deliver the current. This strategy increases the number of pitches a CI user can perceive [72]. CIS introduces delays between the pulses delivered at neighboring channels. The delays are large enough that simultaneous stimulation does not occur [71].
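The interleaving idea behind CIS can be sketched in a few lines of Python. This is a minimal illustration and not the scheme of any particular device: the channel count, per-channel rate, pulse duration, and the names `cis_schedule` and `overlaps` are assumptions. Each channel is given its own time slot within one stimulation period, so no two pulses coincide.

```python
def cis_schedule(n_channels, rate_pps, pulse_width_s, n_cycles):
    """Return (onset_time_s, channel) pairs for interleaved pulses.

    ASSUMPTION: the stimulation period is split into equal time slots,
    one per channel, and each biphasic pulse must fit into its slot.
    """
    period = 1.0 / rate_pps
    slot = period / n_channels
    if pulse_width_s > slot:
        raise ValueError("pulse does not fit into its time slot")
    return [(cycle * period + ch * slot, ch)
            for cycle in range(n_cycles)
            for ch in range(n_channels)]

def overlaps(schedule, pulse_width_s):
    """Return True if any two pulses in the schedule overlap in time."""
    times = sorted(t for t, _ in schedule)
    return any(b - a < pulse_width_s for a, b in zip(times, times[1:]))

# Hypothetical example: 8 channels, 1000 pulses per second per channel,
# 100 microseconds total duration per biphasic pulse, 3 stimulation cycles.
schedule = cis_schedule(n_channels=8, rate_pps=1000,
                        pulse_width_s=100e-6, n_cycles=3)
print(overlaps(schedule, 100e-6))
```

In this example each slot is 125 microseconds long, so the 100 microsecond pulses never overlap. A higher per-channel rate or more channels shortens the slot and eventually triggers the `ValueError`.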
Five coding strategies, HiRes120, n-of-m, FSP, ACE, and SPEAK, from three major cochlear implant companies, Advanced Bionics, Med-El and Cochlear Ltd., were analyzed to determine whether TFS is included in the CI coding strategy and how it is implemented. Two strategies, HiRes120 and n-of-m, both filter the acoustical signal captured by the microphone using a sequence of 16 bandpass filters (6th order Chebyshev with 3 dB ripple). They map the energy in each frequency band with a nonlinear function and generate trains of interleaved biphasic pulses with a high stimulation rate. The goal of the HiRes120 strategy is to increase temporal and spectral resolution and consequently introduce TFS [82-92]. To achieve the required higher spectral resolution, filtered signals from each channel are analyzed by using the FFT of the signal to select maxima that represent dominant frequency components. If two selected frequencies are within the resolution band of a patient, the one with the smaller magnitude is eliminated. The resolution band is the frequency range within which patients cannot distinguish pitch differences and is set to ⅓ octave in the HiRes120 code. Current steering is then used to stimulate the corresponding sites along the cochlea. The stimulation sites are determined by using the selected frequencies and the Greenwood function [93], which maps the frequencies tonotopically along the cochlea. Furthermore, to achieve better timing resolution, this strategy also uses the half-wave rectified waveform to modulate the electrical pulse trains [94]. The n-of-m strategy differs only in the criteria for selecting channels. A threshold is determined by calculating the average energy of all channels [95]. The threshold value is then subtracted from the energy of each channel, and the difference d is used to calculate a probability p. Each channel is assigned a random number r within [0,1], which is then compared with p. If r is less than p, the channel is selected. After all channels have been tested, if the number of selected channels is smaller than 6, unpicked channels are randomly selected until six channels are selected. If the number of selected channels is larger than 6, picked channels are randomly deselected until six channels remain. The threshold is also adjusted accordingly [96].
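The stochastic channel selection described above can be sketched as follows. The text does not state how the probability p is computed from the difference d, so the logistic function used here is purely an assumption, as are the `slope` parameter, the function name `select_channels`, and the example energies. The threshold adjustment mentioned in the text is omitted.

```python
import math
import random

def select_channels(energies, n_select=6, slope=1.0, rng=None):
    """Pick n_select channels by comparing a random number r with p(d)."""
    if len(energies) < n_select:
        raise ValueError("fewer channels than requested selections")
    rng = rng or random.Random()
    threshold = sum(energies) / len(energies)    # average energy of all channels
    selected = []
    for ch, energy in enumerate(energies):
        d = energy - threshold
        # ASSUMPTION: logistic mapping from d to a probability in (0, 1).
        p = 1.0 / (1.0 + math.exp(-slope * d))
        if rng.random() < p:                     # r < p selects the channel
            selected.append(ch)
    unpicked = [ch for ch in range(len(energies)) if ch not in selected]
    # Too few channels: add randomly chosen unpicked channels.
    while len(selected) < n_select:
        selected.append(unpicked.pop(rng.randrange(len(unpicked))))
    # Too many channels: randomly deselect until n_select remain.
    while len(selected) > n_select:
        selected.pop(rng.randrange(len(selected)))
    return sorted(selected)

# Hypothetical energies (arbitrary units) for 16 channels.
energies = [3.0, 9.0, 1.0, 7.0, 5.0, 8.0, 2.0, 6.0,
            4.0, 9.0, 1.0, 7.0, 3.0, 8.0, 2.0, 6.0]
selected = select_channels(energies, rng=random.Random(0))
print(selected)
```

Because the selection is random, repeated calls with different seeds return different channel sets, with high-energy channels chosen more often.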
Another coding strategy that considers TFS uses a combination of a High Definition CIS (HDCIS) strategy for high frequencies (electrode contacts at the basal cochlear end) and channel-specific sampling sequences (CSSS) at the one to four most apical channels or electrode contacts. The channel selection is based on the finding that, in severely to profoundly hearing-impaired listeners, only neurons responding to low-frequency signals (<1 kHz) phase lock and hence respond to TFS. The filters used in this strategy are a series of bell-curve-shaped bandpass filters that overlap at their −3 dB cutoff frequencies [97]. This can be achieved with a gammatone filter. The frequency range is 70 to 8000 Hz, separated into 12 channels. At the high-frequency channels, the signals are Hilbert transformed to extract the TE. The TE obtained at each channel is then used to modulate the amplitude of a carrier, a sequence of interleaved biphasic pulses at a rate of about 1500 pulses per second (pps). At the low-frequency channels, TFS is encoded. The acoustical signals are half-wave rectified such that only positive values remain. Positive zero points, which are points on the waveform with zero value and positive derivative, are identified from the rectified signals and marked as the time points at which pulse trains are generated using the CSSS strategy. The rate of the pulse trains is fixed and patient specific. This method gives a non-constant rate of stimulation among the apical channels, where the length of the pulse bursts depends on the center frequencies. This approach provides TFS information of the acoustical signal [98].
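The detection of positive zero points can be sketched as below. This is a minimal illustration under stated assumptions: the function name `positive_zero_crossings`, the sampling rate, the 200 Hz test tone, and the use of linear interpolation between samples are all chosen for illustration and are not taken from the CSSS implementation. The sketch works on the band-filtered signal directly, since a positive-going zero crossing is also the point at which the half-wave rectified signal leaves zero.

```python
import math

def positive_zero_crossings(samples, fs_hz):
    """Return the times (s) at which the waveform crosses zero going upward."""
    times = []
    for n in range(1, len(samples)):
        prev, cur = samples[n - 1], samples[n]
        if prev <= 0.0 < cur:
            # ASSUMPTION: refine the crossing time by linear interpolation.
            frac = -prev / (cur - prev)
            times.append((n - 1 + frac) / fs_hz)
    return times

# Hypothetical apical-channel signal: a 200 Hz tone sampled at 16 kHz for 50 ms.
fs = 16000
f0 = 200.0
signal = [math.sin(2.0 * math.pi * f0 * n / fs - 0.5)
          for n in range(int(0.05 * fs))]

crossings = positive_zero_crossings(signal, fs)
intervals = [b - a for a, b in zip(crossings, crossings[1:])]
print(len(crossings), intervals[0])
```

The intervals between crossings equal the period of the tone, which is how the trigger times carry fine-structure timing.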
Two other common coding strategies do not introduce TFS and rely on TE only. In the ACE strategy, a pre-emphasis filter whose shape follows the equal-loudness threshold across frequencies is used. The goal is to emulate the “perceptual power” similar to what normal hearing people perceive. Next, the signal is analyzed by two spectrograms: one with a window size of 256 points, which has better spectral resolution, and one with a window size of 64 points, which has better time resolution. Both spectrograms are vocoded to 22 channels. The “256-window” spectrogram is mainly used as the criterion for channel selection. In this strategy, the 8 channels with the highest energies are selected. After a channel is selected, the two adjacent channels are checked to see whether their energies are below the threshold limit, defined as 20 dB less energy than the selected channel. If the energy of an adjacent channel is below the threshold limit, that channel is considered to be masked by the selected channel and is eliminated from subsequent channel selections [99]. Once all 8 channels are selected, the energy derived from the “64-window” spectrogram is used as the stimulation power for each selected channel. Similar to the strategies described above, energies are mapped to current levels and pulses are delivered in an interleaved biphasic manner at a high stimulation rate (2500 pps) [100].
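The channel selection with masking described above can be sketched as a greedy loop. This is an illustrative reading of the description, not a vendor implementation: the function name `ace_select`, the tie-breaking order, and the example channel energies are assumptions.

```python
def ace_select(energies_db, n_select=8, mask_db=20.0):
    """Greedily pick the highest-energy channels, dropping masked neighbors."""
    candidates = set(range(len(energies_db)))
    selected = []
    while candidates and len(selected) < n_select:
        # Highest remaining energy; sorting makes ties deterministic.
        best = max(sorted(candidates), key=lambda ch: energies_db[ch])
        candidates.remove(best)
        selected.append(best)
        # An adjacent channel more than mask_db below the selected channel
        # is treated as masked and removed from later selections.
        for neighbor in (best - 1, best + 1):
            if (neighbor in candidates
                    and energies_db[neighbor] < energies_db[best] - mask_db):
                candidates.remove(neighbor)
    return sorted(selected)

# Hypothetical energies (dB) for 22 channels with one strong peak at channel 5.
energies_db = [40, 41, 42, 43, 49, 70, 48, 44, 39, 38, 37,
               36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26]
print(ace_select(energies_db))
```

In this example channels 4 and 6 have the second and third highest energies, but both lie more than 20 dB below their neighbor, channel 5, and are therefore skipped in favor of channels 8 and 9.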
The other method, SPEAK, is a low pulse repetition rate coding strategy. The pre-processed signal is analyzed with a spectrogram with a window size of 256 points and then vocoded to 22 channels. A gain factor G is calculated for each channel every 3 frames (48 ms) using the following function: G=(2×Ec−2×Ep−Ef)/(Ec+Ep+Ef), where Ec, Ep and Ef are the energies of the first, second and third frame, respectively. G describes how fast the envelope energy varies and emphasizes the transients of the speech signal, where the envelope changes quickly. G must be larger than zero and smaller than 2. A modified signal S′ is then computed using S′=S×(1+K×G), where S is the original signal and K is the modifier constant with a value of 2 [41, 101]. The eight channels with the highest energies are selected and the pulse amplitudes are mapped to current levels [102]. The method of delivering pulses in this strategy is different. Either a pulse burst, which is a sequence of ramping pulses [103], or a ±10% jitter of the pulse timing can be used [104]. Both methods aim to introduce stochastic time patterns of neural activity.
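The gain computation above can be written out as a short Python sketch. The formula for G and the constant K=2 are taken from the text; enforcing the stated range of G by clamping to [0, 2], the handling of silent frames, and the function names `transient_gain` and `emphasize` are assumptions made for illustration.

```python
def transient_gain(e_c, e_p, e_f):
    """Gain factor G = (2*Ec - 2*Ep - Ef) / (Ec + Ep + Ef)."""
    total = e_c + e_p + e_f
    if total <= 0.0:
        return 0.0                      # ASSUMPTION: no emphasis for silent frames
    g = (2.0 * e_c - 2.0 * e_p - e_f) / total
    # ASSUMPTION: the stated range of G is enforced by clamping to [0, 2].
    return min(max(g, 0.0), 2.0)

def emphasize(s, g, k=2.0):
    """Modified signal S' = S * (1 + K * G), with K = 2 as in the text."""
    return s * (1.0 + k * g)

# Hypothetical frame energies: a transient, then a steady segment.
g_transient = transient_gain(4.0, 0.5, 1.0)
g_steady = transient_gain(1.0, 1.0, 1.0)
print(g_transient, emphasize(1.0, g_transient))
print(g_steady, emphasize(1.0, g_steady))
```

For the steady segment the raw value of G is negative, so the clamped gain is zero and the signal passes unchanged, whereas the transient is amplified.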
A coding strategy that considers both TE and TFS, using algorithms for extracting amplitude modulation (AM) and frequency modulation (FM), has been described by Nie et al. [69]. The acoustical signal is divided into N subbands by a bank of bandpass filters. Within each subband, amplitude and frequency modulations are extracted in separate pathways. The output of each subband k is full-wave rectified and then low-pass filtered to obtain the AM signal. Delay compensation is also introduced to synchronize the amplitude and frequency modulation pathways. A pair of orthogonal sinusoidal signals at the center frequency of the kth subband is used to remove the center frequency from the original signal and to extract the frequency modulation around the center frequency. The resulting signal is low-pass filtered. The instantaneous frequency is calculated from the in-phase and out-of-phase signals. The instantaneous frequency is further band-limited and low-pass filtered to limit the frequency modulation rate. However, two problems limit the use of the approach described: (1) the trains of pulses are still amplitude modulated, and increasing current amplitudes result in increasing interactions between neighboring electrodes; (2) frequency modulation is retrieved from each subband and used to modulate the spike patterns. Additionally, this approach does not take into account the phase information.
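The two extraction pathways described above can be sketched for a single subband as follows. This is a simplified illustration under stated assumptions, not the implementation of Nie et al.: a moving average stands in for the low-pass filters, delay compensation and the final band-limiting of the instantaneous frequency are omitted, and the function names and test signal are hypothetical.

```python
import math

def moving_average(x, m):
    """Causal moving average of length m, a simple stand-in low-pass filter."""
    out, acc = [], 0.0
    for n, value in enumerate(x):
        acc += value
        if n >= m:
            acc -= x[n - m]
        out.append(acc / m)
    return out

def extract_am_fm(x, fs_hz, fc_hz, m=16):
    """Return (am, fm): envelope and frequency deviation (Hz) around fc_hz."""
    # AM pathway: full-wave rectification followed by low-pass filtering.
    am = moving_average([abs(v) for v in x], m)
    # FM pathway: multiply by orthogonal sinusoids at the center frequency,
    # then low-pass filter to remove the component near twice that frequency.
    w = 2.0 * math.pi * fc_hz / fs_hz
    i_sig = moving_average([v * math.cos(w * n) for n, v in enumerate(x)], m)
    q_sig = moving_average([-v * math.sin(w * n) for n, v in enumerate(x)], m)
    # Instantaneous frequency from the phase increment between samples.
    fm = []
    for n in range(1, len(x)):
        cross = i_sig[n - 1] * q_sig[n] - q_sig[n - 1] * i_sig[n]
        dot = i_sig[n - 1] * i_sig[n] + q_sig[n - 1] * q_sig[n]
        fm.append(math.atan2(cross, dot) * fs_hz / (2.0 * math.pi))
    return am, fm

# Hypothetical subband centered at 1000 Hz containing a 1050 Hz tone.
fs, fc, m = 16000, 1000.0, 16
x = [math.cos(2.0 * math.pi * 1050.0 * n / fs) for n in range(1600)]
am, fm = extract_am_fm(x, fs, fc, m)

# Skip the filter warm-up before averaging.
mean_am = sum(am[m:]) / len(am[m:])
mean_fm = sum(fm[m:]) / len(fm[m:])
print(mean_am, mean_fm)
```

For this test tone the mean frequency deviation is close to 50 Hz and the mean rectified envelope is close to 2/π of the tone amplitude. The sample-by-sample frequency estimate still contains ripple from the imperfect moving-average filter, which is the reason the described method low-pass filters the instantaneous frequency further.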
Therefore, a heretofore unaddressed need exists in the art to address the aforementioned deficiencies and inadequacies.