The analysis and modeling of the human auditory system have long been a central topic both in recognizing and classifying audio signals and in medical technology. In particular, the structure of the human ear has been studied extensively. To aid understanding of the present invention, some basic findings regarding the fundamentals of auditory perception are presented in the following.
Physiology: Auditory Periphery and Central Audition
The physiology of the human auditory periphery has by now been well researched and is documented in a large body of scientific literature. Thus, only the basic facts necessary for understanding the later explanations are presented here.
The peripheral sound processing apparatus of humans (see FIG. 20) consists of the outer ear, middle ear and inner ear. Through the auditory canal (acoustic meatus), the sound reaches the eardrum and is passed on in the middle ear via the ossicles. Subsequent processing in the inner ear causes a frequency-dependent transduction of mechanical oscillations into neural action potentials, which are passed on to the connected auditory nerve fibers.
Outer Ear:
The outer ear forms a funnel leading the incoming sound waves to the eardrum. The auricle, the auditory canal, and the shape of the skull and shoulders modify the sound signal.
As the auditory canal (including the auricle) is open at one end and closed at the other, it may be approximated physically as a half-open tube. Thus, in the case of resonance, i.e. when a quarter of the sound wavelength corresponds to the effective auditory canal length, a sound pressure level gain may be observed. At the resonance maximum at approximately 2500 Hz, the amplification is up to 20 dB. A second resonance (“Cavum Conchae resonance”) is caused between 2000 Hz and 2500 Hz by the auricle alone.
Depending on the direction of sound incidence, the shape of the outer ear boosts or attenuates individual narrow frequency ranges, the so-called “direction-determining bands”. To a certain extent, this allows incoming sound to be localized even without binaural time and intensity differences, in particular in the vertical plane (median sagittal plane).
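The quarter-wavelength condition for a half-open tube can be illustrated with a short calculation. The following is only a sketch: the speed of sound (343 m/s, air at roughly 20 °C) and the function names are assumptions that do not come from the text; the effective length is simply the value implied by the stated resonance at approximately 2500 Hz.

```python
# Quarter-wavelength resonance of a half-open tube: L = wavelength / 4,
# hence f = c / (4 * L).
# Assumption (not from the text): speed of sound in air of 343 m/s.
SPEED_OF_SOUND_M_PER_S = 343.0


def quarter_wave_resonance_hz(effective_length_m, c=SPEED_OF_SOUND_M_PER_S):
    """First resonance frequency of a tube open at one end, closed at the other."""
    return c / (4.0 * effective_length_m)


def effective_length_m(resonance_hz, c=SPEED_OF_SOUND_M_PER_S):
    """Effective tube length implied by a given first resonance frequency."""
    return c / (4.0 * resonance_hz)


# Effective canal length implied by the resonance at about 2500 Hz.
length_m = effective_length_m(2500.0)
print(f"implied effective length: {length_m * 100:.2f} cm")
```

Under these assumptions, a resonance at 2500 Hz corresponds to an effective length of about 3.4 cm.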
The described phenomena may be summarized by the outer ear transfer function (or “head related transfer function” HRTF, respectively), illustrated in FIG. 21.
Middle Ear:
The main task of the middle ear (MO) consists in matching the characteristic acoustic impedance of air to that of the liquids within the inner ear. If this function is missing, as in the case of conductive hearing loss, up to 98% of the incoming sound energy is reflected. With a healthy middle ear, around 60% of the signal intensity may be passed on to the inner ear. The sound pressure amplification necessary for this is made possible by the serial coupling of the eardrum, the three ossicles (hammer, anvil and stapes) and the oval window as the point of contact with the inner ear (see FIG. 22).
Three different mechanisms are responsible for this impedance transformation:
1. Area ratio of eardrum $A_T$ and stapes sole plate $A_S$: $A_T / A_S \cong 17$
2. Ratio of the lever arms of hammer $l_H$ and anvil $l_A$: $l_H / l_A \cong 1.3$
3. Lever arm by the curvature of the eardrum and the asymmetrical suspension of the hammer: $F_T \approx 1.4$
The overall amplification is calculated to be:
$$\frac{p_{ges}}{p_T} = F_T \, \frac{A_T}{A_S} \, \frac{l_H}{l_A} \cong 30\ \mathrm{dB}$$
($p_T$: sound pressure at the eardrum).
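The overall figure can be checked numerically from the three factors given above (about 17, 1.3 and 1.4). The following is a minimal sketch; the variable names are illustrative, and the conversion to decibels assumes the usual definition for pressure ratios, 20·log10(ratio).

```python
import math

# Approximate factors as stated in the text.
AREA_RATIO = 17.0      # A_T / A_S, eardrum area to stapes sole plate area
LEVER_RATIO = 1.3      # l_H / l_A, hammer to anvil lever arms
EARDRUM_FACTOR = 1.4   # F_T, curvature of the eardrum and hammer suspension

# The three mechanisms act in series, so the pressure ratios multiply.
total_gain = EARDRUM_FACTOR * AREA_RATIO * LEVER_RATIO

# Pressure ratio expressed in decibels (assumed convention: 20 * log10).
total_gain_db = 20.0 * math.log10(total_gain)

print(f"pressure ratio: {total_gain:.2f}")
print(f"gain: {total_gain_db:.1f} dB")
```

The product of the three factors is a pressure ratio of roughly 31, which corresponds to just under 30 dB and agrees with the rounded value stated above.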
The transfer function of the MO is of particular importance; it acts like a band-pass filter with a wide passband. In the low frequency range it is limited by the mechanical characteristics of the eardrum and oval window. At high frequencies, the moments of inertia as well as the friction and bending losses of the ossicles limit the transmission. If the MO transfer function is compared to the hearing threshold (see FIG. 23), it may be seen that the auditory sensitivity curve is mainly determined by the mechanical characteristics of the middle and outer ear.
An additional task is fulfilled by the muscles of the MO (M. tensor tympani and M. stapedius, see FIG. 20). By a reflex contraction, the stiffness of the MO may be increased, which attenuates lower frequencies. This results in limited protection against high sound levels and in a reduced perception of self-produced sounds.
Inner Ear:
The inner ear consists of two units. While the vestibular organ is a component of the system of equilibrium, the cochlea forms the final part of the auditory periphery (see FIG. 22). Anatomically, the cochlea resembles a snail shell with two and a half turns. The cochlear partition separates it into two chambers, the “scala vestibuli” (SV) and the “scala tympani” (ST), which contain perilymph liquid.
The operation of the cochlea may in turn be described in two parts. The hydromechanical part is determined by the macro- and micro-mechanical characteristics of the interior of the windings. The actual functional unit for converting the input signals into neural representations is located within the cochlear partition. The scala vestibuli is connected to the middle ear via the oval window (OW). The oval window oscillates with the movement of the stapes and thus forces the incompressible lymph liquid to give way. This displacement is passed on to the cochlear partition and forms a traveling wave in the direction of the helicotrema (HC), the tip of the cochlea. Due to its mechanical characteristics (mass loading, stiffness, width, etc.), which change continuously along its length, the partition forms frequency-dependent resonances at certain locations. This tonotopic frequency selectivity is also referred to as location theory (place theory).
Locations of maximum wave amplitude may be associated with characteristic frequencies on the partition, ranging continuously from high frequencies in the area of the oval window (base of the basilar membrane) to low frequencies at the helicotrema (end or apex of the basilar membrane). Through this dispersion characteristic, the frequency content of the incoming audio signal may be separated to a certain extent.
This functionality is supported by the characteristics of the cochlear partition (see FIG. 24). It is bounded towards the scala vestibuli by the Reissner membrane (RM). The interface to the scala tympani consists of the basilar membrane (BM) together with the organ of Corti (CO) mounted on it, on whose top side three rows of outer hair cells and one row of inner hair cells are arranged in the longitudinal direction. These hair cells are in turn covered by the tectorial membrane (TM).
The endolymph liquid of the scala media is located in the area in between. When the cochlear partition moves, the tectorial membrane and the organ of Corti move relative to each other, which leads to a deflection of the sensory hairs located on the hair cells. This happens partly by direct contact and partly by hydrodynamic coupling. The outer hair cells are able to shorten or lengthen very quickly depending on the oscillation of the partition. This amplifies the traveling wave amplitudes by a factor of up to 1000 and provides sharp and distinct oscillation maxima.
Just like the sensory hairs of the outer hair cells, those of the inner hair cells are also deflected by the relative movement of tectorial membrane and organ of Corti. The measured, three-dimensional movement of the cochlea is complicated and was for example determined by Zenner and Gummert at the University of Ulm. As a consequence of this movement, biochemical processes are started causing a transduction of mechanical movements into neural action potentials (see FIG. 25).
In the resting state, the inner hair cells have a resting membrane potential of about −40 mV and a low potassium concentration. The surrounding liquid of the scala media, however, contains an unusually high proportion of potassium ions and is positively charged. When the sensory hairs are deflected in one direction, so-called transduction ion channels open, through which positively charged potassium ions flow into the hair cells as a result of potential equalization. A deflection of the sensory hairs in the opposite direction closes these channels, and the original potential may be reestablished via ion transport through the basolateral cell membrane. When the channels are open, the changed receptor potential causes an increased release of afferent transmitter substance.
The transmitter substance diffuses across the synaptic cleft towards the auditory nerve. The higher the transmitter concentration in the synaptic cleft, the greater the probability of triggering a nerve action potential (NAP).
Up to a frequency of approximately 5000 Hz, the release of the transmitter substance follows the deflection of the sensory hairs in a highly synchronous manner. Thus, a linear frequency transmission may occur via time encoding, which is referred to in the literature as “phase locking”.
Further, reference is also made to FIG. 26, which again shows the anatomy of the auditory periphery. FIG. 26 shows the transmission of a sound via the eardrum and middle ear to the cochlea. The cochlea enables a spectral analysis of the incoming sound and a conversion of vibrations into neural impulses. The cochlea further comprises nerve cells generating nerve impulses (action potentials), which are passed on to the brain via the auditory nerve.
FIG. 27 again shows, in schematic form, the mechanism of signal transmission in the human ear. From FIG. 27 it may be seen that the cochlea 3210 recognizes different frequencies at different locations (location theory). For example, high frequencies (e.g. 20 kHz) are converted into nerve signals at the beginning of the cochlea, while low frequencies (e.g. 20 Hz) are converted into nerve signals at the end of the cochlea. Thus, a spectral analysis of a sound or an audio signal may take place in the cochlea, wherein for a given frequency those nerve cells are excited most that are most suitable for perceiving that frequency.
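The mapping between location and frequency can be sketched numerically. The text itself does not specify a formula; the sketch below assumes the Greenwood place-frequency function with commonly quoted human parameters (A = 165.4, a = 2.1, k = 0.88, position expressed as a fraction of the cochlear length measured from the apex), which reproduces the range of roughly 20 Hz to 20 kHz mentioned above. The function name is illustrative.

```python
def greenwood_frequency_hz(x_from_apex, A=165.4, a=2.1, k=0.88):
    """Characteristic frequency at a relative cochlear position.

    x_from_apex: 0.0 at the apex (end of the cochlea), 1.0 at the base
    (beginning of the cochlea, near the oval window).
    Assumption: Greenwood function with human parameters; not from the text.
    """
    return A * (10.0 ** (a * x_from_apex) - k)


# Characteristic frequencies at a few positions from apex to base.
for position in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"x = {position:.2f}: {greenwood_frequency_hz(position):8.1f} Hz")
```

Under this assumption, the apex corresponds to about 20 Hz and the base to about 20 kHz, with the characteristic frequency rising monotonically in between.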
FIG. 28 shows the structure of the organ of hearing, with reference also being made to the geometry of the basilar membrane. A graphical illustration 3310 shows that the width of the basilar membrane 3320 increases by a factor of 10 from the base of the cochlea towards the end (apex) of the cochlea.
A graphical illustration 3320 further shows the coupling of an acoustic wave into the cochlea via an oval window 3330. The coupling via the oval window 3330 generates a traveling wave in the cochlea, which travels from the base 3340 of the cochlea to the apex 3350 of the cochlea and thus deflects the basilar membrane 3360 of the cochlea. It is to be noted here that nerve cells located closer to the base 3340 of the cochlea are excited earlier than nerve cells located further from the base 3340 of the cochlea. In other words, the location of the traveling wave as a function of time may be regarded as a trajectory of the traveling wave. The trajectory may of course also be mapped to discrete nerve cells, so that a trajectory also describes the time sequence in which several spatially separated nerve cells are excited by a traveling wave.
FIG. 29 shows an exemplary electrical equivalent circuit model, with the help of which the propagation of sound waves through the cochlea up to the excitation of the inner hair cells may be modeled. The illustrated model is known as the “extended Zwicker model”. The model describes, for example, the hydromechanics of the inner ear and the nonlinear feedback of the outer hair cells. It is noted, however, that the illustrated model is only one of many possible models for calculating the excitation of the inner hair cells.
FIG. 30 describes, in a schematical illustration, the organ of Corti, and FIG. 31 describes the setup of two different types of hair cells.
FIG. 32 shows a detailed schematic illustration of two hair cells. The schematic illustration of FIG. 32 is designated by 3700 in its entirety. With reference to the graphical illustration 3700, the chemical processes within an inner hair cell are briefly outlined here to aid understanding.
The hair cell 3710 comprises a plurality of stereocilia 3720 having the shape of fine hairs. An excitation or deflection of the stereocilia changes the permeability or conductivity of a cell membrane, so that positively charged potassium ions 3730 may enter the hair cell. This changes the intracellular potential of the hair cell, often designated by V(t). Depending on the intracellular hair cell potential V(t), positively charged calcium ions 3740 may enter the cell, so that the concentration of calcium ions 3740 increases. The calcium ions then influence the release of neurotransmitter molecules 3750 into a synaptic cleft 3760 between the hair cell 3710 and a nerve fiber 3770. The release of the neurotransmitter molecules 3750 typically takes place in quantized form, in vesicles of several thousand molecules.
The concentration of neurotransmitters in the synaptic cleft 3760 then changes the potential in the synaptic cleft 3760. If the potential in the synaptic cleft 3760 exceeds a certain threshold value, then finally an action potential in the nerve fiber 3770 is generated.
Finally, for clarification, FIG. 33 shows the arrangement of a plurality of hair cells in a sensory point of a human cochlea. From the illustration of FIG. 33 it may be seen that one individual hair cell typically comprises a plurality of stereocilia (hairs) and is coupled to a plurality of nerve fibers.
Some approaches already exist for processing or identifying audio signals with reference to the processes in human hearing. For example, Thorsten Heinz and Andreas Brückmann, in the article “Using a physiological ear model for automatic melody transcription and sound source recognition”, presented at the 114th Meeting of the Audio Engineering Society in Amsterdam, The Netherlands, in March 2003, describe an audio signal analysis and perception-oriented modifications of conventional signal processing algorithms.
The above-mentioned article describes a simulation of the functionality of the inner ear including the conversion of mechanical vibrations into information about a concentration of a transmitter substance in the clefts of the inner hair cells. The basilar membrane is here separated into 251 regions of uniform width, and each segment is connected to an inner hair cell, wherein the inner hair cell is excited by vibrations of the corresponding section of the basilar membrane. For a pitch recognition, the concentration of the transmitter substance in the clefts of the 251 described hair cells is then analyzed.
To this end, pitch trajectories are formed and segmented. Further, the mentioned article briefly describes the recognition of both timbre and melody.
Further, Toshio Irino and Roy D. Patterson describe in their article “Segregating Information about the size and shape of the vocal tract using a time domain auditory model: The Stabilized Wavelet-Mellin Transform” (published in the Elsevier Journal for Speech Communication 36, 2002, pages 181-203) the application of a two-dimensional Mellin transformation to an auditory image. According to the above article, the Mellin transformation generates a Mellin image from the auditory image which is invariant with regard to the size of a vocal tract of a speaker on whose speech signal the auditory image is based.
The above-mentioned article proposes a speech recognition using a so-called Mellin image, which is obtained from a size-shape image by a spatial Fourier transformation. According to T. Irino and R. D. Patterson, however, the size-shape image is itself obtained from a stabilized auditory image through a plurality of conversion steps.
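The size invariance attributed to the Mellin image can be illustrated with a generic property of the Mellin transformation: on a logarithmic axis, scaling becomes a shift, and the magnitude of a Fourier transform is unaffected by shifts. The following sketch demonstrates only this general principle; it is not the processing chain of the cited article, and the test pattern, grid and scale factor are hypothetical.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the discrete Fourier transform (plain O(n^2) version)."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, s in enumerate(samples)))
        for k in range(n)
    ]


def pattern(u):
    """Hypothetical smooth test pattern, given on a log axis u = ln(t)."""
    return math.exp(-u ** 2) + 0.5 * math.exp(-(u - 1.5) ** 2)


# Uniform grid on the logarithmic axis, wide enough that the pattern
# has decayed to practically zero at both ends.
N = 256
U_MIN, U_MAX = -8.0, 8.0
grid = [U_MIN + (U_MAX - U_MIN) * i / N for i in range(N)]

# Scaling t by a factor `scale` shifts the pattern by ln(scale) along u.
scale = 2.0
original = [pattern(u) for u in grid]
scaled = [pattern(u + math.log(scale)) for u in grid]

mag_original = dft_magnitudes(original)
mag_scaled = dft_magnitudes(scaled)
max_deviation = max(abs(a - b) for a, b in zip(mag_original, mag_scaled))
print(f"largest magnitude difference: {max_deviation:.2e}")
```

Although the two sampled patterns differ clearly, their transform magnitudes agree up to numerical rounding, which is the sense in which such a representation does not depend on the scale factor.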
Further, A. Brückmann, F. Klefenz and A. Wünsche describe in the article “A neural net for 2D-slope and sinusoidal shape detection” (published in the CIST International Scientific Journal of Computing, ISSN 1727-6209) a neural net for pattern recognition. The described neural net may learn straight lines of different slopes or a set of sinusoidal curves of different frequencies and may recognize corresponding patterns after the learning phase. The corresponding neural net thus realizes a Hough transformation and thus enables a recognition of two-dimensional patterns.
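For orientation, the principle of a Hough transformation for straight lines can be sketched as follows. This is the classical accumulator formulation, not the neural net of the cited article; the function name, the parameter grid and the example points are illustrative assumptions. Each point votes for all lines (theta, rho) passing through it, with rho = x·cos(theta) + y·sin(theta), and the accumulator cell with the most votes identifies the line.

```python
import math


def hough_accumulate(points, n_theta=180, rho_step=1.0):
    """Collect votes in a (theta index, rho bin) accumulator."""
    votes = {}
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            # Normal form of a line: rho = x*cos(theta) + y*sin(theta)
            rho = x * math.cos(theta) + y * math.sin(theta)
            key = (t, round(rho / rho_step))
            votes[key] = votes.get(key, 0) + 1
    return votes


# Hypothetical example: 20 points on the line y = x + 5.
points = [(x, x + 5) for x in range(0, 100, 5)]
votes = hough_accumulate(points)

# The accumulator cell with the most votes describes the detected line.
(theta_index, rho_bin), count = max(votes.items(), key=lambda item: item[1])
theta = math.pi * theta_index / 180
slope = -math.cos(theta) / math.sin(theta)
print(f"theta = {theta_index} deg, rho bin = {rho_bin}, votes = {count}")
print(f"recovered slope: {slope:.3f}")
```

With the default of 180 angle steps, the angle index equals the normal angle in degrees; for the example line, all points vote for the same cell, from which the slope of the line can be recovered.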