The acoustic signal reaching a listener's ear is typically a mixture of several sounds originating from different sound sources. The human auditory system utilises a large number of simultaneous and sequential cues in the received sounds to segregate them from other sounds received at the same time (Bregman 1990). The ability to combine cues in time and frequency further allows listeners, especially normal-hearing listeners, to correctly interpret received sounds, even when they are strongly degraded, e.g. due to masking by other sounds or due to transmission via channels with poor transmission characteristics.
The Temporal Fine Structure (TFS) of a sound carries cues which in some situations may be crucial to a listener for identifying and locating the sound source, as well as for understanding the meaning of the sound (Hopkins, Moore and Stone 2008). The TFS also carries cues that allow segregation of sounds from multiple sound sources. For instance, Andersen and Kristensen have shown that normal-hearing listeners benefit—in terms of speech recognition thresholds—from both monaural and binaural TFS cues in a difficult listening situation with 3 spatially separated speakers (Andersen et al. 2010).
Recent experiments have shown that, compared to normal-hearing listeners, hearing-impaired listeners have a reduced sensitivity to TFS cues in acoustic signals (Hopkins and Moore 2007; Moore and Sek 2009) and are less able to utilise TFS cues in difficult listening situations with two simultaneous speakers (Hopkins, Moore and Stone 2008; Lunner et al. 2011). The stimuli in the TFS 1 Test (Moore and Sek 2009) were presented at positive sensation levels (i.e. above the individual hearing threshold), and the reduced sensitivity is therefore probably not caused by limited audibility of the stimuli, especially as normal-hearing listeners' performance did not improve with increased sensation levels (Moore and Sek 2009). Furthermore, there is growing evidence that aging also contributes to limiting the access to TFS cues (Hopkins and Moore 2011; Ruggles, Bharadwaj and Shinn-Cunningham 2011).
Naturally occurring sounds are typically time-varying signals with spectral components occupying a relatively wide portion of the audible frequency range. To facilitate decoding of cues from a sound, all its spectral components should preferably be conveyed to the listener without distortion. This is, however, not always possible. It is, for instance, quite common that portions of the spectrum of a useful sound are masked by other sounds or noise and/or attenuated by band-limited sound transmission channels.
Poor signal quality decreases the human auditory system's ability to correctly decode cues in the sounds. To compensate for this decrease, the listener has to employ cognitive skills and, e.g., exploit redundancies in spoken words in order to understand what is said. Poor sound quality may thus not only reduce intelligibility and lead to misunderstandings, but also stress the listener and reduce the listener's general awareness. Many audio systems therefore comprise means for reducing or preventing noise in the processed sound as well as means for avoiding loss of spectral components during narrowband transmission. The methods that are traditionally employed to achieve such improvement of the sound quality include noise reduction, the use of directional microphones, and the use of algorithms for bandwidth compression and decompression.
In hearing aids, the use of noise reduction and directional microphones allows increasing the signal-to-noise ratio (SNR) by attenuating audio signals in which the listener is assumed not to be interested. The decision as to what is interesting may be based on assuming that the target (the source of the useful sound) is in front of the listener and that maskers (the noise sources) are behind the listener, cf. (Boldt et al. 2008), and/or on a discrimination between speech and noise, cf. (Elberling, Ekelid and Ludvigsen 1991). In many situations that comply with these assumptions, such methods may be beneficial for hearing-impaired listeners. However, in other situations, such methods may provide limited benefits, e.g. if all sounds are speech and appear in front of the listener. Furthermore, if the listener is actually interested in dividing the attention among multiple sound sources, attenuation of some of the sources may be disadvantageous.
Frequency transposition and non-linear frequency compression (Neher and Behrens 2007) may enhance hearing-impaired listeners' access to multiple sound sources in situations that do not comply with the above mentioned assumptions. Similar benefits may be achieved by enhancing the spectral contrast with critical-band compression (Yasu et al. 2008), where the frequency content of each critical band is compressed to decrease the width of the basilar membrane excitation and thus decrease the spectral masking effects. A common side effect of such methods is, however, that harmonic relations between partials of the sound are broken.
Note that in the present context, the term “partials” refers to the fundamental frequency and its harmonics or overtones in a composite spectrum.
Listeners generally tend to pay more attention to loud sources than to quiet sources. A well-known and very simple means for increasing the intelligibility of speech is thus to increase its loudness relative to other sounds. The same applies to other useful sounds to which it is desired to draw the attention of a listener. A simple increase in the sound pressure level of a useful sound is, however, not always practical. It may, e.g., lead to increased power consumption and/or distortion in the audio systems, earlier occurrence of listener fatigue, disturbance of others, amplification of noise accompanying the useful sound, etc.
Humans are generally able to order sounds according to their loudness, which is a subjective measure of the perceived strength of the sound. When two sound sources are located equally far away, a listener will typically rate the strengths of the sound sources in the same order as the loudness of sounds received from the respective sound sources. If the distances to the sound sources differ, listeners normally non-consciously compensate for the effects of different transmission paths when rating the strengths of the sound sources. Listeners are thus typically able to correctly rate a far, loud sound source stronger than a near, weak sound source, even when the listener actually receives the sound from the weak source at a higher sound pressure level than the sound from the loud source.
The mechanisms behind the above described human ability to compensate for different distances are not completely known. John M. Chowning suggested a model called “auditory perspective” as a basis for understanding some of the mechanisms (Chowning 2000). According to Chowning, the listener's auditory system uses various cues in received sounds to place the sources of the sounds at different distances and determines loudness of the sources analogously to how the visual system functions. Chowning suggests that useful loudness or distance cues may include e.g. spectral envelope shape, timbral definition and the amount of reverberation.
Within the context of the present patent application, the above described subjective measure of the perceived strength of a sound source is termed “apparent loudness”. In other words, the apparent loudness of a sound source is a subjective measure of the perceived strength of the sound source after (non-conscious) compensation for the distance between the sound source and the listener. Correspondingly, the apparent loudness of a sound equals the apparent loudness of the sound source producing the sound.
Moore's loudness model attempts to provide an objective measure of the subjectively perceived loudness. It predicts the loudness of a given sound as the sum of the loudness of each critical band, where the loudness of each critical band is computed as an energy summation of the signal content in the critical band. The model includes the level compression performed by the auditory system (Moore and Glasberg 2004). A simplified version of the model is:

$$L = \sum_{c=1}^{C} \left| \sqrt{\sum_{k \in K(c)} \left| F(A(k)) \right|^{2}} \right|, \qquad (1)$$

where L is the loudness in dB, C is the number of critical bands, K(c) the set of centre frequencies within each of the critical bands, F the compressive cochlear function, and A the magnitude of the spectrum within the respective critical band. The applicability of the model requires that the spectrum be sampled with sufficient frequency resolution relative to the critical bandwidths. Moore's loudness model does not include distance compensation and does thus not predict the apparent loudness.
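As an illustration only, the summation in equation (1) may be sketched in a few lines of Python. The text does not specify the compressive cochlear function F, so the power law used below (and its exponent of 0.3) is purely an assumed placeholder, as are the function names `compress` and `simplified_loudness`. The numerical scale of the result therefore depends entirely on the F that is actually chosen.

```python
import math


def compress(magnitude, exponent=0.3):
    """Assumed stand-in for the compressive cochlear function F.

    The source does not define F; a power law is used here only so that
    the sketch can be run. Magnitudes are expected to be non-negative.
    """
    return magnitude ** exponent


def simplified_loudness(band_magnitudes, exponent=0.3):
    """Evaluate the double sum of equation (1).

    band_magnitudes holds one list per critical band c; each inner list
    holds the spectral magnitudes A(k) for the frequencies k in K(c).
    """
    total = 0.0
    for magnitudes in band_magnitudes:
        # Energy summation of the compressed content within one band.
        energy = sum(compress(a, exponent) ** 2 for a in magnitudes)
        # Outer magnitude as written in equation (1); it is redundant
        # for real, non-negative inputs but kept to mirror the formula.
        total += abs(math.sqrt(energy))
    return total
```

With the exponent set to 1 (no compression), a single band containing the magnitudes 3 and 4 contributes the square root of 9 + 16, i.e. 5, which may serve as a quick sanity check of the summation.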
In an earlier article, Chowning disclosed a method for synthesising sounds of musical instruments, wherein the sounds are generated by means of combined frequency modulation (FM) and amplitude modulation (AM) (Chowning 1973). The modulation is controlled by a set of parameters, which specify e.g. the duration of the sound, the amplitude, the carrier frequency, the modulating frequency and the frequency modulation index (FM index). Chowning found that the vividness of some synthesised instrument sounds, particularly of synthesised brass instrument sounds, could be substantially improved by varying the FM index over time. The proposed variations are relatively simple, e.g. linear, exponential and hyperbolic shifts, and are obtained by generating an FM index signal in a generator controlled by a few parameters in the parameter set. Varying the FM index over time has a substantial impact on the time variation of the synthesised sound spectra, and Chowning hypothesised that the general character of the evolution of the frequency components over time is more important for the subjective impression of the synthesised sounds than the amplitude curve for each frequency component. Chowning further disclosed multiple parameter sets, which may be used to achieve realistic synthesis of several different types of musical instruments. Starting and/or ending points for the time-varied modulation indices are typically about unity or larger. Later, Chowning improved the synthesis of voiced sounds using the same method but different modulation signals (Chowning 1980).
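The principle of an FM tone with a time-varying FM index may be sketched as follows. This is a minimal illustration and not a reproduction of Chowning's parameter sets: the function names `linear_index` and `fm_tone`, the choice of a linear index shift, the constant amplitude (i.e. no AM envelope) and all numerical values are assumptions made for the sake of the example.

```python
import math


def linear_index(i, n_samples, index_start, index_end):
    """Linearly shifted FM index at sample i (one of the simple shapes)."""
    if n_samples <= 1:
        return index_start
    fraction = i / (n_samples - 1)
    return index_start + (index_end - index_start) * fraction


def fm_tone(n_samples, sample_rate, carrier_hz, modulator_hz,
            index_start, index_end, amplitude=1.0):
    """Simple FM tone: amplitude * sin(wc*t + I(t) * sin(wm*t)).

    The FM index I(t) moves linearly from index_start to index_end over
    the duration of the tone. The amplitude is held constant here, which
    is a simplification of the combined FM and AM described in the text.
    """
    samples = []
    for i in range(n_samples):
        t = i / sample_rate
        index = linear_index(i, n_samples, index_start, index_end)
        phase = (2.0 * math.pi * carrier_hz * t
                 + index * math.sin(2.0 * math.pi * modulator_hz * t))
        samples.append(amplitude * math.sin(phase))
    return samples
```

Calling, for instance, `fm_tone(8000, 8000.0, 440.0, 440.0, 0.0, 5.0)` yields one second of a tone whose spectrum widens progressively as the index rises from 0 to 5; with both index values set to 0 the result reduces to a plain sinusoid at the carrier frequency.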
Lazzarini and Timoney disclosed a variant of the above mentioned FM synthesis, called modified frequency modulation (ModFM) (Lazzarini and Timoney 2010). ModFM is based on a modified version of the classic FM formula and produces frequency-modulated signals wherein the distribution of spectral components varies with a more predictable dependence on the frequency modulation index than in the classic FM. This allows ModFM to provide a more natural-sounding synthesis of musical instruments.
As Chowning also pointed out, the principles of FM and the influence of the FM index on the spectral content of the modulated signals are well known from the field of radio signal transmission. In this field, frequency modulation with modulation indices above unity is generally known as “wideband frequency modulation”.
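For reference, the well-known dependence of the spectrum on the FM index for a single sinusoidal modulator may be written as below. The symbols are introduced here for illustration only: A is the amplitude, ω_c and ω_m are the carrier and modulating angular frequencies, I is the FM index, J_n is the Bessel function of the first kind of order n, and f_m is the modulating frequency in Hz.

$$A \sin\bigl(\omega_c t + I \sin(\omega_m t)\bigr) = A \sum_{n=-\infty}^{\infty} J_n(I)\, \sin\bigl((\omega_c + n\,\omega_m)\, t\bigr)$$

The side components thus lie at the carrier frequency plus and minus integer multiples of the modulating frequency, with amplitudes governed by J_n(I). As a rule of thumb (Carson's rule), most of the signal power lies within a bandwidth of approximately

$$B \approx 2\,(I + 1)\, f_m ,$$

so that an index above unity spreads the power over several side components on each side of the carrier.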
The simplest form of oral communication involves a speaking person (the speaker) and a listening person (the listener). The speaker transforms a message into speech, i.e. sound, and transmits the speech into the air. In the air, the speech is normally mixed with other sounds before it reaches the listener's ears. In order to understand the message, the listener thus has to derive or decode it from the mixture of sounds. Errors in the decoding process may obviously lead to misinterpretation of the message.
The physical generation of speech is a complex process, which among other things involves the larynx with the vocal cords and the vocal tract of the speaker. The current state of the art suggests that slow, correlated FM and AM are produced in natural speech (Teager 1980; Teager and Teager 1990; Bovik, Maragos and Quatieri 1993; Maragos, Kaiser and Quatieri 1993 A; Maragos, Kaiser and Quatieri 1993 B; Zhou, Hansen and Kaiser 2001), and that the FM cues are important for allowing normal-hearing listeners to decode speech in situations with negative SNR, whereas FM extraction may be impaired among people with cochlear impairment (Moore and Skrodzka 2002; Heinz et al. 2010). Hearing-impaired listeners can, however, utilise the AM cues (Hopkins, Moore and Stone 2008).
In situations with competing sounds, speakers tend to modify their voice to increase its clarity. This is usually referred to as vocal effort, clear speech (Lindblom 1996) or the Lombard effect, after Etienne Lombard, who discovered the effect in 1909. Lindblom reports that in short-duration vowels, the centre frequency of the second formant deviates from its target value (Lindblom 1996). Folk and Schiel report that the average and the dynamic range of the fundamental frequency (f0) increase with rising noise level, as do the average intensity and the dynamic range of the intensity, while the speaking rate decreases (Folk and Schiel 2011). For many natural sounds, increased intensity is also accompanied by increased bandwidth (Chowning 2000). This dependency of the speaker's voice on the noise level presents a major challenge for Automatic Speech Recognition (ASR). For instance, ASR systems cannot be reliably tested simply by feeding them with sounds mixed from clean-speech libraries and noise libraries (Winkler 2011).
Potamianos and Maragos disclosed methods for speech analysis and synthesis, wherein speech is modelled by a sum of AM-FM modulated signals, each signal representing a speech formant (Potamianos and Maragos 1999).
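The general idea of modelling a signal as a sum of AM-FM modulated components may be sketched as follows. This is not the analysis or synthesis method of Potamianos and Maragos, only a generic illustration of the signal model: each component is described by an amplitude envelope and an instantaneous frequency, and its phase is obtained by accumulating the instantaneous frequency. The function name `am_fm_sum` and the representation of components as pairs of Python callables are assumptions made for this example.

```python
import math


def am_fm_sum(components, n_samples, sample_rate):
    """Sum of AM-FM components, one per (hypothetical) formant.

    components is a list of (amplitude_fn, frequency_fn) pairs, where
    amplitude_fn(t) gives the amplitude envelope and frequency_fn(t) the
    instantaneous frequency in Hz at time t in seconds.
    """
    phases = [0.0] * len(components)
    output = []
    for i in range(n_samples):
        t = i / sample_rate
        value = 0.0
        for k, (amplitude_fn, frequency_fn) in enumerate(components):
            value += amplitude_fn(t) * math.cos(phases[k])
            # Accumulate phase as a running sum of instantaneous frequency.
            phases[k] += 2.0 * math.pi * frequency_fn(t) / sample_rate
        output.append(value)
    return output
```

A component with a slowly wobbling centre frequency could, for example, be passed as `(lambda t: 1.0, lambda t: 500.0 + 20.0 * math.sin(2.0 * math.pi * 5.0 * t))`, where the values 500 Hz, 20 Hz and 5 Hz are arbitrary illustrative numbers rather than measured formant data.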
It is further known from hearing aids of the cochlear-implant (CI) type that the audio signal is made available to the hearing-aid user by extracting FM information and presenting this information in an FM modulated carrier signal with a relatively narrow bandwidth (Nie, Stickney and Zeng 2005; Zeng et al. 2005; Zeng and Nie 2007).