Auditory Events and Auditory Event Detection
The division of sounds into units or segments perceived as separate and distinct is sometimes referred to as “auditory event analysis” or “auditory scene analysis” (“ASA”). The segments are sometimes referred to as “auditory events” or “audio events.” Albert S. Bregman, “Auditory Scene Analysis—The Perceptual Organization of Sound” (Massachusetts Institute of Technology, 1991, Fourth printing, 2001, Second MIT Press paperback edition) extensively discusses auditory scene analysis. In addition, Bhadkamkar et al., U.S. Pat. No. 6,002,776 (Dec. 14, 1999) cites publications dating back to 1976 as “prior art work related to sound separation by auditory scene analysis.” However, Bhadkamkar et al. discourages the practical use of auditory scene analysis, concluding that “[t]echniques involving auditory scene analysis, although interesting from a scientific point of view as models of human auditory processing, are currently far too computationally demanding and specialized to be considered practical techniques for sound separation until fundamental progress is made.”
Crockett and Crockett et al. in the various patent applications and papers listed above identify auditory events. Those documents teach dividing an audio signal into auditory events (each tending to be perceived as separate and distinct) by detecting changes in spectral composition (amplitude as a function of frequency) with respect to time. This may be done, for example, by calculating the spectral content of successive time blocks of the audio signal, comparing the spectral content between successive time blocks and identifying an auditory event boundary as the boundary between blocks where the difference in the spectral content exceeds a threshold. Alternatively, changes in amplitude with respect to time may be calculated instead of or in addition to changes in spectral composition with respect to time.
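The block-comparison procedure described above can be sketched as follows. This is a minimal illustration, not the method of the cited documents: the block size, the threshold value, the naive DFT, the sum-of-absolute-differences measure, and the function names are all assumptions chosen for brevity. Normalizing each spectrum to unit sum is likewise an assumption, made so that the comparison does not depend on absolute signal level.

```python
import cmath
import math


def magnitude_spectrum(block):
    """Return DFT magnitudes for bins 0..n/2 of `block` (naive O(n^2) DFT)."""
    n = len(block)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc))
    return mags


def normalize(mags, eps=1e-12):
    """Scale a spectrum so its magnitudes sum to (about) one."""
    total = sum(mags) + eps
    return [m / total for m in mags]


def detect_event_boundaries(signal, block_size=64, threshold=0.5):
    """Return indices of blocks that begin a new auditory event.

    A boundary is reported at block i when the summed absolute difference
    between the normalized spectra of blocks i-1 and i exceeds `threshold`.
    Both `block_size` and `threshold` are illustrative values.
    """
    boundaries = []
    previous = None
    for i in range(len(signal) // block_size):
        block = signal[i * block_size:(i + 1) * block_size]
        current = normalize(magnitude_spectrum(block))
        if previous is not None:
            difference = sum(abs(a - b) for a, b in zip(current, previous))
            if difference > threshold:
                boundaries.append(i)
        previous = current
    return boundaries


# Usage: four blocks of a tone in DFT bin 4, then four blocks in bin 12.
tone_a = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64 * 4)]
tone_b = [math.sin(2 * math.pi * 12 * t / 64) for t in range(64 * 4)]
print(detect_event_boundaries(tone_a + tone_b))
```

In this toy signal the only spectral change occurs where the second tone begins, so a single boundary is reported at the fifth block (index 4). A practical implementation would use an FFT and windowed blocks; the naive DFT is used here only to keep the sketch free of external dependencies.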
The auditory event boundary markers are often arranged into a temporal control signal whose value, typically in the range zero to one, indicates the strength of the event boundary. Furthermore, this control signal is often filtered so that the strength at each event boundary is retained, while the values in the time intervals between event boundaries are calculated as decaying values of the preceding event boundary strength. This filtered auditory event strength is then used by other audio processing methods, including automatic gain control and dynamic range control.
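One simple way to realize such a filtered control signal is a hold-and-decay recursion, sketched below. The exponential decay, the `decay` parameter, and the function name are assumptions for illustration; the text above does not specify the filter used.

```python
def event_control_signal(boundary_strengths, decay=0.9):
    """Hold-and-decay filter for per-block boundary strengths in [0, 1].

    At each block the output is the larger of the new boundary strength
    and the previous output multiplied by `decay` (an assumed constant).
    """
    control = []
    state = 0.0
    for strength in boundary_strengths:
        state = max(strength, state * decay)
        control.append(state)
    return control


# Usage: a strong boundary at block 1 and a weaker one at block 4.
strengths = [0.0, 1.0, 0.0, 0.0, 0.5, 0.0]
print(event_control_signal(strengths, decay=0.5))
# prints [0.0, 1.0, 0.5, 0.25, 0.5, 0.25]
```

The boundary strengths survive unchanged at blocks 1 and 4, and the blocks in between carry decayed copies of the most recent boundary strength.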
Dynamics Processing of Audio
The techniques of automatic gain control (AGC) and dynamic range control (DRC) are well known and common in many audio signal paths. In an abstract sense, both techniques measure the level of an audio signal and then gain-modify the signal by an amount that is a function of the measured level. In a linear, 1:1 dynamics processing system, the input audio is not processed and the output audio signal ideally matches the input audio signal. Now consider an audio dynamics processing system that automatically measures the level of the input signal and uses that measurement to control the output signal. If the input signal rises in level by 6 dB and the processed output signal rises in level by only 3 dB, then the output signal has been compressed by a ratio of 2:1 with respect to the input signal.
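The 2:1 arithmetic above can be checked with a small static compression curve. This is a sketch under assumptions: the threshold of -20 dB, the hard knee, and the function names are hypothetical and are not taken from the text.

```python
def compressed_level_db(input_db, threshold_db=-20.0, ratio=2.0):
    """Static output level of a hard-knee compressor, in dB.

    Below the (assumed) threshold the system is linear, 1:1. Above it,
    each `ratio` dB of input rise yields 1 dB of output rise.
    """
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio


def gain_db(input_db, threshold_db=-20.0, ratio=2.0):
    """Gain, in dB, that the compressor applies at a given input level."""
    return compressed_level_db(input_db, threshold_db, ratio) - input_db


# Usage: the input rises 6 dB above threshold, from -20 dB to -14 dB.
rise_out = compressed_level_db(-14.0) - compressed_level_db(-20.0)
print(rise_out)        # prints 3.0, i.e. a 3 dB output rise
print(gain_db(-14.0))  # prints -3.0, i.e. 3 dB of gain reduction
```

With `ratio=1.0` the function returns its input unchanged, which corresponds to the linear, 1:1 case.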
In Crockett and Seefeldt, auditory scene analysis improves the performance of AGC and DRC methods by minimizing the change in gain between auditory event boundaries, and confining much of the gain change to the neighborhood of an event boundary. It does this by modifying the dynamics-processing release behavior. In this way, auditory events sound consistent and natural.
Notes played on a piano are an example. With conventional AGC or DRC methods, the gain applied to the audio signal increases during the tail of each note, causing each note to swell unnaturally. With auditory scene analysis, the AGC or DRC gain is held constant within each note and changes only near the onset of each note where an auditory event boundary is detected. The resulting gain-adjusted audio signal sounds natural as the tail of each note dies away.
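The idea of confining gain increases to the neighborhood of event boundaries can be sketched as a gain smoother whose release step is scaled by an event control signal. This is only an illustration of the principle, not the method of Crockett and Seefeldt: the one-pole update, the `max_step` parameter, the choice to leave gain decreases ungated, and the function name are all assumptions.

```python
def event_gated_gain(target_gains_db, event_control, max_step=0.5):
    """Smooth a per-block target gain, gating releases by event strength.

    target_gains_db: gain (dB) a conventional AGC/DRC would request.
    event_control:   per-block event strength in [0, 1]; zero away from
                     event boundaries.
    max_step:        assumed fraction of the remaining distance to the
                     target covered in one block at full event strength.
    """
    gain = float(target_gains_db[0])
    smoothed = []
    for target, control in zip(target_gains_db, event_control):
        if target > gain:
            # Release (gain increase): allowed only near event boundaries.
            gain += control * max_step * (target - gain)
        else:
            # Attack (gain decrease): not gated in this sketch.
            gain += max_step * (target - gain)
        smoothed.append(gain)
    return smoothed


# Usage: the target gain jumps to +6 dB during a note's decaying tail,
# but an event boundary (a new note onset) occurs only at block 2.
targets = [0.0, 6.0, 6.0, 6.0]
control = [0.0, 0.0, 1.0, 0.0]
print(event_gated_gain(targets, control))
# prints [0.0, 0.0, 3.0, 3.0]
```

The gain stays at 0 dB through the tail of the note, moves toward the target at the onset, and is then held again, which mirrors the piano example above.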
Typical implementations of auditory scene analysis (as in the references above) are deliberately level invariant. That is, they detect auditory event boundaries regardless of absolute signal level. While level invariance is useful in many applications, some auditory scene analyses benefit from some level dependence.
One such case is the method described in Crockett and Seefeldt. There, ASA control of AGC and DRC prevents large gain changes between auditory event boundaries. However, longer-term gain changes can still be undesirable on some types of audio signals. When an audio signal goes from a louder to a quieter section, the AGC or DRC gain, constrained to change only near event boundaries, may allow the level of the processed audio signal to rise undesirably and unnaturally during the quiet section. This situation occurs frequently in films where sporadic dialog alternates with quiet background sounds. Because the quiet background audio signal also contains auditory events, the AGC or DRC gain is changed near those event boundaries, and the overall audio signal level rises.
Simply weighting the importance of auditory events by a measure of the audio signal level, power or loudness is undesirable. In many situations the relationship between the signal measure and absolute reproduction level is not known. Ideally, a measure discriminating or detecting perceptually quieter audio signals independent of the absolute level of the audio signal would be useful.
Here, “perceptually quieter” refers not to quieter on an objective loudness measure (as in Seefeldt et al. and Seefeldt) but rather quieter based on the expected loudness of the content. For example, human experience indicates that a whisper is a quiet sound. If a dynamics processing system measures this to be quiet and consequently increases the AGC gain to achieve some nominal output loudness or level, the resulting gain-adjusted whisper would be louder than experience says it should be.