Psychoacoustic modeling is a heavily researched field of signal processing concerned with machine modeling of the human auditory system. The human ear transforms sound pressure waves traveling through air into nerve pulses sent to the brain, where the sound is perceived. While the ability to perceive sounds and differences between sounds varies from one person to the next, researchers in the field of psychoacoustics have developed generalized models of the human auditory system (HAS) through extensive listening tests. These tests produce audibility measurements, which, in turn, have led to the construction of perceptual models that estimate a typical human listener's ability to perceive sounds and differences between sounds.
These models derived from human listening tests are, in turn, adapted for use in automated signal processing methods in which a programmed processor or signal processing circuit estimates audibility from audio signals. This audibility is specified in terms of sound quantities like sound pressure, energy or intensity, or a ratio of such quantities to a reference level (e.g., decibels (dB)), at particular frequencies and time intervals. A common way of representing in a machine the limits of what a human can hear is a hearing threshold indicating a level under which a particular sound is estimated by the machine to be imperceptible to humans. The threshold is often relative to a particular signal level, such as the level at which a sound is imperceptible relative to another sound. The threshold need not be relative to a reference signal, but instead may simply provide a threshold level, e.g., an energy level, indicating the level below which sounds are predicted not to be perceptible by a human listener.
The intensity range of human hearing is quite large. The human ear can detect pressure changes from as small as a few micropascals to greater than 1 bar. As such, sound pressure level (SPL) is often measured logarithmically, with pressures referenced to 20 micropascals (μPa). The lower limit of audibility is defined as 0 dB. The following logarithmic unit of sound intensity ratios is commonly used in psychoacoustics:
$$\mathrm{SPL} = 10 \log_{10}\frac{I}{I_{\mathrm{ref}}} = 20 \log_{10}\frac{p}{p_{\mathrm{ref}}}$$
Here, p is the sound pressure and p_ref is the reference sound pressure, usually selected to be 20 μPa, which is roughly equal to the sound pressure at the threshold of hearing for frequencies around 4 kHz. The variable I is the sound intensity, and it is usually taken to be the square of the magnitude of the frequency component in question; I_ref is the corresponding reference intensity. In order to compute the SPL, the exact playback level of the audio signal should be known. This is usually not the case in practice. Hence, it is assumed that the smallest signal intensity that can be represented by the audio system (e.g., the least significant bit or LSB of a digitized or quantized audio signal) corresponds to an SPL of 0 dB in the hearing threshold, or threshold in quiet. A 0 dB SPL is found in the vicinity of 4 kHz in the threshold of hearing curve. Implementations of psychoacoustic models sometimes convert audio signal intensity to SPL, but need not do so. Where necessary, audio signal intensity may be converted to SPL for processing in the SPL domain, and the result may then be converted back to intensity.
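The SPL relationship and the LSB convention described above can be illustrated with a minimal Python sketch. The function names, and the choice of a 16-bit full-scale value in the usage note, are illustrative assumptions and are not taken from the text.

```python
import math

P_REF = 20e-6  # reference sound pressure in pascals (20 micropascals)

def spl_from_pressure(p, p_ref=P_REF):
    """SPL in dB from a sound pressure p (same units as p_ref)."""
    return 20.0 * math.log10(p / p_ref)

def spl_from_intensity(i, i_ref):
    """SPL in dB from an intensity i relative to a reference intensity."""
    return 10.0 * math.log10(i / i_ref)

def pressure_from_spl(spl_db, p_ref=P_REF):
    """Inverse of spl_from_pressure: convert dB SPL back to pressure."""
    return p_ref * 10.0 ** (spl_db / 20.0)

def spl_from_sample_amplitude(amplitude, lsb=1.0):
    """SPL under the assumption that an amplitude of one LSB is 0 dB SPL.

    This mirrors the convention in the text for the case where the
    true playback level is unknown; it is not a calibrated measurement.
    """
    return 20.0 * math.log10(abs(amplitude) / lsb)
```

Under the LSB convention, a full-scale 16-bit amplitude of 32768 maps to roughly 90 dB SPL, since each bit contributes about 6 dB.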
In some applications of psychoacoustic modeling, the absolute threshold of hearing is also used to predict audibility of a sound. The minimum threshold at which a sound can be heard is frequency dependent and is expressed as an absolute threshold of hearing (ATH) curve of thresholds varying with frequency. Automated psychoacoustic modeling applies this minimum threshold curve by assuming that any sound measured to be below it is inaudible. However, such automated application of ATH sometimes involves assumptions on the volume levels used for playback. If these assumptions do not hold, there is a risk that the distortions made to an audio signal in a digital signal processing operation based on the assumptions will cause unwanted audible artifacts.
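As an illustration of how an ATH curve may be applied by a machine, the following sketch uses a closed-form approximation of the threshold in quiet that is widely attributed to Terhardt and commonly used in perceptual audio coding. The text does not specify any particular ATH formula, so this choice, along with the function names, is an assumption made for illustration only.

```python
import math

def ath_db_spl(f_hz):
    """Approximate absolute threshold of hearing in dB SPL at f_hz.

    Uses a commonly cited closed-form approximation (an assumption
    here, not a formula from the text). Requires f_hz > 0.
    """
    khz = f_hz / 1000.0
    return (3.64 * khz ** -0.8
            - 6.5 * math.exp(-0.6 * (khz - 3.3) ** 2)
            + 1e-3 * khz ** 4)

def is_below_ath(level_db_spl, f_hz):
    """True if a component at f_hz is predicted inaudible in quiet."""
    return level_db_spl < ath_db_spl(f_hz)
```

Because the comparison depends on `level_db_spl`, any error in the assumed playback level shifts every decision made by `is_below_ath`, which is the risk noted above.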
Frequency scales derived from listening experiments are approximately logarithmic in frequency at the high frequencies and approximately linear at the low end. The frequency range of human hearing is about 20 Hz to 20 kHz. The variation of the scale over frequency is intended to correspond approximately to the way in which the ear perceives differences among sounds at neighboring frequencies. Two examples of these scales are the mel scale and the Bark scale. The underlying theory for these frequency scales used in psychoacoustics originated, in part, with Fletcher's study of critical bands of the human ear. A critical bandwidth refers to the frequency bandwidth of an “auditory filter” created by the cochlea, the sense organ within the inner ear. Generally speaking, the critical band comprises the group of neighboring frequencies (a “band”) within which a second tone will interfere with the perception of a first tone by auditory masking. The auditory filters are an array of overlapping bandpass filters that model the sensitivity of different points along the basilar membrane to frequency ranges.
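For illustration, the mel and Bark scales may be approximated with closed-form mappings from frequency in Hz. The particular formulas below are widely used approximations (the Bark mapping is the one usually attributed to Zwicker and Terhardt); they are assumptions for this sketch, as the text does not prescribe specific formulas.

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to mels using a common approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    """Map frequency in Hz to Barks using a common approximation."""
    return (13.0 * math.atan(0.00076 * f_hz)
            + 3.5 * math.atan((f_hz / 7500.0) ** 2))
```

Both mappings grow nearly linearly at low frequencies and compress at high frequencies, which matches the behavior described above.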
Another concept associated with the auditory filter is the equivalent rectangular bandwidth (ERB). The ERB is a way of expressing the relationship between the auditory filter, frequency, and the critical bandwidth. According to Moore (see B. C. J. Moore, An Introduction to the Psychology of Hearing, Emerald Group Publishing Limited, Fifth Edition, 2004, pp. 69, 73-74), the more recent measurements of critical bandwidths are referred to as ERBs to distinguish them from the older critical bandwidth measurements, which were obtained on the assumption that auditory filters are rectangular. An ERB passes the same amount of energy as the auditory filter to which it corresponds, and shows how the bandwidth of that filter changes with input frequency.
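The dependence of the ERB on frequency may be sketched with the approximation commonly attributed to Glasberg and Moore. The text gives no formula, so the constants and function names below are assumptions used only to illustrate the concept.

```python
import math

def erb_bandwidth_hz(f_hz):
    """Approximate ERB in Hz of the auditory filter centered at f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_number(f_hz):
    """Approximate position of f_hz on the ERB-number scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)
```

With these constants, the filter centered at 1 kHz has an ERB of about 133 Hz, and the bandwidth widens roughly in proportion to center frequency above that.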
A significant aspect of HAS modeling, in particular, is modeling masking effects. Masking refers to the psychoacoustic phenomenon in which an otherwise audible sound is rendered inaudible by another sound. Temporal masking refers to a sound masking sounds that occur before or after it in time. Simultaneous masking refers to sounds that mask other sounds occurring at approximately the same time and close together in frequency, based on a rationale similar to that of critical bands and subsequent research. It is often modeled through frequency domain analysis in which one sound type, such as a tone or noise-like sound, masks another tone or noise-like sound.
Within this document, we refer to sounds that mask other sounds as “maskers,” and sounds that are masked by other sounds as “maskees.” Most real world audio signals are complex sounds, meaning that they are composed of multiple maskers and multiple maskees. Within these complex sounds, many of the maskees are above the masking threshold.
Despite extensive research and application of HAS models, the masking phenomenon of complex sounds is still poorly understood. In ongoing research, there is controversy in the interpretation of masking even for the simplest case of several individually spaced sinusoids in the presence of background noise. Even for this case, there is a lack of clarity as to whether the presence of multiple maskers within a local frequency neighborhood not exceeding the critical bandwidth, or the ERB, will increase the masking threshold due to a cumulative effect or will not noticeably alter it. For additional information, see B. C. J. Moore, An Introduction to the Psychology of Hearing, Emerald Group Publishing Limited, Fifth Edition, 2004, pp. 78-83. Recent research has demonstrated the role of several perceptual attributes of maskers in influencing the nature of masking. Some of these attributes include saliency of the masker, the nature of masker intensity fluctuations across frequency, inter-aural disparities, and so on, as described in K. Egger, Perception and Neural Representation of Suprathreshold Signals in the Presence of Complex Maskers, Diploma Thesis, Graz University of Technology, 2012. Inadequate understanding of the masking phenomenon of complex signals is a key reason for the discrepancy between the actual expert-level (“golden ears”) perception of masking and the masking thresholds obtained by state-of-the-art psychoacoustic models.
Masking is generally applied using a warped frequency scale such as the Bark scale or the ERB scale, both of which correspond better to the frequency processing inherent in the human auditory system than the linear frequency scale does. State-of-the-art audio perceptual models approximate the masking of complex sounds either by decimating maskers (eliminating the less dominant ones) occurring within a local frequency neighborhood, or by partitioning the frequency space and pooling (usually additively) the signal energy within a partition to create a single masker per partition. Both of these approaches lead to a coarse representation of the final mask due to a reduction in the frequency resolution of the mask generation process. The loss in frequency resolution often manifests itself as roughness in the sound perception.
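The two conventional approximations described above may be sketched as follows. The partition edges are given as hypothetical indices into a list of per-component energies; how partitions are actually chosen (e.g., on a Bark or ERB scale) is outside this sketch, and the function names are illustrative.

```python
def pool_partition_energies(energies, edges):
    """Additively pool energy in each partition [edges[k], edges[k+1]).

    Returns one pooled masker energy per partition.
    """
    return [sum(energies[lo:hi]) for lo, hi in zip(edges[:-1], edges[1:])]

def decimate_maskers(energies, edges):
    """Keep only the dominant component in each partition, zero the rest."""
    out = [0.0] * len(energies)
    for lo, hi in zip(edges[:-1], edges[1:]):
        if hi > lo:
            k = max(range(lo, hi), key=lambda i: energies[i])
            out[k] = energies[k]
    return out
```

In both cases a partition of several components is reduced to a single masker, which is the loss of frequency resolution noted above.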
One aspect of the invention is a method for generating a psychoacoustic model from an audio signal. In this method, the masking energy derived for a group of frequency components is allocated to components within the group in a process referred to as “Energy Adaptation.” A block of samples of an audio signal is transformed into a frequency spectrum comprising frequency components. From the frequency spectrum, the method derives group masking energies. The group masking energies each correspond to a group of neighboring frequency components in the frequency spectrum. For each of plural groups of neighboring frequency components, the method allocates the group masking energy to the frequency components in the corresponding group in proportion to the energy of the frequency components within that group. The output of this process comprises adapted mask energies for the frequency components within each group. These adapted mask energies provide masking thresholds for the psychoacoustic model of the audio signal.
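The proportional allocation step may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the component energies and the group masking energy are taken as already computed, the function name is hypothetical, and the even split used when a group has no energy is a fallback chosen for this sketch rather than something the text specifies.

```python
def adapt_group_mask(component_energies, group_mask_energy):
    """Allocate a group's masking energy to its frequency components.

    Each component receives a share proportional to its own energy,
    so the adapted mask energies sum to group_mask_energy.
    """
    n = len(component_energies)
    if n == 0:
        return []
    total = sum(component_energies)
    if total <= 0.0:
        # Assumed fallback: split evenly when the group carries no energy.
        return [group_mask_energy / n] * n
    return [group_mask_energy * e / total for e in component_energies]
```

For example, a group with component energies of 1 and 3 and a group masking energy of 8 yields adapted mask energies of 2 and 6, preserving a per-component threshold instead of a single value for the whole group.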
The allocation of masking energy within a group is preferably adapted according to an analysis of the distribution of energy of the frequency components in the group. Allocations of masking energy are adapted based on the extent to which the energies of the frequency components are highly varying (e.g., spiky). For example, one implementation assesses the distribution by determining the variance and a group average of the energies of the frequency components within a group. In a group where the variance exceeds a threshold, this method compares the adapted mask energies of the frequency components with the group average. For each frequency component in the group whose adapted mask energy exceeds the group average, the method sets the group average as the masking threshold for that frequency component.
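The variance-based adjustment may be sketched as below. Several details are assumptions made for illustration: the use of population variance, the reading that both the variance and the group average are computed from the component energies, and the function and parameter names. The value of the variance threshold is left to the caller, since the text does not give one.

```python
from statistics import mean, pvariance

def limit_spiky_group(component_energies, adapted_masks, variance_threshold):
    """Cap adapted mask energies at the group average in spiky groups.

    If the variance of the component energies exceeds the threshold,
    any adapted mask energy above the group average is replaced by
    the group average; otherwise the masks are returned unchanged.
    """
    if not component_energies:
        return list(adapted_masks)
    if pvariance(component_energies) <= variance_threshold:
        return list(adapted_masks)
    group_average = mean(component_energies)
    return [min(m, group_average) for m in adapted_masks]
```

The cap only ever lowers a threshold, so in a group dominated by one strong component the mask assigned to that component is reduced while the masks of the weaker components are left as allocated.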
There are a variety of applications where this energy adaptation provides improved performance. Generally speaking, the method provides an effective means for machine estimation of audibility of audio signals and audio signal processing operations on an input audio signal. These audibility assessments, in particular, provide for improved audio compression and improved digital watermarking, in which auxiliary digital data is encoded using the model to achieve desired robustness and perceptual quality constraints. In these applications, the adapted masking thresholds for frequency components are applied to control audibility of changes in an audio signal.
Further features and advantages will become apparent from the following detailed description and accompanying drawings.