1. Field of the Invention
Embodiments of the invention are systems and methods for determining the level of speech determined by an audio signal in a manner which corrects for, and thus reduces the effect of (is invariant to, in preferred embodiments) modification of the signal by addition of noise thereto and/or amplitude compression thereof.
2. Background of the Invention
Throughout this disclosure, including in the claims, the terms “speech” and “voice” are used interchangeably, in a broad sense to denote audio content perceived as a form of communication by a human being. Thus, “speech” determined or indicated by an audio signal may be audio content of the signal which is perceived as a human utterance upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer).
Throughout this disclosure, including in the claims, the expression “speech data” (or “voice data”) denotes audio data indicative of speech, and the expression “speech signal” (or “voice signal”) denotes an audio signal indicative of speech (e.g., which has content which is perceived as a human utterance upon reproduction of the signal by a loudspeaker).
Throughout this disclosure, including in the claims, the expression “segment” of an audio signal assumes that the signal has a first duration, and denotes a segment of the signal having a second duration less than the first duration. For example, if the signal has a waveform of a first duration, a segment of the signal has a waveform whose duration is shorter than the first duration.
Throughout this disclosure, including in the claims, the expression performing an operation “on” signals or data (e.g., filtering, scaling, or transforming the signals or data) is used in a broad sense to denote performing the operation directly on the signals or data, or on processed versions of the signals or data (e.g., on versions of the signals that have undergone preliminary filtering prior to performance of the operation thereon).
Throughout this disclosure, including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X−M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
The accurate estimation of speech level is an important signal processing component in many systems. It is used, for example, as the feedback signal for the automatic control of gain in many communications systems, and in broadcast it is used to determine and assign appropriate playback levels to program material.
Examples of conventional methods for estimating the loudness (level) of speech determined by an audio signal are described in Soulodre et al., “Objective Measures of Loudness,” presented at the 115th Audio Engineering Society Convention, 2003 (“Soulodre”).
Typical conventional speech level estimation methods operate on frequency domain audio data (indicative of an audio signal) to determine loudness levels for individual frequency bands of the audio signal. The levels then typically undergo perceptually relevant weighting (which attempts to model the transfer characteristics of the human auditory system) to determine weighted levels (the levels for some frequency bands are weighted more heavily than for some other frequency bands). For example, Soulodre discusses several types of conventional weightings of this type, including A-, B-, C-, RLB (Revised Low-frequency B), Bhp (Butterworth high-pass filter), and ATH weightings. Other conventional perceptually relevant weightings include D-weightings and M (Dolby) weightings.
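The weighting step described above can be illustrated with a minimal sketch. It assumes A-weighting (only one of the weightings named above), uses the standard analytic A-weighting curve, and applies it to per-band levels already expressed in dB. The function names, band center frequencies, and band levels are hypothetical and are not taken from Soulodre.

```python
import math


def a_weighting_db(freq_hz):
    """Gain in dB of the standard analytic A-weighting curve at freq_hz (> 0)."""
    f2 = freq_hz ** 2
    numerator = (12194.0 ** 2) * f2 ** 2
    denominator = (
        (f2 + 20.6 ** 2)
        * math.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
        * (f2 + 12194.0 ** 2)
    )
    # The +2.00 dB offset normalizes the curve to about 0 dB at 1 kHz.
    return 20.0 * math.log10(numerator / denominator) + 2.00


def weight_band_levels(band_levels_db, band_centers_hz):
    """Add the A-weighting gain to each band level (all values in dB)."""
    return [
        level + a_weighting_db(center)
        for level, center in zip(band_levels_db, band_centers_hz)
    ]


# Hypothetical example: four bands with equal unweighted levels.
band_centers_hz = [125.0, 500.0, 1000.0, 4000.0]
band_levels_db = [-30.0, -30.0, -30.0, -30.0]
weighted = weight_band_levels(band_levels_db, band_centers_hz)

for center, level in zip(band_centers_hz, weighted):
    print(f"{center:7.1f} Hz: {level:6.2f} dB")
```

With equal unweighted levels, the low-frequency band is attenuated most, which is the sense in which some bands are weighted more heavily than others.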
As described in Soulodre, the weighted levels are typically summed and averaged over time to determine an equivalent sound level (sometimes referred to as “Leq”) for each segment (e.g., frame, or N frames, where N is some number) of input audio data. For example, the level “Leq” may be computed as follows: a set of values $x_W^2 / x_{REF}^2$ is determined, where each value $x_W$ is the weighted loudness level corresponding to (e.g., produced at) a time, t, during the segment (so that each value $x_W$ is a weighted loudness level for one of the frequency bands), and $x_{REF}$ is a reference level for the frequency band; and Leq for the segment is computed to be $L_{eq} = 10 \log_{10}(I/T)$, where I is the integral of the $x_W^2 / x_{REF}^2$ values over a time interval T, and T is of sufficient duration to include the times associated with the values $x_W^2 / x_{REF}^2$ for all the frequency bands.
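As an illustration only, the Leq computation above can be approximated in discrete time by replacing the integral I with a sum over uniformly spaced samples. The sketch below assumes a single sequence of weighted values, a single reference level, and a uniform sample spacing; the function name `leq_db` and its parameters are hypothetical.

```python
import math


def leq_db(weighted_values, x_ref=1.0, dt=1.0):
    """Approximate Leq = 10*log10(I/T) for uniformly spaced weighted values.

    I is approximated by sum((x_w / x_ref)**2) * dt, and T = len(values) * dt.
    """
    if not weighted_values:
        raise ValueError("at least one value is required")
    total_time = len(weighted_values) * dt
    integral = sum((x / x_ref) ** 2 for x in weighted_values) * dt
    if integral == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 10.0 * math.log10(integral / total_time)


# A constant value equal to the reference level gives 0 dB.
constant = [1.0] * 100
# A full-scale sine over whole periods has mean square 0.5 (about -3 dB).
sine = [math.sin(2.0 * math.pi * k / 100.0) for k in range(1000)]

print(leq_db(constant))
print(leq_db(sine))
```

Because the sample spacing cancels between I and T, the result for uniformly spaced samples is simply the mean of the squared ratios expressed in dB.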
However, in traditional methods and systems for measuring the level of a speech signal (e.g., a voice segment of an audio signal), the calculated level (e.g., Soulodre's “Leq”) is highly dependent on the signal-to-noise ratio (SNR) of the signal and the type of amplitude compression applied to the signal. To appreciate this, consider a speech signal segment that has been compressed with various compression ratios, and noisy versions of each compressed version of the segment (having various signal-to-noise ratios). The speech levels (Leq) determined by the conventional loudness estimating method described in Soulodre for such compressed, noisy segments would show a significant bias due to the presence of the signal modification (compression and noise).
For example, consider FIG. 1, which is a graph of results of applying a conventional speech level estimating method to a range of input voice signals with varying levels and signal-to-noise ratios. For input voice signals having constant perceptual speech level, the conventionally estimated level has a strong bias determined by the signal-to-noise ratio, in the sense that the conventionally measured level increases as the signal-to-noise ratio decreases. In FIG. 1, the error in dB (plotted on the vertical axis) denotes the discrepancy between the conventionally measured (estimated) speech level and a reference RMS voice level calculated in the absence of noise. Thus, the graph shows that the conventionally measured level increases relative to the reference RMS voice level as the signal-to-noise ratio decreases.
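The direction of this bias can be illustrated with a simplified model. The sketch below assumes that the noise is uncorrelated with the speech, so that their powers add, and that the level is a plain power (RMS) measure. It does not reproduce the data of FIG. 1, and the function name `level_bias_db` is hypothetical.

```python
import math


def level_bias_db(snr_db):
    """Excess of the measured level (speech plus noise) over the noise-free level.

    Assumes uncorrelated noise, so measured power = speech power + noise power,
    which gives a bias of 10*log10(1 + 10**(-SNR/10)) dB.
    """
    return 10.0 * math.log10(1.0 + 10.0 ** (-snr_db / 10.0))


for snr_db in (30.0, 20.0, 10.0, 0.0):
    print(f"SNR {snr_db:5.1f} dB -> bias {level_bias_db(snr_db):.3f} dB")
```

Under these assumptions the bias is small at high signal-to-noise ratios and grows to about 3 dB when the noise power equals the speech power, consistent with the trend described above.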