1. Field of the Invention
The invention pertains to systems and methods for controlling the levels of audio objects (indicated by an audio signal) to object-dependent target levels at object-dependent rates. Typical embodiments are systems and methods for controlling the levels of audio objects (indicated by an audio signal) at object-dependent rates, where the audio objects include voices of voice conference participants, and the object-dependent rates and/or target levels depend on stored information regarding each distinct audio object.
2. Background of the Invention
Throughout this disclosure, including in the claims, the terms “speech” and “voice” are used interchangeably, in a broad sense to denote audio content perceived as a form of communication by a human being. Thus, “speech” determined or indicated by an audio signal may be audio content of the signal which is perceived as a human utterance upon reproduction of the signal by a loudspeaker (or other sound-emitting transducer).
Throughout this disclosure, including in the claims, the expression “segment” of an audio signal assumes that the signal has a first duration, and denotes a segment of the signal having a second duration less than the first duration. For example, if the signal has a waveform of a first duration, a segment of the signal has a waveform whose duration is shorter than the first duration.
Throughout this disclosure, including in the claims, the expression performing an operation “on” a signal or data (e.g., filtering, scaling, transforming, or applying gain to, the signal or data) is used in a broad sense to denote performing the operation directly on the signal or data, or on a processed version of the signal or data (e.g., on a version of the signal that has undergone preliminary filtering or pre-processing prior to performance of the operation thereon).
Throughout this disclosure, including in the claims, the expression “system” is used in a broad sense to denote a device, system, or subsystem. For example, a subsystem that implements a decoder may be referred to as a decoder system, and a system including such a subsystem (e.g., a system that generates X output signals in response to multiple inputs, in which the subsystem generates M of the inputs and the other X-M inputs are received from an external source) may also be referred to as a decoder system.
Throughout this disclosure, including in the claims, the term “processor” is used in a broad sense to denote a system or device programmable or otherwise configurable (e.g., with software or firmware) to perform operations on data (e.g., video or other image data). Examples of processors include a field-programmable gate array (or other configurable integrated circuit or chip set), a digital signal processor programmed and/or otherwise configured to perform pipelined processing on audio or other sound data, a programmable general purpose processor or computer, and a programmable microprocessor chip or chip set.
In business meetings in which audio signals (e.g., audio signals delivered by communication systems) indicative of participant speech are reproduced, an important component of the audio processing of the signals is leveling of segments of the signals which are indicative of speech of different talkers. People speak at various levels in a meeting and it is typically necessary for an audio processing system to actively adjust the levels of different segments of an audio signal to ensure that the perceived loudness of each participant's speech is consistent.
Conventional leveling systems typically employ automatic gain control (AGC) to regulate levels. However, such systems typically have fixed time constants that govern how quickly the leveling gain is decreased or increased. For a single talker (such as on a telephone) this may suffice. However, when multiple talkers are involved (such as in a business conference call), fixed time constants result in acoustically unnatural artifacts, such as voices fading in and out as the level changes rapidly when the active talker switches. Additionally, simple gain control systems can be disturbed by sudden loud activity, after which the desired signal activity or voice remains attenuated until the gain recovers, and the duration of this recovery can be noticeable and undesirable.
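The effect of fixed time constants can be illustrated with a minimal sketch. It is not taken from any particular conventional system: the function names, the one-pole smoothing form, the 20 ms frame length, the -20 dB target, and the attack and release values are all assumptions chosen for illustration.

```python
import math

def smoothing_coeff(time_constant_s, frame_s):
    """One-pole smoothing coefficient for a time constant (assumed form)."""
    return math.exp(-frame_s / time_constant_s)

def fixed_rate_agc(levels_db, target_db=-20.0, attack_s=0.5,
                   release_s=2.0, frame_s=0.02):
    """Return the per-frame leveling gain (dB) for measured input levels (dB).

    The same two time constants are used for every talker, which is the
    limitation of fixed-rate AGC discussed above.
    """
    a_attack = smoothing_coeff(attack_s, frame_s)
    a_release = smoothing_coeff(release_s, frame_s)
    gain_db = 0.0
    gains = []
    for level_db in levels_db:
        desired_db = target_db - level_db
        # Reduce gain at the attack rate, raise it at the slower release rate.
        a = a_attack if desired_db < gain_db else a_release
        gain_db = a * gain_db + (1.0 - a) * desired_db
        gains.append(gain_db)
    return gains

# Hypothetical scenario: a quiet talker (10 s), a loud talker (1 s),
# then the quiet talker again (1 s).
levels = [-40.0] * 500 + [-10.0] * 50 + [-40.0] * 50
gains = fixed_rate_agc(levels)
```

In this scenario the gain settles near the 20 dB needed for the quiet talker, is pulled down while the loud talker speaks, and has only partly recovered one second after the quiet talker resumes, so the quiet talker is heard well below the target level during the recovery.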
FIG. 1 is a block diagram of a simple conventional AGC leveling system integrated with an audio pre-processing subsystem (pre-processor 1) for pre-processing the audio signal to be leveled. Level calculation subsystem 3 performs voice detection on the pre-processed audio signal (or alternatively, on the audio signal asserted to the input of pre-processor 1) to identify at least one voice segment thereof, and for each voice segment, determines an estimated voice level for the segment. Alternatively, pre-processor 1 performs voice detection on the input audio signal to identify at least one voice segment thereof, and for each voice segment, subsystem 3 determines an estimated voice level for the segment. The leveling is achieved by determining (in subsystem 5) an updated gain for each voice segment of the pre-processed signal output from pre-processor 1, and applying (in gain stage 7) the updated gain to the corresponding voice segment. Thus, stage 7 modifies each voice segment such that the estimated voice level (determined in subsystem 3) for the segment is shifted to a predetermined target level at the output of stage 7. Optionally, subsystem 3 does not perform voice detection and instead determines an estimated level for each segment of the audio signal asserted to its input, and stage 7 modifies each segment such that the estimated level (determined in subsystem 3) for the segment is shifted to a predetermined target level.
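The level-estimate, gain-determination, and gain-application chain described for FIG. 1 can be sketched as follows for the variant without voice detection. This is a simplified illustration only: the RMS level measure, the -20 dB target, and the function names are assumptions, and the gain is computed independently per segment with no smoothing between segments.

```python
import math

def rms_db(samples):
    """Estimated level of a segment in dB (RMS measure, an assumption)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, 1e-12))

def level_segment(samples, target_db=-20.0):
    """Shift a segment's estimated level to the target level.

    Roughly corresponds to level calculation (subsystem 3), gain
    determination (subsystem 5), and gain application (stage 7).
    """
    gain_db = target_db - rms_db(samples)          # gain determination
    gain = 10.0 ** (gain_db / 20.0)                # dB to linear amplitude
    return [gain * s for s in samples]             # gain application

# Hypothetical quiet segment: a low-amplitude sinusoid.
segment = [0.01 * math.sin(2.0 * math.pi * n / 32.0) for n in range(320)]
leveled = level_segment(segment)
```

Because the gain is derived from a level measured before the gain stage and is not fed back, the sketch is a feed-forward arrangement: the leveled segment's estimated level equals the target up to rounding error.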
FIG. 2 is a block diagram of an alternative conventional leveling system which is identical to the FIG. 1 system except that the gain to achieve the desired output voice level is applied (in gain stage 9) to the input signal asserted to pre-processor 1 (rather than to the output of pre-processor 1), creating a feedback control loop. In contrast, the FIG. 1 system (and the FIG. 3 embodiment of the inventive system, to be described below) implements open-loop gain application at the output stage.
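The feedback arrangement of FIG. 2 can be illustrated with a minimal sketch in which the level is measured after the gain stage and the measured error drives the next gain update. The proportional update rule, the step size, and the names below are assumptions for illustration, and pre-processing is omitted.

```python
def feedback_leveler(input_levels_db, target_db=-20.0, step=0.1):
    """Return per-frame output levels (dB) of a feedback gain loop.

    The gain is applied before the level is measured, so each update
    sees the result of the gain applied so far.
    """
    gain_db = 0.0
    output_levels_db = []
    for level_db in input_levels_db:
        measured_db = level_db + gain_db           # level after the gain stage
        output_levels_db.append(measured_db)
        gain_db += step * (target_db - measured_db)  # correct the residual error
    return output_levels_db

# Hypothetical input: a steady talker at -35 dB for 200 frames.
outputs = feedback_leveler([-35.0] * 200)
```

With a steady input, the residual error in this sketch shrinks by a factor of (1 - step) per frame, so the output level approaches the target gradually rather than in a single step.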
It is well known how to implement pre-processing (e.g., in subsystem 1 of the system of FIG. 1 or FIG. 2, or in element 21 of the FIG. 3 system to be described below) including by performing instantaneous gain control to suppress noise and enhance voice activity (a function often referred to as noise suppression or noise reduction). Conventional noise reduction typically distinguishes between voice and noise components of the input audio signal, and applies greater gain to each identified voice component than to each identified noise component.
It is also conventional to analyze (e.g., by applying statistical analysis to) an audio signal indicative of a multiple microphone soundfield capture, to segment the signal, and to identify audio objects indicated by the signal (e.g., an audio object indicated by each segment of the signal). Each segment (which may be a stream of audio data samples) may be identified as being indicative of sound emitted from a specific source or set of sources. It is conventional to determine a scene “map” (or “scene description” or “sound scene”) comprising data describing each audio object identified from an audio signal (e.g., data indicating a type or source of each object, and a location or trajectory of at least one source which emits the sound comprising the object). An example of an audio object is sound emitted from a specific source (e.g., voice uttered by a specific person). It has also been proposed to use such a scene map to manipulate the soundfield. For example, U.S. Pat. No. 7,876,914, issued Jan. 25, 2011, describes generation of a modified audio signal in response to a first audio signal, where the first audio signal is indicative of a sound scene, the sound scene is indicative of multiple sources, and the modified audio signal is indicative of a virtual microphone tour along a selected path through the sound scene.