Many multimedia editing applications use a background audio signal, such as music or ambiance sounds, behind a foreground audio signal, such as a dialogue track. For the dialogue to be intelligible, the loudness of the background audio signal should be much lower than that of the foreground audio signal. If the foreground signal is interrupted by pauses, such as a pause in the spoken dialogue, the loudness of the background audio signal can be increased to make the overall sound more interesting. This process of changing the loudness of the background audio signal based on the state for the foreground audio signal is referred to in the industry as “audio ducking.”
Existing systems perform audio ducking by adding key frames to indicate an increase or decrease in the loudness of the background audio signal. Adding the key frames is performed in one of two techniques: manual or automated. Both techniques are generally complex and may not produce accurate audio ducking.
More specifically, in existing manual key frame techniques, a user manually places key frames on the background audio signal based on the foreground audio signal. This can be done by eyeballing the waveform of the foreground audio signal or by listening to the foreground audio signal and placing the key frames while listening. Unless a trained user performs this process, the resulting audio ducking can be inaccurate. Further, even if a trained user performs the process, the overall workflow is tedious, un-scalable, and time consuming.
Existing automated key frame techniques can also be inaccurate, inefficient, or both. In one approach, such existing techniques break the foreground audio signal into multiple audio clips. Key frames are then added at the start and end of each of the audio clips. However, because this approach does not rely on the actual dialogue in each of the foreground audio clips, the audio ducking can be inaccurate. For example, if an audio clip starts with a long pause before an actual dialogue, the existing systems would add a key frame at the start of the audio clip and a long time before the actual dialogue. Hence, the audio ducking would result in a long period of silence before the actual dialogue starts. In another example, the key frame generation is automated by recording a volume change during real-time mixing and a key frame is added at a location of a volume change. Although this approach can be more accurate, it can be computationally inefficient because each time there is a change to the audio mixing, the key frame generation needs to be repeated altogether.
In another approach, the existing automated key frame techniques use side chaining. Generally, side chaining refers to routing the foreground audio signal to a side chain of an audio processor that controls the background audio signal. When the volume of the audio signal in the side chain is high enough, the audio processor reduces the volume of the background audio. However, existing side chaining systems can only perform accurate ducking if a delay is introduced such that the volume of the background music is reduced a little before the increase to the volume in the side chain. Further, configuring such systems can be complex because of the complexity to set-up the routing of the foreground signal to the side chain and to fine tune the audio processor.