The field of the present invention relates to an audio mixer for mixing multi-track signals according to user specifications. It relates to audio signal processing, in particular to the task of mixing a multi-track recording according to a set of user-defined criteria. The field of the invention further relates to a method for mixing a plurality of audio tracks to a mixture signal, and to a computer program for instructing a computer to perform the method for mixing a plurality of audio tracks.
The ever-growing availability of multimedia content yields new ways for the user to enjoy music and to interact with music. These possibilities are accompanied by the challenge to develop the tools for assisting the user in such activities.
From the perspective of information retrieval, this challenge was taken up more than a decade ago, leading to the vibrant research area of music information retrieval and to numerous commercial applications.
A different aspect which has not been addressed to this extent is the interaction with content which is available in a multi-track format. A multi-track format can consist of separate and time-aligned signals (also known as single tracks (ST)) for each sound object (SO) or groups of objects (stems). According to one definition, stems are the individual components of a mix, separately saved (usually to disc or tape) for the purpose of use in a remix.
In the traditional process of music production, multiple single tracks are combined in a sophisticated manner into a mixture signal (MS) which is then delivered to the end user. The ongoing evolution of digital audio technologies, e.g. the development of new audio formats for parametric object-based audio, enables interaction with music to a much greater extent. The user has access to multi-track recordings and can actively control the mixing process. Some artists have begun releasing the stems for some of their songs, the intention being that listeners can freely remix and reuse the music in any way desired.
A musical or audio work released in multi-track format can be used in numerous ways. The user may control the mixing parameters for the different tracks, thus emphasising selected tracks while attenuating other tracks. One or more tracks may be muted, for example for the purposes of karaoke or play-along. Sound effects, such as echo, reverberation, distortion, chorus etc., may be applied to selected tracks without affecting the other tracks. One or more tracks may be excerpted from the multi-track format and can be used in another musical work or another form of audio work, such as an audio book, a lecture, a podcast, etc. In the following description, an application of the teachings disclosed herein discusses, in an exemplary manner, the mastering of a recorded musical work. It should be understood, however, that the processing of any recorded sound involving mixing a plurality of single audio tracks is intended to be equally addressed and covered by the teachings disclosed herein.
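By way of illustration only, the gain-based control described above can be sketched as a weighted sum of time-aligned tracks. The following Python sketch is a minimal example under stated assumptions: the function name `mix_tracks`, the per-track `gains` and the `muted` set are hypothetical names introduced here, and the tracks are assumed to be equal-length lists of samples.

```python
def mix_tracks(tracks, gains, muted=()):
    """Sum time-aligned tracks sample by sample, each scaled by its gain.

    tracks: dict mapping a track name to a list of float samples
    gains:  dict mapping a track name to a linear gain (missing -> 1.0)
    muted:  names of tracks to leave out of the mixture (e.g. for karaoke)
    """
    lengths = {len(samples) for samples in tracks.values()}
    if len(lengths) > 1:
        raise ValueError("all tracks must have the same number of samples")
    n = lengths.pop() if lengths else 0
    mixture = [0.0] * n
    for name, samples in tracks.items():
        if name in muted:
            continue  # a muted track contributes nothing to the mixture
        g = gains.get(name, 1.0)
        for i, x in enumerate(samples):
            mixture[i] += g * x
    return mixture


# Hypothetical three-sample tracks, for illustration only.
tracks = {
    "vocals": [0.5, -0.5, 0.25],
    "guitar": [0.25, 0.25, -0.25],
    "drums": [1.0, 0.0, -1.0],
}
# Emphasise the guitar, attenuate the drums, and mute the vocals.
karaoke = mix_tracks(tracks, {"guitar": 2.0, "drums": 0.5}, muted={"vocals"})
```

A sound effect applied to a selected track would, in this sketch, simply replace that track's sample list before the call, leaving the other tracks unaffected.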
Automatic mixing has been, and still is, the focus of a number of research projects. In 2009, Perez-Gonzalez et al. described a method for automatic equalization of multi-track signals (E. Perez-Gonzalez and J. Reiss, “Automatic Equalization of Multi-Channel Audio Using Cross-Adaptive Methods”, Proc. of the AES 127th Conv., 2009). The authors present a method for automatically setting the attenuation for each signal of a multi-track signal. The gains are determined such that the loudness of each signal equals the average loudness of all signals. Another article by the same authors addressed “Automatic Gain and Fader Control for Live Mixing” and was published in Proc. of WASPAA, 2009.
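The gain rule summarised above, namely that each signal is scaled so that its loudness equals the average loudness of all signals, can be illustrated with a short sketch. It assumes, purely for illustration, that loudness is approximated by the root-mean-square (RMS) level of each track, which is a crude stand-in for a perceptual loudness measure, and that the target is the arithmetic mean of those levels. The function names are hypothetical, and the sketch is not the implementation of the cited authors.

```python
import math


def rms(samples):
    """Root-mean-square level, used here as a simple loudness proxy (assumption)."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def equal_loudness_gains(tracks):
    """Return one linear gain per track so every track reaches the average level.

    tracks: non-empty dict mapping a track name to a non-empty list of samples
    Silent tracks (level 0) keep a gain of 1.0 to avoid dividing by zero.
    """
    levels = {name: rms(samples) for name, samples in tracks.items()}
    target = sum(levels.values()) / len(levels)
    return {
        name: (target / level if level > 0.0 else 1.0)
        for name, level in levels.items()
    }


# Hypothetical tracks with clearly different levels, for illustration only.
tracks = {
    "loud": [0.8, -0.8, 0.8, -0.8],
    "quiet": [0.2, -0.2, 0.2, -0.2],
}
gains = equal_loudness_gains(tracks)
balanced = {name: [gains[name] * x for x in tracks[name]] for name in tracks}
```

In this sketch a track above the average level receives a gain below one (an attenuation), while a track below the average receives a gain above one.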
Semantic HiFi is the name of the European Project IST-507913 (H. Vinet et al., “Semantic HiFi Final Report”, Final Report of IST-507913, 2006). It is mainly related to the retrieval, browsing, and sharing of multimedia content. This comprises browsing and navigating in databases, playlist generation, intra-track navigation (using structural analysis like verse-chorus identification), and meta-data sharing. It also addresses interaction, authoring, and editing: generating mixes including synchronization (that is, “concatenating” audio signals, not mixing multi-track signals), voice transformation, rhythm transformation, voice-controlled instruments, and effects.
Another project is known under the designation “Structured Audio” or MPEG-4. Structured Audio enables the transmission of audio signals at low bit-rates and perceptually based manipulation of and access to sonic data using symbolic and semantic descriptions of the signals (cf. B. L. Vercoe, W. G. Gardner, and E. D. Scheirer, “Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations”, Proc. of IEEE, vol. 86, pp. 922-940, 1998). It features a description of parametric sound post-production for mixing multiple streams and adding audio effects. The parametric descriptions determine how the sounds are synthesized. Structured Audio is thus related to synthesizing audio signals.
In the international patent application published under international publication number WO 2010/111373 A1, a context aware, speech-controlled interface and system is disclosed. The speech-directed user interface system includes at least one speaker for delivering an audio signal to a user and at least one microphone for capturing speech utterances of a user. An interface device interfaces with the speaker and the microphone and provides a plurality of audio signals to the speaker to be heard by the user. A control circuit is operably coupled with the interface device and is configured for selecting at least one of the plurality of audio signals as a foreground audio signal for delivery to the user through the speaker. The control circuit is operable for recognizing speech utterances of a user and using the recognized speech utterances to control the selection of the foreground audio signal.
United States Patent Application Publication No. US 2002/0087310 A1 discloses a computer-implemented method and system for handling a speech dialogue with a user. Speech input from a user contains words directed to a plurality of concepts. The user speech input contains a request for a service to be performed. Speech recognition of the user speech input is used to generate recognized words. A dialogue template is applied to the recognized words. The dialogue template has nodes that are associated with predetermined concepts. The nodes include different request processing information. Conceptual regions are identified within the dialogue template based upon which nodes are associated with concepts that approximately match the concepts of the recognized words. The user's request is processed by using the request processing information of the nodes contained within the identified conceptual regions.
The article “Transient Detection of Audio Signals Based on an Adaptive Comb Filter in the Frequency Domain” by M. Kwong and R. Lefebvre presents a transient detection algorithm suitable for rhythm detection in music signals. In many audio signals, low energy transients are masked by high energy stationary sounds. These masked transients, as well as higher energy and more visible transients, convey important information on the rhythm and time segmentation of the music signal. The proposed segmentation algorithm uses a sinusoidal model combined with adaptive comb filtering in the frequency domain to remove the stationary component of a sound signal. After filtering, the time envelope of the residual signal is analyzed to locate the transient components. Results show that the proposed algorithm can accurately detect most low energy transients.
The mixing of a multi-track recording is typically an authoring task performed by an expert, the mixing engineer. Current developments in multimedia, such as interactive audio formats, lead to applications where multi-track recordings need to be mixed in an automated way or in a semi-automated way guided by a non-expert. It is desired that the automatically derived mixture signal have a subjective sound quality comparable to that of a mixture signal generated by a human expert.
The teachings disclosed herein address this general goal. The teachings are related to audio signal processing, in particular the task of mixing a multi-track recording according to a set of user-defined criteria for the (eventual) purpose of listening. An audio mixer and a method for mixing a plurality of audio tracks to a mixture signal according to the teachings disclosed herein establish a connection between a substantially aesthetic idea of a non-expert and the resulting mixture signal.