In the following, an introduction will be given.
Introduction
Multi-channel audio material is becoming increasingly popular in the consumer home environment as well. This is mainly because movies on DVD offer 5.1 multi-channel sound, so even home users frequently install audio playback systems that are capable of reproducing multi-channel audio.
Such a setup may e.g. consist of three speakers (L, C, R) in the front, two speakers (Ls, Rs) in the back and one low frequency effects channel (LFE). For convenience, the given explanations are related to 5.1 systems. They apply to any other multi-channel systems with minor modifications.
Multi-channel systems provide several well-known advantages over two-channel stereo reproduction, e.g.:
Advantage 1: Improved front image stability even off the optimal (central) listening position. Due to the center channel the “sweet-spot” is enlarged. The term “sweet-spot” denotes the area of listening positions where an optimal sound impression is perceived.
Advantage 2: An increased experience of “envelopment” and spaciousness is created by the rear channel speakers.
Nevertheless, there exists a huge amount of legacy audio content with two audio channels (“stereo”) or even only one (“mono”), e.g. old movies and television series.
Recently, various methods for generating a multi-channel signal from an audio signal with fewer channels have been developed (see Section 2 for an overview of the related conventional concepts). The process of generating a multi-channel signal from an audio signal with fewer channels is called “upmixing”.
Two concepts of upmixing are widely known:
1. Upmixing with additional information guiding the upmix process. The additional information may be either “encoded” in a specific way in the input signal or may be stored additionally. This concept is frequently called “guided upmix”.
2. The “blind upmix”, in which a multi-channel signal is obtained exclusively from the audio signal, without any additional information.
Embodiments according to the present invention are related to the latter, i.e. the blind upmix process.
In the literature, an alternative taxonomy for upmix processes is reported. Upmix processes may follow either the Direct/Ambient-Concept or the “In-the-band”-Concept or a mixture of both. These two concepts are described in the following.
A. Direct/Ambient-Concept
The “direct sound sources” are reproduced through the three front channels in such a way that they are perceived at the same position as in the original two-channel version. The term “direct sound source” is used to describe a sound coming solely and directly from one discrete sound source (e.g. an instrument), with little or no additional sound, e.g. due to reflections from the walls.
The rear speakers are fed with ambient sounds (ambience-like sounds). Ambient sounds are those forming an impression of a (virtual) listening environment, including room reverberation, audience sounds (e.g. applause), environmental sounds (e.g. rain), artistically intended effect sounds (e.g. vinyl crackling) and background noise.
FIG. 23 illustrates the sound image of the original two-channel version and FIG. 24 shows the same for an upmix following the Direct/Ambient-Concept.
B. “In-the-Band”-Concept
Following the “In-the-band”-Concept, every sound, or at least some sounds (direct sound as well as ambient sounds) may be positioned all around the listener. The position of a sound is independent of its characteristics (i.e. whether it is a direct sound or an ambient sound) and only dependent on the specific design of the algorithm and its parameter settings. FIG. 25 illustrates the sound image of the “In-the-band”-Concept.
Apparatus and methods according to the invention relate to the direct/ambient concept. The following section gives an overview of conventional concepts in the context of upmixing an audio signal with m channels to an audio signal with n channels, with m<n.
2 Conventional Concepts in Blind Upmixing
2.1 Upmixing of Mono Recordings
2.1.1 Pseudo-Stereophonic Processing
Most of the techniques to produce a so-called “pseudo-stereophonic” signal are not signal adaptive. This means that they process any mono signal in the same way, no matter what the content is. Those systems often work with simple filter structures and/or time delays to decorrelate the output signals, e.g. by processing two copies of the one-channel input signal with a pair of complementary comb filters [Sch57]. A comprehensive overview of such systems can be found in [Fal05].
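The complementary comb filter idea can be illustrated with a minimal sketch. It is not the specific filter design of [Sch57]; the function name, the delay of 441 samples and the gain of 0.7 are assumptions chosen only for illustration. One channel adds a delayed copy of the mono signal and the other subtracts it, so that spectral peaks in one channel coincide with notches in the other, while the sum of both channels returns the (scaled) mono input.

```python
import numpy as np

def pseudo_stereo_comb(x, delay=441, gain=0.7):
    """Split a mono signal into two channels using complementary comb filters.

    delay (in samples) and gain are illustrative assumptions, not values
    taken from the cited literature.
    """
    x = np.asarray(x, dtype=float)
    delayed = np.zeros_like(x)
    if delay < len(x):
        delayed[delay:] = x[:len(x) - delay]  # x[n - delay], zero-padded
    left = x + gain * delayed    # peaks where the right channel has notches
    right = x - gain * delayed   # notches where the left channel has peaks
    return left, right

# Example input: one second of noise at an assumed 48 kHz sampling rate.
rng = np.random.default_rng(0)
mono = rng.standard_normal(48000)
left, right = pseudo_stereo_comb(mono)
```

Because the processing does not depend on the content of the signal, the same filters are applied to speech, music and noise alike, which is the non-adaptive behavior described above.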
2.1.2 Semi-Automatic Mono to Stereo Upmixing Using Sound Source Formation
The authors propose an algorithm to identify signal components (e.g. time-frequency bins of a spectrogram) which belong to the same sound source and should therefore be panned together [LMT07]. The sound source formation algorithm considers principles of stream segregation (derived from the Gestalt principles): continuity in time, harmonic relations in frequency and amplitude similarity. Sound sources are identified using clustering methods (unsupervised learning). The derived “time-frequency-clusters” are further grouped into larger sound streams using (a) information on the frequency range of the objects and (b) timbral similarities. The authors report the use of a sinusoidal modeling algorithm (i.e. the identification of sinusoidal components of a signal) as a front end.
After the sound source formation, the user selects sound sources and applies panning weights to them. It should be noted that (according to some conventional concepts) many of the proposed methods (sinusoidal modeling, stream segregation) do not perform reliably when processing real-world signals of average complexity.
2.1.3 Ambience Extraction Using Non-Negative Matrix Factorization
A time-frequency distribution (TFD) of the input signal is computed, e.g. by means of Short-term Fourier Transform. An estimate of the TFD of the direct signal components is derived by means of the numerical optimization method of Non-negative Matrix Factorization. An estimate of the TFD of the ambient signal is obtained by computing the difference of the TFD of the input signal and the estimate of the TFD of the direct signal (i.e. the approximation residual).
The re-synthesis of the time signal of the ambient signal is carried out using the phase spectrogram of the input signal. Additional post-processing is optionally applied in order to improve the listening experience of the derived multi-channel signal [UWHH07].
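The residual-based ambience estimate may be sketched as follows. This is a simplified illustration and not the implementation of [UWHH07]: the multiplicative update rules for a Euclidean cost, the rank of 8, the frame and hop sizes, and the clipping of negative residual values to zero are all assumptions. The helper names `stft`, `nmf` and `extract_ambience` are hypothetical.

```python
import numpy as np

def stft(x, frame=512, hop=256):
    """Hann-windowed STFT; returns a complex array (n_bins, n_frames)."""
    window = np.hanning(frame)
    n_frames = 1 + (len(x) - frame) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame] * window for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def nmf(V, rank=8, n_iter=100, seed=0):
    """Non-negative factorization V ~ W @ H via multiplicative updates."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def extract_ambience(x, rank=8, n_iter=100):
    spec = stft(x)
    mag, phase = np.abs(spec), np.angle(spec)
    W, H = nmf(mag, rank, n_iter)
    direct_mag = W @ H                                # estimate of the direct TFD
    ambient_mag = np.maximum(mag - direct_mag, 0.0)   # residual; clipping is an assumption
    ambient_spec = ambient_mag * np.exp(1j * phase)   # reuse the input phase
    return ambient_spec, spec

rng = np.random.default_rng(1)
signal = rng.standard_normal(8192)
ambient_spec, spec = extract_ambience(signal)
```

The complex spectrogram `ambient_spec` would then be passed to an inverse STFT to obtain the ambient time signal; that step is omitted here.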
2.1.4 Adaptive Spectral Panoramization (ASP)
A method for the panoramization of a mono signal for playback using a stereo sound system is described in [VZA06]. The processing incorporates an STFT, the weighting of the frequency bins used for the re-synthesis of the left and right channel signal, and the inverse STFT. The time-varying weighting factors are derived from low-level features computed from the spectrogram of the input signal in sub-bands.
2.2 Upmixing of Stereo Recordings
2.2.1 Matrix Decoders
Passive matrix decoders compute a multi-channel signal using a time-invariant linear combination of the input channel signals.
Active matrix decoders (e.g. Dolby Pro Logic II [Dre00], DTS NEO:6 [DTS] or HarmanKardon/Lexicon Logic 7 [Kar]) apply an analysis of the input signal and perform signal-dependent adaptation of the matrix elements (i.e. the weights for the linear combination). These decoders use inter-channel differences and signal adaptive steering mechanisms to produce multi-channel output signals. Matrix steering methods aim at detecting prominent sources (e.g. dialogues). The processing is performed in the time domain.
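A passive matrix decoder, as described above, reduces to a single fixed matrix multiplication. The following sketch uses hypothetical coefficients (sum for the center, difference for a single surround channel) purely for illustration; they are not the coefficients of any of the commercial decoders named above. An active decoder would replace the constant matrix with one that is recomputed from an analysis of the input signal.

```python
import numpy as np

# Hypothetical, time-invariant decoding matrix (rows: L, R, C, S).
PASSIVE_MATRIX = np.array([
    [1.0,  0.0],   # L = Lt
    [0.0,  1.0],   # R = Rt
    [0.5,  0.5],   # C = (Lt + Rt) / 2
    [0.5, -0.5],   # S = (Lt - Rt) / 2
])

def passive_decode(stereo):
    """Map a (2, n_samples) stereo signal to a (4, n_samples) output."""
    return PASSIVE_MATRIX @ np.asarray(stereo, dtype=float)

# A signal that is identical in both input channels ends up in the
# center channel and is cancelled in the surround channel.
lt = np.array([1.0, 0.5, -0.25])
decoded = passive_decode(np.vstack([lt, lt]))
```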
2.2.2 A Method to Convert Stereo to Multi-Channel Sound
Irwan and Aarts present a method to convert a signal from stereo to multichannel [IA01]. The signal for the surround channels is calculated by using a cross-correlation technique (an iterative estimation of the correlation coefficient is proposed in order to reduce the computational load).
The mixing coefficients for the center channel are obtained using Principal Component Analysis (PCA). PCA is applied to calculate a vector, which indicates the direction of the dominant signal. Only one dominant signal can be detected at a time. The PCA is performed using an iterative gradient descent method (which is less demanding with respect to computational load compared to the standard PCA using an eigenvalue decomposition of the covariance matrix of the observation). The computed vector of direction is similar to the output of a goniometer if all decorrelated signal components are neglected. The direction is then mapped from a two- to a three-channel representation to create the three front channels.
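An iterative estimate of the first principal component can be sketched with Oja's rule, which is used here as a stand-in for the gradient method of [IA01]; the exact update of that paper is not reproduced in this text. The step size, the starting vector and the test signal are assumptions.

```python
import numpy as np

def dominant_direction(left, right, step=0.01):
    """Estimate the direction of the dominant signal in a stereo pair.

    Uses Oja's rule (an assumption, see the text above) instead of an
    eigenvalue decomposition of the covariance matrix.
    """
    w = np.array([1.0, 1.0]) / np.sqrt(2.0)   # start at the center
    for x in np.column_stack([left, right]):
        y = w @ x                             # projection onto current estimate
        w = w + step * y * (x - y * w)        # Oja's update
        w = w / np.linalg.norm(w)             # keep unit length
    return w

# One source panned with gains (0.96, 0.28) plus weak uncorrelated noise.
rng = np.random.default_rng(0)
source = rng.standard_normal(5000)
gains = np.array([0.96, 0.28])
noise = 0.01 * rng.standard_normal((5000, 2))
left = gains[0] * source + noise[:, 0]
right = gains[1] * source + noise[:, 1]
direction = dominant_direction(left, right)
```

For this test signal the estimated `direction` approaches the panning gains, i.e. it points towards the left channel. Since a single vector is tracked, only one dominant signal can be represented at a time, as noted above.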
2.2.3 An Unsupervised Adaptive Filtering Approach of 2-to-5 Channel Upmix
The authors propose an improved algorithm compared to the method by Irwan and Aarts. The originally proposed method is applied to each sub-band [LD05]. The authors assume W-disjoint orthogonality of the dominant signals. The frequency decomposition is carried out using either a Pseudo Quadrature Mirror Filterbank or a wavelet-based octave filterbank. A further extension to the method by Irwan and Aarts is the use of an adaptive step size for the iterative computation of the (first) principal component.
2.2.4 Ambience Extraction and Synthesis from Stereo Signals for Multi-channel Audio Upmix
Avendano and Jot propose a frequency-domain technique to identify and extract the ambience information in stereo audio signals [AJ02].
The method is based on the computation of an inter-channel coherence index and a non-linear mapping function that allows for the determination of the time-frequency regions that consist mostly of ambience components. Ambient signals are subsequently synthesized and used to feed the surround channels of the multi-channel playback system.
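The two steps can be sketched as follows. The coherence index is computed here from recursively smoothed auto- and cross-spectra, and a logistic function serves as the non-linear mapping; the forgetting factor, threshold and slope are assumptions, and the exact mapping function of [AJ02] is not given in this text.

```python
import numpy as np

def coherence_index(SL, SR, forget=0.9):
    """Inter-channel coherence per time-frequency tile.

    SL, SR: complex STFTs of the left and right channel, shape (n_bins, n_frames).
    """
    eps = 1e-12
    n_bins, n_frames = SL.shape
    p_ll = np.zeros(n_bins)
    p_rr = np.zeros(n_bins)
    p_lr = np.zeros(n_bins, dtype=complex)
    coh = np.zeros((n_bins, n_frames))
    for m in range(n_frames):
        p_ll = forget * p_ll + (1 - forget) * np.abs(SL[:, m]) ** 2
        p_rr = forget * p_rr + (1 - forget) * np.abs(SR[:, m]) ** 2
        p_lr = forget * p_lr + (1 - forget) * SL[:, m] * np.conj(SR[:, m])
        coh[:, m] = np.abs(p_lr) / np.sqrt(p_ll * p_rr + eps)
    return coh

def ambience_mask(coh, threshold=0.5, slope=10.0):
    """Map low coherence to gains near 1 and high coherence to gains near 0."""
    return 1.0 / (1.0 + np.exp(slope * (coh - threshold)))

# Identical channels are fully coherent; independent channels are not.
rng = np.random.default_rng(0)
shape = (64, 200)
A = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
B = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
coh_same = coherence_index(A, A)
coh_diff = coherence_index(A, B)
```

Multiplying the STFT of each input channel by `ambience_mask(coh)` would give an estimate of the ambience components, which could then be re-synthesized for the surround channels.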
2.2.5 Descriptor Based Spatialization
The authors describe a method for one-to-n upmixing, which can be controlled by an automated classification of the signal [MPA+05]. The paper contains some errors; therefore it might be that the authors aimed at different goals than described in the paper.
The upmix process uses three processing blocks: the “upmix tool”, artificial reverberation and equalization. The “upmix tool” consists of various processing blocks, including the extraction of an ambient signal. The method for the extraction of an ambient signal (“spatial discriminator”) is based on the comparison of the left and right signal of a stereo recording in the spectral domain. For upmixing mono-signals, artificial reverberation is used.
The authors describe 3 applications: 1-to-2 upmixing, 2-to-5 upmixing, and 1-to-5 upmixing.
Classification of the Audio Signal
The classification process uses a supervised learning approach: Low-level features are extracted from the audio signal and a classifier is applied to classify the audio signal into one of three classes: music, voices or any other sounds.
A particularity of the classification process is the use of a genetic programming method to find:
optimal features (as compositions of different operations),
the optimal combination of the obtained low-level features,
the best classifier from a set of available classifiers, and
the best parameter setting for the chosen classifier.
1-to-2 Upmixing
The upmix is done using reverberation and equalization. If the signal contains voice, the equalization is enabled and reverberation is disabled. Otherwise, the equalization is disabled and reverberation is enabled. No dedicated processing aiming at the suppression of speech in the rear channels is incorporated.
2-to-5 Upmixing
The authors aim at building a multi-channel soundtrack in which detected voices are attenuated by muting the center channel.
1-to-5 Upmixing
The multi-channel signal is generated using reverberation, equalization and the “upmix tool” (which generates a 5.1 signal from a stereo signal; the stereo signal is the output of the reverberation and the input to the “upmix tool”). Different presets are used for music, voices and all other sounds. By controlling reverberation and equalization, a multi-channel soundtrack is built that keeps voices in the center channel and has music and other sounds in all channels.
If the signal contains voice, the reverberation is disabled. Otherwise, reverberation is enabled. Since the extraction of the rear-channel signal relies on a stereo signal, no rear-channel signal is generated when reverberation is disabled (which is the case for voices).
2.2.6 Ambience-Based Upmixing
Soulodre presents a system that creates a multi-channel signal from a stereo signal [Sou04]. The signal is decomposed into so-called “individual source streams” and “ambience streams”. Based on these streams, a so-called “Aesthetic Engine” synthesizes the multi-channel output. No further technical details of the decomposition and the synthesis steps are given.
2.3 Upmixing of Audio Signals with Arbitrary Number of Channels
2.3.1 Multichannel Surround Format Conversion and Generalized Up-Mix
The authors describe a method based on spatial audio coding using an intermediate mono downmix and introduce an improved method without the intermediate downmix. The improved method comprises passive matrix upmixing and principles known from Spatial Audio Coding. The improvements are gained at the expense of increased data rate of the intermediate audio [GJ07a].
2.3.2 Primary-Ambient Signal Decomposition and Vector-Based Localization for Spatial Audio Coding and Enhancement
The authors propose a separation of the input signal into a primary (direct) signal and an ambient signal using Principal Component Analysis (PCA) [GJ07b].
The input signal is modeled as the sum of a primary (direct) signal and an ambient signal. It is assumed that the direct signals have substantially more energy than the ambient signal and both signals are uncorrelated.
The processing is carried out in the frequency domain. The STFT coefficients of the direct signal are obtained from the projection of the STFT coefficients of the input signal onto the first principal component. The STFT coefficients of the ambient signal are computed from the difference of the STFT coefficients of the input signal and the direct signal.
Since only the (first) principal component (i.e. the eigenvector of the covariance matrix corresponding to the largest eigenvalue) is needed, a computationally efficient alternative for the eigenvalue decomposition used in standard PCA is applied (which is an iterative approximation). The cross-correlation needed for the PCA decomposition is also estimated iteratively. The direct and ambient signal add up to the original, i.e. no information is lost in the decomposition.
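The decomposition described above can be illustrated for the STFT coefficients of one sub-band of a two-channel signal. For clarity, this sketch uses a direct eigenvalue decomposition of the 2x2 covariance matrix rather than the iterative approximation of [GJ07b]; the function name and the test signal are assumptions.

```python
import numpy as np

def primary_ambient(X):
    """Split two-channel STFT coefficients into primary and ambient parts.

    X: complex array of shape (2, n), one row per channel.
    """
    R = X @ X.conj().T / X.shape[1]          # 2x2 covariance estimate
    eigvals, eigvecs = np.linalg.eigh(R)     # eigenvalues in ascending order
    v = eigvecs[:, -1]                       # first principal component
    primary = np.outer(v, v.conj() @ X)      # projection onto v
    ambient = X - primary                    # residual
    return primary, ambient

# One strong source panned with gains (0.8, 0.6) plus weak uncorrelated noise.
rng = np.random.default_rng(0)
n = 2000
source = rng.standard_normal(n) + 1j * rng.standard_normal(n)
diffuse = 0.1 * (rng.standard_normal((2, n)) + 1j * rng.standard_normal((2, n)))
X = np.outer([0.8, 0.6], source) + diffuse
primary, ambient = primary_ambient(X)
```

By construction, `primary + ambient` reproduces `X`, which corresponds to the statement that no information is lost in the decomposition. The ambient part is orthogonal to the primary part in every time-frequency tile.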