The present invention relates to the field of audio signal processing and, in particular, to generating several output channels out of fewer input channels, such as, for example, one (mono) channel or two (stereo) input channels.
Multi-channel audio material is becoming more and more popular. As a result, many end users now own multi-channel reproduction systems. This can mainly be attributed to the increasing popularity of DVDs, as a consequence of which many DVD users now own 5.1 multi-channel equipment. Reproduction systems of this kind generally consist of three loudspeakers L (left), C (center) and R (right), which are typically arranged in front of the user, two loudspeakers Ls and Rs, which are arranged behind the user, and typically one LFE channel, which is also referred to as low-frequency effect channel or subwoofer. Such a channel scenario is indicated in FIGS. 5b and 5c. While the loudspeakers L, C, R, Ls, Rs should be positioned relative to the user as shown in FIGS. 5b and 5c in order for the user to receive the best possible hearing experience, the positioning of the LFE channel (not shown in FIGS. 5b and 5c) is less critical, since the ear cannot perform localization at such low frequencies. The LFE loudspeaker may consequently be placed wherever its considerable size is not in the way.
Such a multi-channel system exhibits several advantages compared to a typical stereo reproduction which is a two-channel reproduction, as is exemplarily shown in FIG. 5a. 
Owing to the center channel, the front hearing experience, which is also referred to as the “front image”, is more stable even outside the optimum central hearing position. The result is a larger “sweet spot”, the “sweet spot” representing the optimum hearing position.
Additionally, the listener is provided with an improved experience of “delving into” the audio scene, due to the two back loudspeakers Ls and Rs.
Nevertheless, there is a huge amount of audio material, either owned by users or generally available, which exists only as stereo material, i.e. which includes only two channels, namely the left channel and the right channel. Compact discs are typical sound carriers for stereo pieces of this kind.
The ITU recommends two options for playing stereo material of this kind using 5.1 multi-channel audio equipment.
The first option is playing the left and right channels using the left and right loudspeakers of the multi-channel reproduction system. However, this solution is disadvantageous in that the plurality of loudspeakers already present is not exploited: the center loudspeaker and the two back loudspeakers remain unused.
Another option is converting the two channels into a multi-channel signal. This may be done during reproduction or by special pre-processing, which advantageously makes use of all six loudspeakers of the 5.1 reproduction system exemplarily present and thus results in an improved hearing experience when two channels are upmixed to five or six channels in an error-free manner.
The second option, i.e. using all the loudspeakers of the multi-channel system, is advantageous compared to the first option only when there are no upmixing errors. Upmixing errors of this kind may be particularly disturbing when the signals for the back loudspeakers, which are also known as ambience signals, cannot be generated in an error-free manner.
One way of performing this so-called upmixing process is known as the “direct ambience concept”. The direct sound sources are reproduced by the three front channels such that the user perceives them at the same position as in the original two-channel version. The original two-channel version is illustrated schematically in FIG. 5a using different drum instruments.
FIG. 5b shows an upmixed version of the concept wherein all the original sound sources, i.e. the drum instruments, are reproduced by the three front loudspeakers L, C and R, and wherein special ambience signals are additionally output by the two back loudspeakers. The term “direct sound source” is thus used for describing a tone coming only and directly from a discrete sound source, such as, for example, a drum instrument or another instrument, or generally a special audio object, as is exemplarily illustrated in FIG. 5a using a drum instrument. A direct sound source of this kind contains no additional tones such as those caused, for example, by wall reflections. In this scenario, the sound signals output by the two back loudspeakers Ls, Rs in FIG. 5b are made up only of ambience signals, which may or may not be present in the original recording. Ambience signals of this kind do not belong to a single sound source, but contribute to reproducing the room acoustics of a recording and thus result in a so-called “delving into” experience for the listener.
Another, alternative concept, which is referred to as the “in-the-band” concept, is illustrated schematically in FIG. 5c. Every type of sound, i.e. direct sound sources and ambience-type tones alike, is positioned around the listener. The position of a tone is independent of its characteristic (direct sound source or ambience-type tone) and depends only on the specific design of the algorithm, as is exemplarily illustrated in FIG. 5c. In FIG. 5c, the upmix algorithm has thus determined that the two instruments 1100 and 1102 are positioned laterally relative to the listener, whereas the two instruments 1104 and 1106 are positioned in front of the user. As a result, the two back loudspeakers Ls, Rs now also contain portions of the two instruments 1100 and 1102 and no longer ambience-type tones only, as was the case in FIG. 5b, where the same instruments are all positioned in front of the user.
The expert publication “C. Avendano and J. M. Jot: “Ambience Extraction and Synthesis from Stereo Signals for Multichannel Audio Upmix”, IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 02, Orlando, Fla., May 2002” discloses a frequency domain technique for identifying and extracting ambience information in stereo audio signals. This concept is based on calculating an inter-channel coherency and a non-linear mapping function which allows determining those time-frequency regions in the stereo signal which mainly consist of ambience components. Ambience signals are then synthesized and used for feeding the back channels or “surround” channels Ls, Rs (FIGS. 10 and 11) of a multi-channel reproduction system.
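The coherence-based detection described above may be illustrated by the following minimal sketch. It is not the implementation of the cited publication: the frame length, the hop size, the forgetting factor `forget`, the logistic mapping in `ambience_weight` and all function names are assumptions chosen for illustration, and a plain rectangular-window DFT stands in for a proper short-time Fourier transform.

```python
import math
import cmath


def dft(frame):
    """Naive DFT of a real frame, returning bins 0..N/2."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def coherence_frames(left, right, frame_len=16, hop=8, forget=0.8):
    """Return one list of inter-channel coherence values (0..1) per frame.

    Auto- and cross-spectra are smoothed recursively over time with the
    assumed forgetting factor `forget` before the coherence is formed.
    """
    bins = frame_len // 2 + 1
    p_ll = [0.0] * bins
    p_rr = [0.0] * bins
    p_lr = [0j] * bins
    rows = []
    for start in range(0, len(left) - frame_len + 1, hop):
        xl = dft(left[start:start + frame_len])
        xr = dft(right[start:start + frame_len])
        row = []
        for k in range(bins):
            p_ll[k] = forget * p_ll[k] + (1 - forget) * abs(xl[k]) ** 2
            p_rr[k] = forget * p_rr[k] + (1 - forget) * abs(xr[k]) ** 2
            p_lr[k] = forget * p_lr[k] + (1 - forget) * xl[k] * xr[k].conjugate()
            denom = math.sqrt(p_ll[k] * p_rr[k])
            row.append(abs(p_lr[k]) / denom if denom > 1e-12 else 0.0)
        rows.append(row)
    return rows


def ambience_weight(coherence, threshold=0.5, slope=10.0):
    """Hypothetical non-linear mapping: low coherence gives a weight near 1."""
    return 1.0 / (1.0 + math.exp(slope * (coherence - threshold)))
```

Under these assumptions, time-frequency regions in which both channels carry the same content yield a coherence close to one and hence a weight close to zero, whereas weakly correlated regions are treated as candidates for ambience.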
In the expert publication “R. Irwan and Ronald M. Aarts: “A method to convert stereo to multi-channel sound”, The proceedings of the AES 19th International Conference, Schloss Elmau, Germany, Jun. 21-24, pages 139-143, 2001”, a method for converting a stereo signal to a multi-channel signal is presented. The signal for the surround channels is calculated using a cross-correlation technique. A principal component analysis (PCA) is used for calculating a vector indicating the direction of the dominant signal. This vector is then mapped from a two-channel representation to a three-channel representation in order to generate the three front channels.
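The calculation of a dominant-direction vector by means of a principal component analysis may be sketched as follows. The sketch assumes zero-mean input signals and uses the closed-form principal axis of the 2x2 matrix of second moments; the function name `dominant_direction` is hypothetical, and the subsequent mapping to three front channels described in the cited publication is not reproduced.

```python
import math


def dominant_direction(left, right):
    """Return the unit vector (l, r) of the principal axis of a stereo signal.

    For the symmetric matrix [[s_ll, s_lr], [s_lr, s_rr]] the principal axis
    has the angle 0.5 * atan2(2 * s_lr, s_ll - s_rr).
    """
    s_ll = sum(l * l for l in left)
    s_rr = sum(r * r for r in right)
    s_lr = sum(l * r for l, r in zip(left, right))
    angle = 0.5 * math.atan2(2.0 * s_lr, s_ll - s_rr)
    return math.cos(angle), math.sin(angle)
```

With identical left and right signals the vector points at 45 degrees, i.e. between the two channels, whereas a signal present in the left channel only yields a vector along the left axis.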
All known techniques try in different manners to extract the ambience signals from the original stereo signals or even to synthesize them from noise or further information, wherein information which is not in the stereo signal may be used for synthesizing the ambience signals. In the end, however, this is all about extracting information from the stereo signal and/or feeding into a reproduction scenario information which is not present in an explicit form, since typically only a two-channel stereo signal and, possibly, additional information and/or meta-information are available.
Subsequently, further known upmixing methods operating without control parameters will be detailed. Upmixing methods of this kind are also referred to as blind upmixing methods.
Most techniques of this kind for generating a so-called pseudo-stereophony signal from a mono-channel (i.e. a 1-to-2 upmix) are not signal-adaptive. This means that they will process a mono-signal in the same manner irrespective of which content is contained in the mono-signal. Systems of this kind frequently operate using simple filtering structures and/or time delays in order to decorrelate the signals generated, exemplarily by processing the one-channel input signal by a pair of so-called complementary comb filters, as is described in M. Schroeder, “An artificial stereophonic effect obtained from using a single signal”, JAES, 1957. Another overview of systems of this kind can be found in C. Faller, “pseudo stereophony revisited”, Proceedings of the AES 118th Convention, 2005.
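A pair of complementary comb filters of the kind mentioned above may be sketched as follows. This is a generic feed-forward form given for illustration only and not necessarily the structure used in the cited publications; the function name and the parameters `delay` and `gain` are assumptions.

```python
def complementary_comb_pair(mono, delay, gain=0.7):
    """Derive a (left, right) pair from a mono signal.

    The left channel adds a delayed copy of the input and the right channel
    subtracts it, so the spectral peaks of one channel coincide with the
    notches of the other. The sum of both channels is twice the input.
    """
    left, right = [], []
    for n, x in enumerate(mono):
        delayed = mono[n - delay] if n >= delay else 0.0
        left.append(x + gain * delayed)
        right.append(x - gain * delayed)
    return left, right


# Impulse responses of the two filters for a delay of three samples.
impulse = [1.0] + [0.0] * 7
left, right = complementary_comb_pair(impulse, delay=3, gain=0.5)
```

Since the processing does not depend on the content of the input signal, the sketch also illustrates why such a method is not signal-adaptive.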
Additionally, there is the technique of ambience signal extraction using a non-negative matrix factorization, in particular in the context of a 1-to-N upmix, N being greater than two. Here, a time-frequency distribution (TFD) of the input signal is calculated, exemplarily by means of a short-time Fourier transform. An estimated value of the TFD of the direct signal components is derived by means of a numerical optimizing method which is referred to as non-negative matrix factorization. An estimated value for the TFD of the ambience signal is determined by calculating the difference of the TFD of the input signal and the estimated value of the TFD for the direct signal. Re-synthesis or synthesis of the time signal of the ambience signal is performed using the phase spectrogram of the input signal. Additional post-processing is performed optionally in order to improve the hearing experience of the multi-channel signal generated. This method is described in detail by C. Uhle, A. Walther, O. Hellmuth and J. Herre in “Ambience separation from mono recordings using non-negative matrix factorization”, Proceedings of the AES 30th Conference 2007.
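The factorization and difference steps described above may be sketched as follows. The sketch operates on a given magnitude TFD `v` (frequency bins by frames) and uses multiplicative updates for a Euclidean cost; the cost function, the rank, the number of iterations, the random initialization and all function names are assumptions, and the cited publication may differ in each of these points. Re-synthesis using the phase spectrogram of the input signal and the optional post-processing are not shown.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iterations=100, seed=0, eps=1e-12):
    """Factorize the non-negative matrix v into w (bins x rank) and h (rank x frames)."""
    rng = random.Random(seed)
    n_bins, n_frames = len(v), len(v[0])
    w = [[rng.uniform(0.1, 1.1) for _ in range(rank)] for _ in range(n_bins)]
    h = [[rng.uniform(0.1, 1.1) for _ in range(n_frames)] for _ in range(rank)]
    for _ in range(iterations):
        # Update h while w is held fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_frames)]
             for i in range(rank)]
        # Update w while h is held fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_bins)]
    return w, h


def ambience_magnitude(v, w, h):
    """Estimate the ambience TFD as the non-negative part of v minus the direct estimate w*h."""
    direct = matmul(w, h)
    return [[max(x - d, 0.0) for x, d in zip(v_row, d_row)]
            for v_row, d_row in zip(v, direct)]
```

In this sketch, the product of `w` and `h` plays the role of the estimated TFD of the direct signal components, and whatever the low-rank model cannot explain is attributed to the ambience signal.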
There are different techniques for upmixing stereo recordings. One technique is using matrix decoders. Matrix decoders are known under names such as Dolby Pro Logic II, DTS Neo:6 or Harman Kardon/Lexicon Logic 7 and are contained in nearly every audio/video receiver sold nowadays. As a byproduct of their intended functionality, these methods are also able to perform blind upmixing. These decoders use inter-channel differences and signal-adaptive control mechanisms for generating multi-channel output signals.
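The use of inter-channel differences may be illustrated with a passive sum/difference matrix. The sketch below is an assumption made for illustration only and is not the algorithm of any of the named products, whose signal-adaptive control mechanisms are not reproduced; the function name `passive_matrix_decode` and the scaling by one over the square root of two are likewise assumed.

```python
import math


def passive_matrix_decode(left, right):
    """Derive center and surround signals from a stereo pair.

    The center signal is the scaled sum and the surround signal the scaled
    difference of the two input channels; left and right are passed through.
    """
    k = 1.0 / math.sqrt(2.0)
    center = [k * (l + r) for l, r in zip(left, right)]
    surround = [k * (l - r) for l, r in zip(left, right)]
    return {"L": list(left), "R": list(right), "C": center, "S": surround}
```

Content which is identical in both input channels thus appears in the center signal only, whereas content which is in anti-phase appears in the surround signal only.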
As has already been discussed, frequency domain techniques as described by Avendano and Jot are used for identifying and extracting the ambience information in stereo audio signals. This method is based on calculating an inter-channel coherency index and a non-linear mapping function, thereby allowing determining the time-frequency regions which consist mostly of ambience signal components. The ambience signals are then synthesized and used for feeding the surround channels of the multi-channel reproduction system.
One component of the direct/ambience upmixing process is extracting an ambience signal which is fed into the two back channels Ls, Rs. A signal has to meet certain requirements in order to be used as an ambience-time signal in the context of a direct/ambience upmixing process. One prerequisite is that relevant parts of the direct sound sources should not be audible in it, so that the listener is able to reliably localize the direct sound sources as being in front. This is of particular importance when the audio signal contains speech of one or several distinguishable speakers. Speech signals which are, in contrast, generated by a crowd of people are not necessarily disturbing for the listener when they are not localized in front of the listener.
If a certain amount of speech components were reproduced by the back channels, the position of the speaker or of the few speakers would be shifted from the front towards the back, to a certain distance from the user, or even behind the user, which results in a very disturbing sound experience. Such an experience is particularly disturbing when audio and video material are presented at the same time, such as, for example, in a movie theater.
One basic prerequisite for the audio signal of a movie (the sound track) is for the hearing experience to be in conformity with the experience generated by the pictures. Audible localization cues thus should not contradict visible localization cues. Consequently, when a speaker is to be seen on the screen, the corresponding speech should also be placed in front of the user.
The same applies to all other audio signals, i.e. this is not limited to situations in which audio signals and video signals are presented at the same time. Other audio signals of this kind are, for example, broadcasting signals or audio books. A listener is used to speech being generated by the front channels and would probably turn around to restore the accustomed experience if speech were all of a sudden to come from the back channels.
In order to improve the quality of the ambience signals, the German patent application DE 102006017280.9-55 suggests subjecting an ambience signal, once extracted, to a transient detection and performing transient suppression without considerable losses in energy in the ambience signal. Signal substitution is performed here in order to substitute regions including transients by corresponding signals without transients which nevertheless have approximately the same energy.
The AES Convention Paper “Descriptor-based spatialization”, J. Monceaux, F. Pachet et al., May 28-31, 2005, Barcelona, Spain, discloses a descriptor-based spatialization wherein detected speech is to be attenuated on the basis of extracted descriptors by muting only the center channel. A speech extractor is employed here. Action and transient times are used for smoothing modifications of the output signal. Thus, a multi-channel soundtrack without speech may be extracted from a movie. When a certain stereo reverberation characteristic is present in the original stereo downmix signal, an upmixing tool distributes this reverberation to every channel except for the center channel, so that reverberation can be heard. In order to prevent this, dynamic level control is performed for L, R, Ls and Rs in order to attenuate the reverberation of a voice.