The present invention relates to fingerprint technology for audio signals and in particular to calculating a fingerprint, using a fingerprint for synchronizing multichannel extension data with an audio signal, and characterizing an audio signal with the fingerprint.
Currently developed technologies allow an ever more efficient transmission of audio signals by data reduction, but also an increase of audio enjoyment by extensions, such as by the usage of multichannel technology.
Examples of such extensions of common transmission techniques have become known under the names “Binaural Cue Coding” (BCC) and “Spatial Audio Coding”. In this regard, reference is made by way of example to J. Herre, C. Faller, S. Disch, C. Ertel, J. Hilpert, A. Hoelzer, K. Linzmeier, C. Spenger, P. Kroon: “Spatial Audio Coding: Next-Generation Efficient and Compatible Coding of Multi-Channel Audio”, 117th AES Convention, San Francisco 2004, Preprint 6186.
In a sequentially operating transmission system, such as radio or Internet, such methods separate the audio program to be transmitted into audio base data or an audio signal, which can be a mono or a stereo downmix audio signal, and into extension data that can also be referred to as multichannel additional information or multichannel extension data. The multichannel extension data can be broadcast together with the audio signal, i.e. in a combined manner, or separately from the audio signal. As an alternative to broadcasting a radio program, the multichannel extension data can also be transmitted separately, for example to a version of the downmix channel already existing on the user side. In this case, transmission of the audio signal, for example in the form of an Internet download or a purchase of a compact disc or DVD, takes place spatially and temporally separate from the transmission of the multichannel extension data, which can be provided, for example, by a multichannel extension data server.
Basically, the separation of a multichannel audio signal into an audio signal and multichannel extension data has the following advantages. A “classic” receiver is able to receive and replay the audio base data, i.e. the audio signal, at any time, independent of the content and version of the multichannel additional data. This characteristic is referred to as backward compatibility. In addition, a receiver of the newer generation can evaluate the transmitted multichannel additional data and combine it with the audio base data, i.e. the audio signal, in such a manner that the complete extension, i.e. the multichannel sound, can be provided to the user.
In an exemplary application scenario in digital radio, these multichannel extension data allow the previously broadcast stereo audio signal to be extended to the multichannel format 5.1 with little additional transmission effort. The multichannel format 5.1 comprises five replay channels, i.e. a left channel L, a right channel R, a center channel C, a left rear channel LS (left surround) and a right rear channel RS (right surround). For this, the program provider generates the multichannel additional information on the transmitter side from multichannel sound sources, such as those found, for example, on a DVD-Audio/Video disc. Subsequently, this multichannel additional information can be transmitted in parallel to the audio stereo signal, which is broadcast as before and now contains a stereo downmix of the multichannel signal.
One advantage of this method is its compatibility with the existing digital radio transmission system. A classical receiver that cannot evaluate this additional information will still be able to receive and replay the two-channel sound signal as before, without any loss of quality.
A receiver of novel design, however, can evaluate and decode the multichannel information and reconstruct the original 5.1 multichannel signal from the same, in addition to the stereo sound signal received so far.
For allowing simultaneous transmission of the multichannel additional information as a supplement to the stereo sound signal used so far, two solutions are possible for compatible broadcast via a digital radio system.
The first solution is to combine the multichannel additional information with the coded downmix audio signal such that it can be added to the data stream generated by an audio encoder as a suitable and compatible extension. In this case, the receiver sees only one (valid) audio data stream and can, by means of a data distributor connected upstream, extract the multichannel additional information synchronously with the associated audio data block, decode it and output it as 5.1 multichannel sound.
This solution requires extending the existing infrastructure/data paths so that they can transport data signals consisting of downmix signals and extension data instead of merely the stereo audio signals as before. This is unproblematic and possible without additional effort, for example, when the downmix signals are transmitted in a data-reduced representation, i.e. as a bit stream. A field for the extension information can then be inserted into this bit stream.
A second possible solution is not to couple the multichannel additional information to the audio coding system in use. In this case, the multichannel extension data are not inserted into the actual audio data stream. Instead, transmission is performed via a dedicated but not necessarily temporally synchronized additional channel, which can, for example, be a parallel digital additional channel. Such a situation occurs, for example, when the downmix data, i.e. the audio signal, are routed in unreduced form, e.g. as PCM data in the AES/EBU data format, through a common audio distribution infrastructure existing in studios. These infrastructures are aimed at distributing audio signals digitally between various sources (“crossbars”) and/or processing them, for example by means of sound regulation, dynamic compression, etc.
In the second possible solution described above, the problem of time offset of the downmix audio signal and multichannel additional information in the receiver can occur, since both signals pass through different, non-synchronized data paths. A time offset between downmix signal and additional information, however, causes deterioration of the sound quality of the reconstructed multichannel signal, since then an audio signal with multichannel extension data, which actually do not belong to the current audio signal but to an earlier or later portion or block of the audio signal, is processed on the replay side.
Since the order of magnitude of the time offset can no longer be determined from the received audio signal and the additional information, a time-correct reconstruction and association of the multichannel signal in the receiver is not ensured, which will result in quality losses.
A further example of this situation arises when an already running 2-channel transmission system is to be extended to multichannel transmission, for example in a receiver for digital radio. Here, the downmix signal is frequently decoded by an audio decoder already existing in the receiver, for example a stereo audio decoder according to the MPEG-4 standard. The delay time of this audio decoder is not known or cannot be predicted exactly, due to the data compression of audio signals inherent in the system. Hence, the delay time of such an audio decoder cannot be compensated reliably.
In the extreme case, the audio signal can also reach the multichannel audio decoder via a transmission chain including analog parts. Here, digital/analog conversion takes place at a certain point in the transmission, followed by analog/digital conversion after further storage or transmission. Here too, no indication is available as to how a suitable delay compensation of the downmix signal in relation to the multichannel additional data can be performed. When the sampling frequencies of the analog/digital conversion and the digital/analog conversion differ slightly, the required compensation delay even drifts slowly over time, according to the ratio of the two sampling rates.
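The drift caused by slightly different converter clocks can be illustrated with a short calculation. The following Python sketch is an illustration only and is not taken from the cited art; the function name `drift_in_samples` and the example rates of 48 000 Hz and 48 001 Hz are assumptions chosen for the arithmetic.

```python
def drift_in_samples(source_samples, dac_rate_hz, adc_rate_hz):
    """Surplus (positive) or missing (negative) samples after re-digitizing.

    Hypothetical helper: the analog signal lasts source_samples / dac_rate_hz
    seconds, and the ADC captures adc_rate_hz samples per second during it.
    """
    if dac_rate_hz <= 0 or adc_rate_hz <= 0:
        raise ValueError("sampling rates must be positive")
    return source_samples * (adc_rate_hz - dac_rate_hz) / dac_rate_hz


# Assumed example: one hour of audio, converter clocks differing by 1 Hz.
one_hour = 3600 * 48_000
drift = drift_in_samples(one_hour, 48_000.0, 48_001.0)
print(drift, "samples,", 1000.0 * drift / 48_000.0, "ms per hour")
```

With these assumed rates, the offset between downmix signal and extension data grows by 3600 samples, i.e. 75 ms, per hour, so a delay that is set once does not remain valid.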
German patent DE 10 2004 046 746 B4 discloses a method and an apparatus for synchronizing additional data and base data. A user provides a fingerprint based on his stereo data. An extension data server identifies the stereo signal based on the obtained fingerprint and accesses a database for retrieving the extension data for this stereo signal. In particular, the server identifies an ideal stereo signal corresponding to the stereo signal existing at the user and generates two test fingerprints of the ideal audio signal belonging to the extension data. These two test fingerprints are then provided to the client who determines a compression/expansion factor and a reference offset therefrom, wherein, based on the reference offset, the additional channels are expanded/compressed and cut off at the beginning and the end. Thereupon, a multichannel file can be generated by using the base data and the extension data.
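How a compression/expansion factor and a reference offset might follow from two test fingerprints can be sketched as below. This is an assumed reading, not the procedure of the cited patent: it presumes that each test fingerprint marks a known sample position in the ideal signal, that the client has already located the matching positions in its own signal, and that the two time axes are related linearly. The function name and all positions are hypothetical.

```python
def stretch_and_offset(ideal_positions, user_positions):
    """Return (factor, offset) so that user_pos = factor * ideal_pos + offset.

    ideal_positions: sample positions of the two test fingerprints in the
                     ideal signal (assumed to be known from the server).
    user_positions:  positions where the client found them in its own signal.
    """
    p1, p2 = ideal_positions
    q1, q2 = user_positions
    if p1 == p2:
        raise ValueError("the two test fingerprints must lie at different positions")
    factor = (q2 - q1) / (p2 - p1)   # compression/expansion factor
    offset = q1 - factor * p1        # reference offset in samples
    return factor, offset


# Assumed example values, for illustration only.
factor, offset = stretch_and_offset((1_000, 101_000), (1_500, 101_600))
print(factor, offset)
```

Two anchor points are the minimum needed to determine both unknowns of a linear mapping, which is consistent with the use of two test fingerprints.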
Generally speaking, a fingerprint has to be characteristic of an audio signal. At the same time, it should be a highly compressed representation of the audio signal. This means that the fingerprint should use up significantly less memory space than the audio signal itself, since otherwise generating and using a fingerprint would be pointless.
Furthermore, a fingerprint should reproduce the time curve of an audio signal in order to be suitable both for synchronization purposes and for identification purposes. In particular with regard to identification or characterization, it is frequently the case that an audio signal, such as a radio transmission, does not replay an audio piece in full, but starts transmitting at a certain time in the piece and possibly even stops transmitting before the piece has ended. However, the fingerprint does not need to be decompressible, since fingerprint generation can be considered a particularly lossy compression.
Since fingerprint information is additional information, it should, as mentioned above, be a representation that is as compressed as possible but nevertheless characteristic. A further advantage is that the more compressed the representation is, the faster and easier it becomes to perform any correlations, i.e. calculations in which a fingerprint is involved, e.g. for synchronizing or characterizing an audio signal.
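The link between compactness and correlation effort can be made concrete with a minimal sketch. It assumes, purely for illustration, that a fingerprint is a short sequence with one value of +1 or -1 per audio block; this format, the function name `best_lag` and the example sequences are hypothetical and do not describe the fingerprint of the invention.

```python
def best_lag(reference, test, max_lag):
    """Return the lag (in blocks) at which test best matches reference.

    A positive lag means that reference[i] lines up with test[i + lag],
    i.e. the test sequence is delayed by that many blocks.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, r in enumerate(reference):
            j = i + lag
            if 0 <= j < len(test):
                score += r * test[j]   # plain cross-correlation term
        if score > best_score:
            best, best_score = lag, score
    return best


# Assumed example: the test fingerprint is the reference delayed by 3 blocks.
reference = [1, -1, -1, 1, 1, 1, -1, 1, -1, -1]
test = [0, 0, 0] + reference
print(best_lag(reference, test, max_lag=5))
```

The search costs roughly the fingerprint length multiplied by the number of candidate lags, so a fingerprint with one value per block is far cheaper to correlate than the audio samples themselves.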