1. Field of the Invention
The present invention relates to analyzing information signals, such as audio signals, and in particular to analyzing information signals consisting of a superposition of partial signals, it being possible for a partial signal to stem from an individual source or a group of individual sources.
2. Description of Prior Art
Ongoing development of digital distribution media for multimedia contents has led to a huge variety of data on offer, which has long exceeded the limits of manageability for human users. Thus, descriptions of the contents of the data by means of metadata become more and more important. In principle, the goal is to make it possible to search not only text files, but also e.g. music files, video files or other information signal files, with the same convenience as with common text databases. One approach in this context is the MPEG-7 standard.
In particular in analyzing audio signals, i.e. signals including music and/or voice, extracting fingerprints is very important.
What is also envisaged is to “enrich” audio data with metadata so as to retrieve metadata on the basis of a fingerprint, e.g. for a piece of music. The “fingerprint” is to provide a sufficient amount of relevant information, on the one hand, and is to be as short and concise as possible, on the other hand. “Fingerprint” thus designates a compressed information signal which is generated from a music signal and does not contain the metadata itself but serves to make reference to the metadata, e.g. by searching in a database, e.g. in a system for identifying audio material (“audioID”).
Normally, music data consists of the superposition of partial signals from individual sources. While in pop music, there are typically relatively few individual sources, i.e. the singer, the guitar, the bass guitar, the drums and a keyboard, the number of sources may become very large for an orchestra piece. An orchestra piece and a piece of pop music, for example, consist of a superposition of the tones emitted by the individual instruments. Thus, an orchestra piece, or any piece of music, represents a superposition of partial signals from individual sources, the partial signals being the tones generated by the individual instruments of the orchestra and/or pop music formation, and the individual instruments being individual sources.
Alternatively, even groups of original sources may be regarded as individual sources, so that one signal may be assigned at least two individual sources.
An analysis of a general information signal will be presented below, by way of example only, with reference to an orchestra signal. Analysis of an orchestra signal may be performed in a variety of ways. For example, there may be a desire to recognize the individual instruments and to extract the individual signals of the instruments from the overall signal, and to possibly translate them into musical notation, in which case the musical notation would act as “metadata”. Another possibility of analysis is to extract a dominant rhythm, it being easier to extract rhythms on the basis of the percussion instruments than on the basis of instruments which produce sustained tones, also referred to as harmonically sustained instruments. While percussion instruments typically include kettledrums, drums, rattles or other percussion instruments, the harmonically sustained instruments include all other instruments, such as violins, wind instruments, etc.
In addition, percussion instruments include all those acoustic or synthetic sound producers which contribute to the rhythm section on the grounds of their sound properties (e.g. the rhythm guitar).
Thus, it would be desirable, for example for rhythm extraction in a piece of music, to extract only percussive portions from the entire piece of music, and to then perform rhythm detection on the basis of these percussive portions without “interfering with” the rhythm detection by signals coming from the harmonically sustained instruments.
On the other hand, any analysis pursuing the goal of extracting metadata which requires exclusively information about the harmonically sustained instruments (e.g. a harmonic or melodic analysis) will benefit from an upstream separation and further processing of the harmonically sustained portions.
Very recently, there have been reports, in this context, about the utilization of blind source separation (BSS) and independent component analysis (ICA) techniques for signal processing and signal analysis. Fields of applications are, in particular, biomedical technology, communication technology, artificial intelligence and image processing.
Generally, the term BSS includes techniques for separating signals from a mix of signals with a minimum of prior knowledge of the nature of the signals and of the mixing process. ICA is a method based on the assumption that the sources underlying a mix are statistically independent of each other, at least to a certain degree. In addition, the mixing process is assumed to be time-invariant, and the number of the mixed signals is assumed to be no smaller than the number of the source signals underlying the mix.
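The ICA model just described may be sketched as follows. This is a minimal illustration only, with two artificial, statistically independent, non-Gaussian sources, an invented time-invariant mixing matrix, and a simple symmetric FastICA-style iteration standing in for the full range of ICA algorithms:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000
t = np.arange(n) / 1000.0

# Two artificial, statistically independent, non-Gaussian sources
s1 = np.sign(np.sin(2 * np.pi * 3 * t))   # square wave (sub-Gaussian)
s2 = rng.laplace(size=n)                  # Laplacian noise (super-Gaussian)
S = np.vstack([s1, s2])

# Invented time-invariant mixing matrix; as many mixtures as sources
A = np.array([[1.0, 0.5],
              [0.6, 1.0]])
X = A @ S                                 # observed mixture signals

# Whitening (decorrelate and normalize the mixtures)
Xc = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(Xc @ Xc.T / n)
Z = (E * d ** -0.5) @ E.T @ Xc

# Symmetric FastICA iteration with a tanh nonlinearity
W = rng.standard_normal((2, 2))
for _ in range(200):
    G = np.tanh(W @ Z)
    W_new = G @ Z.T / n - np.diag((1.0 - G ** 2).mean(axis=1)) @ W
    U, _, Vt = np.linalg.svd(W_new)       # symmetric decorrelation
    W = U @ Vt

Y = W @ Z   # estimated sources, up to permutation, sign and scale
```

Note that ICA recovers the sources only up to permutation, sign and scale, which is precisely why a subsequent grouping or interpretation step, as discussed below, is needed.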
Independent subspace analysis (ISA) represents an expansion of ICA. With ISA, the components are subdivided into independent subspaces, the components of which need not be statistically independent. By transforming the music signal, a multi-dimensional representation of the mixed signal is determined, so that the latter ICA assumption (that the number of mixed signals be no smaller than the number of source signals) is met. In the last few years, various methods of calculating the independent components have been developed. What follows is relevant literature also dealing, in part, with analyzing audio signals:
[1] M. A. Casey and A. Westner, “Separation of Mixed Audio Sources by Independent Subspace Analysis”, in Proc. of the International Computer Music Conference, Berlin, 2000
[2] I. F. O. Orife, “Riddim: A rhythm analysis and decomposition tool based on independent subspace analysis”, Master's thesis, Dartmouth College, Hanover, N.H., 2001
[3] C. Uhle, C. Dittmar and T. Sporer, “Extraction of Drum Tracks from Polyphonic Music Using Independent Subspace Analysis”, in Proc. of the Fourth International Symposium on Independent Component Analysis, Nara, Japan, 2003
[4] D. Fitzgerald, B. Lawlor and E. Coyle, “Prior Subspace Analysis for Drum Transcription”, in Proc. of the 114th AES Convention, Amsterdam, 2003
[5] D. Fitzgerald, B. Lawlor and E. Coyle, “Drum Transcription in the Presence of Pitched Instruments Using Prior Subspace Analysis”, in Proc. of the ISSC, Limerick, Ireland, 2003
[6] M. Plumbley, “Algorithms for Non-Negative Independent Component Analysis”, in IEEE Transactions on Neural Networks, 14(3), pp. 534-543, May 2003
In [1], a method of separating individual sources of mono audio signals is presented. [2] describes an application involving a subdivision into single traces and a subsequent rhythm analysis. In [3], a component analysis is performed to achieve a subdivision of a polyphonic piece into percussive and non-percussive sounds. In [4], independent component analysis (ICA) is applied to amplitude bases obtained from a spectrogram representation of a drum trace by means of generally calculated frequency bases. This is performed for transcription purposes. In [5], this method is expanded to include polyphonic pieces of music.
The first above-mentioned publication by Casey will be discussed below as an example of the prior art. Said publication describes a method of separating mixed audio sources by the technique of independent subspace analysis. This involves splitting up an audio signal into individual component signals using BSS techniques. To determine which of the individual component signals belong to a multi-component subspace, grouping is performed such that the components' mutual similarity is represented by a so-called ixegram. The ixegram is referred to as a cross-entropy matrix of the independent components. It is calculated by examining all individual component signals, in pairs, in a correlation calculation to find a measure of the mutual similarity of two components. Thus, exhaustive pair-wise similarity calculations are performed across all component signals, so that the result is a similarity matrix in which all component signals are plotted along a y axis, and all component signals are also plotted along the x axis. This two-dimensional array provides, for each component signal, a measure of its similarity with each other component signal. The ixegram, i.e. the two-dimensional matrix, is then used to perform clustering, for which purpose grouping is performed using a cluster algorithm on the basis of dyadic data. To perform optimum partitioning of the ixegram into k categories, a cost function is defined which measures the compactness within a cluster and the homogeneity between clusters. The cost function is minimized, so that what eventually results is an allocation of individual components to individual subspaces. If this is applied to a signal which represents a speaker against the continual roaring of a waterfall, the speaker results as one subspace, the reconstructed information signal of the speaker subspace exhibiting significant attenuation of the roaring of the waterfall.
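The exhaustive pair-wise similarity calculation underlying the ixegram may be sketched as follows. This is a simplified stand-in only: absolute correlation is used as the similarity measure, and a greedy threshold grouping replaces the cost-function minimization described above; all component signals and the threshold value are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4000
t = np.arange(n) / 1000.0

# Invented component signals: the first two stem from the same source,
# the third is unrelated noise
c1 = np.sin(2 * np.pi * 5 * t)
c2 = np.sin(2 * np.pi * 5 * t + 0.1) + 0.1 * rng.standard_normal(n)
c3 = rng.standard_normal(n)
comps = np.vstack([c1, c2, c3])

def similarity_matrix(components):
    """Exhaustive pair-wise similarity of all component signals
    (here: absolute correlation, as a simple ixegram-like measure)."""
    k = len(components)
    sim = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            r = np.corrcoef(components[i], components[j])[0, 1]
            sim[i, j] = sim[j, i] = abs(r)
    return sim

def group_components(sim, threshold=0.5):
    """Greedy threshold grouping: components whose mutual similarity
    exceeds the threshold are allocated to the same subspace."""
    k = sim.shape[0]
    labels = [-1] * k
    next_label = 0
    for i in range(k):
        if labels[i] == -1:
            labels[i] = next_label
            next_label += 1
        for j in range(i + 1, k):
            if sim[i, j] > threshold and labels[j] == -1:
                labels[j] = labels[i]
    return labels

sim = similarity_matrix(comps)
labels = group_components(sim)
```

Even this toy version makes the quadratic cost visible: the similarity matrix requires one correlation calculation per pair of component signals, which is precisely the computational burden criticized below.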
What is disadvantageous about the concepts described is the fact that it is very likely that the signal portions of one source will come to lie on different component signals. This is why, as has been described above, a complex and computing-time-intensive similarity calculation is performed among all component signals to obtain the two-dimensional similarity matrix, on the basis of which a classification of component signals into subspaces is eventually performed by means of a cost function to be minimized.
What is also disadvantageous is the fact that, in the case where there are several individual sources, i.e. where the output signal is not known upfront, the similarity distribution obtained after a lengthy calculation does not itself give an actual picture of the audio scene. Thus, the observer merely knows that certain component signals are similar to one another with regard to the minimized cost function. However, he/she does not know which information is contained in the subspaces eventually obtained, and/or which original individual source or which group of individual sources is represented by a subspace.
Independent subspace analysis (ISA) may therefore be exploited to decompose a time-frequency representation, i.e. a spectrogram, of an audio signal into independent component spectra. To this end, the above-described prior methods rely either on a computationally intensive determination of frequency and amplitude bases from the entire spectrogram, or on frequency bases defined upfront. Such frequency bases and/or profile spectra defined upfront consist, for example, in assuming that a piece is very likely to feature a trumpet, and in then using an exemplary spectrum of a trumpet for the signal analysis.
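The use of a profile spectrum defined upfront may be illustrated by a minimal sketch: a hypothetical, invented “trumpet-like” profile spectrum is projected onto each frame of an artificial magnitude spectrogram to estimate the amplitude envelope of that source over time. Actual prior subspace analysis, as in [4] and [5], is considerably more involved:

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, n_bins = 50, 64

# Invented "trumpet-like" profile spectrum: a few harmonic peaks
profile = np.zeros(n_bins)
profile[[5, 10, 15, 20]] = [1.0, 0.7, 0.4, 0.2]

# Artificial magnitude spectrogram (frames x frequency bins): the profile
# rises and falls in amplitude, on top of a small noise floor
true_envelope = np.sin(np.linspace(0.0, np.pi, n_frames))
spec = np.outer(true_envelope, profile) + 0.01 * rng.random((n_frames, n_bins))

# Projecting each spectrogram frame onto the normalized profile spectrum
# yields an estimate of that source's amplitude basis over time
p = profile / np.linalg.norm(profile)
amplitudes = spec @ p
```

The sketch also makes the limitation plain: the estimate is only as good as the assumed profile spectrum, which is exactly the weakness discussed next.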
This procedure has the disadvantage that one has to know all featured instruments upfront, which is, in principle, at odds with automated processing. A further disadvantage is that, if one wants to operate in a meticulous manner, there are, for example, not only trumpets, but many different kinds of trumpets, all of which differ in terms of their qualities of sound, or timbres, and thus in their spectra. If all types of exemplary spectra were employed for the component analysis, the method again becomes very time-consuming and expensive and exhibits very high redundancy, since typically not all feasible kinds of trumpets will feature in one piece, but only trumpets of one single kind, i.e. with one single profile spectrum, or perhaps with very few different timbres, i.e. with few profile spectra. The problem gets worse when it comes to different notes of a trumpet, especially as each tone has a spectrally stretched or compressed profile spectrum, depending on the pitch. Taking this into account also involves a huge computational expenditure.
On the other hand, decomposition on the basis of ISA concepts becomes extremely computationally intensive and susceptible to interference if the entire spectrogram is used. It shall be pointed out that a spectrogram typically consists of a series of individual spectra, a hopping time period being defined between the individual spectra, and each spectrum representing a specific number of samples, so that a spectrum has a specific time duration, i.e. a block of samples of the signal, associated with it. Typically, the duration represented by the block of samples from which a spectrum is calculated is considerably longer than the hopping time, so as to obtain a satisfactory spectrogram with regard to the frequency resolution and the time resolution required. On the other hand, however, it may be seen that this spectrogram representation is extraordinarily redundant. If one considers the case, for example, that the hopping time duration amounts to 10 ms and that a spectrum is based on a block of samples having a time duration of, e.g., 100 ms, every sample will come up in 10 consecutive spectra. The redundancy thus created may cause the computing-time requirements to reach astronomical heights, especially if a relatively large number of instruments are searched for.
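The redundancy figure above follows directly from the block and hop durations; a small sketch (the function names are, of course, merely illustrative):

```python
# Illustrative arithmetic for the spectrogram redundancy discussed above.

def redundancy_factor(block_ms, hop_ms):
    """How many consecutive spectra each sample contributes to."""
    return block_ms // hop_ms

def num_spectra(signal_ms, block_ms, hop_ms):
    """Number of complete analysis blocks of length block_ms,
    advanced by hop_ms, that fit into a signal of length signal_ms."""
    if signal_ms < block_ms:
        return 0
    return (signal_ms - block_ms) // hop_ms + 1

# With a 100 ms block and a 10 ms hop, every sample appears in 10 spectra,
# and one second of signal already yields 91 largely overlapping spectra.
```

Every sample thus enters the component analysis many times over, which is the source of the computing-time problem described.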
In addition, the approach of working on the basis of the entire spectrogram is disadvantageous in cases where not all sources contained in a signal are to be extracted, but where, for example, only sources of a specific kind, i.e. sources having a specific characteristic, are to be extracted. Such a characteristic may relate to percussive sources, i.e. percussion instruments, or to so-called pitched instruments, also referred to as harmonically sustained instruments, which are typical melody instruments, such as the trumpet, the violin, etc. A method operating on the basis of all these sources will then be too time-consuming and expensive and, moreover, not robust enough if, for example, only some sources, i.e. those sources which are to meet a specific characteristic, are to be extracted. In this case, individual spectra of the spectrogram wherein such sources do not occur, or occur only to a very small extent, will corrupt, or “blur”, the overall result, since these spectra of the spectrogram are naturally included in the eventual component analysis calculation just as much as the significant spectra.