1. Field of the Invention
The present invention relates to characterizing, or identifying, audio signals with regard to their content, in particular to producing and using different fingerprints for an audio signal.
2. Description of Prior Art
Recent years have seen a sharp increase in the availability of multimedia data, in particular audio data. This development is due to a number of technical factors, including the wide availability of the internet, of high-performance computers, and of high-performance methods of data compression, i.e. source coding, of audio data. One example is MPEG 1/2 layer 3, also referred to as MP3.
The huge amounts of audiovisual data available worldwide, for example on the internet, call for concepts that enable these data to be evaluated, categorized or managed by content-related criteria. There is a need to search for and find multimedia data specifically by stating useful criteria.
This requires the use of so-called “content-based” techniques extracting, from the audiovisual data, so-called “features” representing important characteristic content properties of the signal of interest. On the basis of such features or combinations of such features, similarity relationships, or common features, between the audio signals may be derived. This process is generally done by comparing, or relating, the extracted feature values from different signals, which shall be referred to as “pieces” herein.
U.S. Pat. No. 5,918,223 discloses a method for content-based analysis, storage, retrieval and segmentation of audio information. Analysis of audio data produces a set of numerical values, also referred to as a feature vector, which may be used to classify the individual audio pieces and to rank the similarity between them. The audio pieces are typically stored in a multimedia database or on the World Wide Web.
In addition, the analysis enables the description of user-defined classes of audio pieces based on an analysis of a set of audio pieces, which are all members of a user-defined class. The system is able to find individual sound sections within a relatively long sound piece, which enables the audio record to be automatically segmented into a series of shorter audio segments.
The features used for characterizing or classifying audio pieces with regard to their content include the loudness of a piece, the pitch, brightness, bandwidth and so-called Mel-frequency cepstral coefficients (MFCCs), each measured at periodic intervals in the audio piece. The per-block or per-frame values are stored, and their first derivative is formed. Then, specific statistical quantities, for example the mean value or standard deviation, of each of these features, including their first derivatives, are computed to describe the variation over time. This set of statistical quantities forms the feature vector. The feature vector of the audio piece is stored in a database in association with the original file, and a user is able to access the database so as to fetch appropriate audio pieces.
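The construction of such a feature vector may be sketched as follows. This is a minimal illustration only, not the implementation of the cited patent: the function names, the use of a simple frame-to-frame difference as the first derivative, and the example values are all assumptions made for the sketch.

```python
from statistics import mean, pstdev

def first_difference(values):
    """Frame-to-frame difference, used here as a discrete first derivative."""
    return [b - a for a, b in zip(values, values[1:])]

def feature_vector(per_frame):
    """Build a flat feature vector from per-frame feature series.

    per_frame maps a feature name to its list of per-frame values.
    For each feature (in sorted name order) the vector holds:
    mean, standard deviation, mean of the first difference,
    standard deviation of the first difference.
    """
    vector = []
    for name in sorted(per_frame):
        series = per_frame[name]
        delta = first_difference(series)
        vector += [mean(series), pstdev(series), mean(delta), pstdev(delta)]
    return vector

# Hypothetical per-frame measurements for one short audio piece.
frames = {
    "loudness": [0.2, 0.4, 0.6, 0.8],
    "pitch": [220.0, 220.0, 440.0, 440.0],
}
fv = feature_vector(frames)  # 2 features x 4 statistics = 8 values
```

In this sketch the length of the vector depends only on the number of features and statistics, not on the duration of the piece, which is what allows pieces of different lengths to be compared.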
The database system is capable of quantifying the distance, in an n-dimensional space, between two n-dimensional vectors. It is further possible to produce classes of audio pieces by specifying a set of audio pieces belonging to a class. Examples of classes are bird sounds, rock music, etc. The user is able to search the audio-piece database using specific methods. The result of a search is a list of sound files ordered by their distance from the specified n-dimensional vector. The user may search the database with regard to similarity features, with regard to acoustic and/or psycho-acoustic features, with regard to subjective features or with regard to specific noises, for example the buzzing of bees.
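A distance-ordered search of this kind may be illustrated as follows. The sketch assumes a Euclidean distance, which the text above does not specify; the file names and vectors are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_distance(query, database):
    """Return (name, distance) pairs sorted from nearest to farthest."""
    scored = [(name, euclidean(query, vec)) for name, vec in database.items()]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical two-dimensional feature vectors.
database = {
    "birdsong.wav": [0.9, 0.1],
    "rock.wav": [0.2, 0.8],
    "bees.wav": [0.6, 0.3],
}
ranking = rank_by_distance([0.8, 0.2], database)
names = [name for name, _ in ranking]
```

Here `names` lists the files from the most similar to the least similar with respect to the query vector.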
The specialist publication “Multimedia Content Analysis”, Yao Wang et al., IEEE Signal Processing Magazine, November 2000, pp. 12 to 36, discloses a similar concept for characterizing multimedia pieces. Features proposed for classifying the contents of a multimedia piece include time-domain features and frequency-domain features. These include the loudness, the pitch as the fundamental frequency of the audio waveform, and spectral features, such as the energy content of a band in relation to the total energy content, cut-off frequencies in the spectral curve, etc. In addition to short-term features, which relate to the quantities mentioned per block of samples of the audio signal, long-term quantities relating to a longer period of the audio piece are also proposed.
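As an illustration of one such spectral feature, the energy content of a frequency band in relation to the total energy content may be computed roughly as follows. The function name, the bin-index convention and the example power spectrum are assumptions of this sketch and are not taken from the cited publication.

```python
def band_energy_ratio(power_spectrum, lo_bin, hi_bin):
    """Energy in bins lo_bin..hi_bin-1 divided by the total energy.

    power_spectrum is a list of non-negative per-bin power values
    for one block of samples. Returns 0.0 for an all-zero spectrum.
    """
    total = sum(power_spectrum)
    if total == 0:
        return 0.0
    return sum(power_spectrum[lo_bin:hi_bin]) / total

# Hypothetical power spectrum of one block (six frequency bins).
power = [1.0, 1.0, 4.0, 4.0, 0.0, 0.0]
ratio = band_energy_ratio(power, 2, 4)  # share of energy in bins 2 and 3
```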
Various categories are suggested for characterizing audio pieces, such as animal sounds, the ringing of bells, sounds of a crowd of people, laughter, machine noise, musical instruments, the male voice, the female voice, telephone sounds or sounds of water.
The choice of features is problematic: the computing expenditure required for extracting a feature should be moderate so that a characterization is achieved quickly, yet the feature must be characteristic of the audio piece, so that two different pieces have features that differ from each other.
For characterizing an audio signal, a so-called feature, which is also referred to as a fingerprint, is extracted from the audio signal, as has already been described. Two different requirements are placed upon the type of feature. The first requirement is that the fingerprint is to identify the audio signal as uniquely as possible. The second requirement is that the fingerprint is to contain as little information as possible, i.e. that the fingerprint is to use as little memory space as possible. These two requirements conflict with each other. The simplest way to see this is that the best “fingerprint” for an audio signal is the audio signal itself, i.e. the sequence of samples that represents the audio signal. Such a fingerprint, however, would grossly violate the second requirement, since the fingerprint of the audio signal would take up far too much memory, which would, for one thing, make it impossible to store a very large number of fingerprints for a very large number of audio signals in a music recognition database. A further disadvantage is that the computing time required by matching algorithms, which compare a search fingerprint with a plurality of stored database fingerprints, is proportional to the size of the search fingerprint and/or the database fingerprint.
The other extreme would be, for example, to take only a mean value of all samples of a piece. This mean value requires very little memory space and would therefore be best suited, in terms of size, both for a large music database and for matching algorithms. However, such a fingerprint would have very little characterizing strength and would not be robust to changes in the signal that are irrelevant to a human listener.
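The two extremes may be made concrete with a small sketch. The audio format used for the size estimate (3 minutes, 44.1 kHz, 16-bit, stereo) and the two sample sequences are assumptions chosen purely for illustration.

```python
from statistics import mean

def raw_fingerprint_bytes(seconds, sample_rate=44100, bytes_per_sample=2, channels=2):
    """Memory needed if the audio signal itself serves as the fingerprint."""
    return seconds * sample_rate * bytes_per_sample * channels

def mean_fingerprint(samples):
    """The opposite extreme: a single number for the entire piece."""
    return mean(samples)

# One extreme: the full signal of an assumed 3-minute piece.
raw_size = raw_fingerprint_bytes(180)  # 31,752,000 bytes

# Other extreme: two clearly different (hypothetical) pieces
# that nevertheless receive the same mean-value fingerprint.
piece_a = [0.0, 1.0, 0.0, -1.0] * 4
piece_b = [0.5, -0.5] * 8
collision = mean_fingerprint(piece_a) == mean_fingerprint(piece_b)
```

In this sketch `collision` is true although the two sample sequences differ, which shows that the mean value alone cannot distinguish the pieces, while the raw signal needs roughly 32 megabytes for a single piece.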
An ideal compromise between the characterizing strength, on the one hand, and the data volume of the fingerprint, on the other hand, does not exist in general; it is typically established empirically or depends on the circumstances of the respective application in terms of available memory space and available transmission capacity. This procedure has the drawback that each type of fingerprint is ideally suited for only one specific application and is more or less unsuitable for other applications. It shall be pointed out in this context that audio signal identification and/or characterization is of particular interest only if there are very large feature databases whose fingerprints can be compared to a search fingerprint, either to directly identify an audio signal or to characterize it by outputting a measure of its similarity to one or several of the audio signals in the database. If it is found that a specific type of fingerprint was favorable for one application but is no longer favorable for another, feature extraction must be performed anew for the large number of audio signals whose fingerprints are stored in the database, so as to obtain a new feature database that represents an ideal compromise between the characterizing strength, on the one hand, and the memory space, on the other hand, for the current application. For one thing, the original pieces may not be available at all for renewed feature extraction (an audio database may, for example, comprise 500,000 audio pieces). For another, renewed extraction results, if it is possible at all, in large-scale expenditure for feature extraction processing to fill and/or to “train” the “new” database.
This problem is aggravated in particular by the fact that, although a worldwide web with in principle almost unlimited storage capacity is available in the form of the internet, it is impossible to let many different “fingerprint producers” know at any time which fingerprint is most suitable for which application, so that sufficient fingerprint database material is always available to perform useful audio signal identification and/or characterization.
A further problem is that fingerprints must also be transmitted via a wide variety of transmission channels. A transmission channel having a very low transmission capacity is, for example, the over-the-air transmission channel of a mobile phone. In addition to the characterizing strength and the storage capacity for the database, the bandwidth of the transmission channel is therefore also a decisive factor. It would make no sense to produce a fingerprint that has a high characterizing strength but can hardly be transmitted, or cannot be transmitted at all, via the narrow-band transmission channel. The ideal fingerprint for such an application is therefore additionally determined by the transmission channel via which the fingerprint is to be transmitted, e.g. to a search database.