1. Field of the Invention
The present invention generally relates to an apparatus and a method for robust classification of audio signals, as well as to a method for establishing and operating an audio-signal database, in particular to an apparatus and a method for classifying audio signals wherein a fingerprint for the audio signal is generated and evaluated.
2. Description of Prior Art
In recent years, the availability of multimedia data material has increased steadily. High-performance computers, the strong increase in availability of broad-band data networks, high-performance compression methods, and high-capacity storage media have made a major contribution to this development. There is a particularly strong increase in the number of available audio contents. Audio files coded in accordance with the MPEG1/2-Layer 3 standard, referred to as MP3 for short, are particularly widely used.
The large amount of audio data which very often represent pieces of music makes it necessary to develop apparatus and methods enabling audio data to be classified and specific audio data to be found. Since the audio data are present in various formats which do not enable exact reconstruction of the audio content in every case due to, for example, lossy compression or to transmission via a transmission channel subject to distortion, there is a need for methods which assess and/or compare audio signals on the grounds of a content-based characterization rather than on the grounds of the representation in terms of values.
One field of application of a means for content-based characterization of an audio signal is, for example, the provision of metadata to an audio signal. This is particularly relevant in connection with pieces of music. Here, the title and the performer may be determined for a given portion of a piece of music. Thus, additional information, e.g. about the album containing the music title, as well as copyright information may also be determined.
With content-based characterization, features of an audio signal must be extracted from the present representation of an audio signal. It has proven advantageous, in particular, to associate an audio signal with a set of data which is obtained on the basis of the audio content of the audio signal and may be used for classifying, searching for or comparing an audio signal. Such a set of data is also referred to as a fingerprint.
In recent years, a number of methods for content-based indexing of audio signals have been published. By means of such methods, music signals or, generally, acoustic signals may be associated with a specific class or pattern on account of a preset property. Thus, acoustic signals may be categorized by specific similarities.
The major requirements placed upon a fingerprint of an audio signal will be described in more detail below. Due to the large number of audio signals available, it is necessary that the fingerprint may be produced with moderate computing expenditure. This reduces the time required for generating the fingerprint; without this, large-scale application of the fingerprint is not possible. In addition, the fingerprint must not take up too much memory. In many cases it is required to store a large number of fingerprints in one database. It may be required, in particular, to keep a large number of fingerprints in the main memory of a computer. This clearly shows that the data volume of the fingerprint must be considerably smaller than the volume of data of the actual audio signal. It is required, on the other hand, that the fingerprint be characteristic of an audio piece. This means that two audio signals with different contents must also have different fingerprints. In addition, one important requirement placed upon a fingerprint is that the fingerprints of two audio signals which represent the same audio content but differ from each other by, e.g., a distortion, be sufficiently similar so as to be identified as belonging together in a comparison. This property is typically referred to as robustness of the fingerprint. This is particularly important where two audio signals that have been compressed and/or coded using different methods are to be compared. Furthermore, audio signals that have been transmitted via a channel subject to distortion are to have fingerprints which are very similar to the original fingerprint.
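The robustness requirement described above may be illustrated by a minimal sketch. It assumes, purely for illustration, that a fingerprint is a short vector of numbers, that similarity is measured by the Euclidean distance, and that a match is accepted below a tuning threshold; the names `distance`, `identify`, `database` and `threshold` are hypothetical and are not taken from any of the documents cited.

```python
import math

def distance(fp_a, fp_b):
    """Euclidean distance between two equal-length fingerprint vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(fp_a, fp_b)))

def identify(query, database, threshold):
    """Return the label of the closest stored fingerprint, or None.

    `database` maps a label to its fingerprint. `threshold` is an assumed
    tuning parameter: a closest match farther away than this is rejected.
    """
    best_label, best_dist = None, float("inf")
    for label, stored in database.items():
        d = distance(query, stored)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None

# Illustrative values only, not measured data.
database = {"song_a": [0.9, 0.1, 0.4], "song_b": [0.2, 0.8, 0.5]}
distorted_a = [0.85, 0.15, 0.42]   # same content, slightly distorted
unrelated = [5.0, 5.0, 5.0]        # content not present in the database

match = identify(distorted_a, database, threshold=0.2)
no_match = identify(unrelated, database, threshold=0.2)
```

In this sketch, a fingerprint is robust if a distorted version of a piece stays within the threshold of its stored fingerprint, and characteristic if the fingerprints of other pieces remain outside it.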
A number of methods are already known by which features and/or fingerprints may be extracted from an audio signal. U.S. Pat. No. 5,918,223 discloses a method for content-based analysis, storage, retrieval and segmentation of audio information. An analysis of audio data creates a set of numerical values which is also referred to as a feature vector and which may be used to classify and rank the similarity between individual audio pieces. The features used for characterizing and/or classifying audio pieces with regard to their contents are the loudness of a piece, the pitch, the clarity of sound, the bandwidth and the so-called Mel-frequency cepstral coefficients (MFCCs) of an audio piece. The values per block or frame are stored and subjected to a first time derivative. From this, statistical quantities are calculated, such as the mean value or the standard deviation, the statistical quantities being calculated for each of these features, including the first derivatives, so as to describe a variation over time. This set of statistical quantities forms the feature vector. The feature vector is thus a fingerprint of the audio piece and may be stored in a database.
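The formation of a feature vector from per-frame values, their first time derivative and statistical quantities may be sketched as follows. This is a simplified illustration and not the implementation of U.S. Pat. No. 5,918,223: the derivative is approximated by the difference between successive frames, the feature names are hypothetical, and the input values are invented for the example.

```python
from statistics import mean, pstdev

def first_difference(values):
    """Discrete approximation of the first time derivative: x[t] - x[t-1]."""
    return [b - a for a, b in zip(values, values[1:])]

def feature_vector(frame_features):
    """Build a summary vector from per-frame feature tracks.

    `frame_features` maps a feature name (hypothetical, e.g. "loudness") to
    the list of its per-frame values; each track needs at least two frames.
    For each track and for its first difference, the mean and the population
    standard deviation are appended.
    """
    vector = []
    for name in sorted(frame_features):   # fixed order keeps vectors comparable
        track = frame_features[name]
        for series in (track, first_difference(track)):
            vector.append(mean(series))
            vector.append(pstdev(series))
    return vector

# Illustrative per-frame values only.
example = {
    "loudness": [1.0, 2.0, 3.0, 4.0],
    "pitch": [100.0, 100.0, 100.0, 100.0],
}
fingerprint = feature_vector(example)
```

With two features, the resulting vector has eight entries (mean and standard deviation of each track and of each difference track), irrespective of the number of frames, which is what makes it compact enough to store in a database.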
The specialist publication “Multimedia Content Analysis”, Yao Wang et al., IEEE Signal Processing Magazine, November 2000, pages 12 to 36, discloses a similar concept to index and characterize multimedia pieces. To ensure efficient association of an audio signal with a specific class, a number of features and classifiers have been developed. Features proposed for classifying the contents of a multi-media piece are time-domain features or frequency-domain features. These include the volume, the pitch as well as the base frequency of an audio-signal form, spectral features, such as the energy content of a band with regard to the total energy content, cutoff frequencies in the spectral curve and others. In addition to short-term features relating to the so-called quantities per block of samples of the audio signal, long-term quantities are also proposed which relate to a relatively long period of time of the audio piece. Further typical features are formed by forming a time difference of the respective features. The features obtained block by block are rarely passed on as such directly for classification, since their data rate is still much too high. A common form of further processing consists in calculating short-term statistics. This includes, e.g., the formation of a mean value, a variance, and time-related correlation coefficients. This reduces the data rate and results, on the other hand, in an enhanced recognition of an audio signal.
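One of the spectral features mentioned above, the energy content of a band with regard to the total energy content, may be sketched as follows. The sketch assumes a naive discrete Fourier transform over a single block and a band given by bin indices; the function names and the test signal are illustrative assumptions and are not taken from the publication cited.

```python
import cmath
import math

def power_spectrum(block):
    """Naive O(n^2) DFT power spectrum for bins 0..n/2 of a real-valued block."""
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(block))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def band_energy_ratio(block, first_bin, last_bin):
    """Energy in bins first_bin..last_bin (inclusive) divided by total energy."""
    spectrum = power_spectrum(block)
    total = sum(spectrum)
    if total == 0.0:
        return 0.0   # silent block: define the ratio as zero
    return sum(spectrum[first_bin:last_bin + 1]) / total

# A sinusoid with 4 cycles in a 32-sample block: its energy sits in bin 4.
block = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
ratio_low = band_energy_ratio(block, 0, 7)
ratio_high = band_energy_ratio(block, 8, 16)
```

For this test signal, nearly all of the energy falls into the lower band. In practice a fast Fourier transform and a window function would be used instead of the naive transform shown here.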
WO 02/065782 describes a method of forming a fingerprint for a multimedia signal. The method is based on the extraction of one or several features from an audio signal. For this purpose, the audio signal is divided into segments, and each segment is processed by blocks and frequency bands. The band-by-band calculation of the energy, the tonality and the standard deviation of the power density spectrum may be mentioned as examples.
In addition, DE 101 34 471 and DE 101 09 648 disclose an apparatus and a method for classifying an audio signal, wherein the fingerprint is obtained on the basis of a measure for the tonality of the audio signal. Here, the fingerprint enables audio signals to be classified in a robust and content-based manner. The above documents give several possibilities of generating a tonality measure across an audio signal. In each case, the calculation of the tonality is based on a conversion of a segment of the audio signal to the spectral domain. The tonality can then be calculated in parallel for a frequency band or for all frequency bands. The disadvantage of such a method is that the fingerprint is no longer sufficiently informative as the distortion of the audio signals increases, and that it is then no longer possible to recognize the audio signal with satisfactory reliability. However, distortions occur in very many cases, in particular when audio signals are transmitted via a system exhibiting low transmission quality. Currently, this is the case, in particular, with mobile systems and/or in the event of high data compression. Such systems, such as mobile telephones, are primarily configured for bi-directional transmission of voice signals and frequently transmit music signals only with very poor quality. Added to this are other factors which may have a negative impact on the quality of a transmitted signal, e.g. microphones of poor quality, channel interferences and transcoding effects. The consequence of a deterioration of the signal quality is a sharply decreased recognition performance of an apparatus for identifying and classifying a signal.
Research has shown that, in particular when using an apparatus and/or a method according to DE 101 34 471 and DE 101 09 648, no further significant improvements of the recognition performance can be achieved by changes to the system as long as the recognition criterion of tonality (spectral flatness measure) is maintained.
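For illustration, the spectral flatness measure is commonly defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum values of a band. The following minimal sketch uses this common definition; the exact calculation in DE 101 34 471 and DE 101 09 648 may differ, and the `floor` parameter and the input values are assumptions made for the example.

```python
import math

def spectral_flatness(power_values, floor=1e-12):
    """Geometric mean divided by arithmetic mean of power-spectrum values.

    The result is close to 1 for a flat (noise-like) spectrum and close to 0
    for a spectrum dominated by a few lines (tonal). `floor` is an assumed
    guard that avoids taking the logarithm of zero.
    """
    clipped = [max(p, floor) for p in power_values]
    log_mean = sum(math.log(p) for p in clipped) / len(clipped)
    arithmetic_mean = sum(clipped) / len(clipped)
    return math.exp(log_mean) / arithmetic_mean

# Illustrative spectra only.
flat = spectral_flatness([1.0, 1.0, 1.0, 1.0])          # noise-like band
peaky = spectral_flatness([100.0, 0.01, 0.01, 0.01])    # one dominant line
```

The geometric mean is computed via the mean of the logarithms, which avoids overflow or underflow when many spectral values are multiplied.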
It may be stated that known methods for classifying audio signals and/or for forming a fingerprint of an audio signal mostly cannot meet the demands placed upon them. Problems still exist with regard to the robustness against distortions of the audio signal and against interferences superimposed on the audio signal.
In many current systems for storing and transmitting audio signals, high signal distortions and disturbances occur. This is the case, in particular, when a lossy data compression method or a disturbed transmission channel is used. Lossy compression is used whenever the data rate required for storing or transmitting an audio signal is to be reduced. Examples are data compression according to the MP3 standard and the methods used with digital mobile transceivers. In both cases, low data rates are achieved in that the signals are quantized as coarsely as possible for the transmission. The audio bandwidth is, in part, highly limited. In addition, signal portions which are not perceived at all by the human ear or are only perceived to a very small extent because they are, e.g., masked by other signal portions, are suppressed.
Disturbances, or interferences, on the transmission channel are very frequent with mobile voice transmission applications in common use today. More often than not, in particular, the reception quality is very poor, which becomes noticeable by means of increased noise on the audio signal transmitted. In addition, the transmission may be interrupted completely for a short time, so that a short section of an audio signal to be transmitted is missing completely. During such an interruption, a mobile phone generates a noise signal which is perceived to be less disturbing by a human user than full blanking of the audio signal. Finally, disturbances, or interferences, occur also during the handover from one mobile radio cell to another. All these interference effects must not represent too strong a corruption of the fingerprint, so that an identification of a disturbed audio signal is still possible at a high level of reliability.
Finally, the transmission of audio signals is also influenced by the frequency response characteristic of the audio part. In particular small and cheap components, as are often used with mobile devices, have a pronounced frequency response and thus distort the audio signals to be identified.
While a human listener may identify an audio signal with a high level of reliability even when the interferences and distortions described occur, the recognition performance of audio signal recognition means utilizing a conventional fingerprint of an audio signal decreases significantly when disturbed audio signals occur.