1. Field of the Invention
The present invention relates to the characterization of audio signals in respect of their content and in particular to a concept for classifying or indexing audio pieces in respect of their content so as to make it possible to search such multimedia data.
2. Description of the Related Art
In the last few years the availability of multimedia data material, i.e. of audio data, has greatly increased. This development has been driven by a number of technical factors, including the wide availability of the internet, of powerful computers and of powerful methods for data compression, i.e. source coding, of audio data. An example of the latter is MPEG 1/2 layer 3, also known as MP3.
The gigantic amounts of audiovisual data which are available worldwide, e.g. from the internet, necessitate concepts which enable these data to be assessed, catalogued and administered on the basis of content criteria. There is a need to search for and find multimedia data in a targeted manner by specifying meaningful criteria.
This requires the use of so-called “content-based” techniques, which extract so-called “features”, which represent important characteristic content properties of the signal of interest, from the audiovisual data. On the basis of such features, or combinations of such features, similarities or commonalities between the audio signals can be deduced. This process is normally accomplished by comparing the extracted feature values from various signals, also called “pieces” here, or by setting them in relation to one another.
U.S. Pat. No. 5,918,223 discloses a method for the content-based analysis, storage, retrieval and segmentation of audio information. An analysis of audio data produces a set of numerical values, also known as the feature vector, which can be used to classify the similarity between individual audio pieces, which are typically stored in a multimedia data bank or on the World Wide Web, and to arrange them in ranking order.
The analysis also makes it possible to describe user-defined categories of audio pieces based on an analysis of a set of audio pieces which are all members of a user-defined category. The system is capable of finding individual sound sections within a longer sound piece, thus making it possible for the audio recording to be automatically segmented into a series of shorter audio segments.
The loudness of a piece, the pitch, the brightness, the bandwidth and the so-called mel-frequency cepstral coefficients (MFCCs), each measured at periodic intervals in the audio piece, are used as features for characterizing or classifying audio pieces in respect of their content. The values per block or frame are stored and the first derivative is formed. Specific statistical values, e.g. the mean value or the standard deviation, are then calculated for each of these features and for their first derivatives in order to describe the variation with time. This set of statistical values forms the feature vector. The feature vector of the audio piece is stored in a data bank in association with the original file. A user can then access the data bank to retrieve the relevant audio pieces.
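The construction of such a feature vector can be illustrated by the following minimal Python sketch. It is not taken from the cited patent; the function names, the use of a per-frame RMS level as a stand-in for loudness, the non-overlapping frames and the test signal are all assumptions made purely for illustration.

```python
import math
from statistics import mean, pstdev


def frame_rms(samples, frame_len):
    """Return the RMS level of each non-overlapping frame (a simple loudness proxy)."""
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    return [
        math.sqrt(sum(x * x for x in samples[s:s + frame_len]) / frame_len)
        for s in starts
    ]


def first_difference(values):
    """Discrete first derivative: difference between consecutive frame values."""
    return [b - a for a, b in zip(values, values[1:])]


def feature_vector(per_frame):
    """Build a feature vector from a dict {feature name: per-frame values}.

    Each feature contributes four numbers: mean and standard deviation of
    the values, then mean and standard deviation of their first difference.
    Every feature needs at least two frames.
    """
    vector = []
    for name in sorted(per_frame):  # fixed order keeps vectors comparable
        values = per_frame[name]
        delta = first_difference(values)
        vector += [mean(values), pstdev(values), mean(delta), pstdev(delta)]
    return vector


# Hypothetical input: a 440 Hz tone at 8 kHz whose level rises over time.
samples = [(t / 2048) * math.sin(2 * math.pi * 440 * t / 8000) for t in range(2048)]
loudness = frame_rms(samples, frame_len=256)
print(feature_vector({"loudness": loudness}))
```

Further per-frame features such as pitch or brightness would be added as additional dictionary entries, each contributing its own four statistical values to the vector.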
The data bank system is capable of quantifying the distance in an n-dimensional space between two n-dimensional vectors. It is also possible to produce categories of audio pieces by specifying a set of audio pieces which belong to the same category. Examples of such categories are bird chirping, rock music, etc. The user can search through the audio data bank using specific methods. The result of such a search is a list of sound files ordered according to their distance from the specified n-dimensional vector. The user can search through the data bank in terms of similarity features, in terms of acoustic or psychoacoustic features, in terms of subjective features or in terms of special noises, e.g. the humming of bees.
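A distance-ordered search of this kind can be sketched as follows. The cited patent is described here only as quantifying a distance in n-dimensional space; the choice of the Euclidean distance, the function names and the example file names and vectors are assumptions for illustration.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, data_bank):
    """Return (distance, file name) pairs, nearest first.

    data_bank maps a file name to its stored feature vector.
    """
    scored = [(euclidean_distance(query, vec), name) for name, vec in data_bank.items()]
    return sorted(scored)


# Hypothetical data bank with two-dimensional feature vectors.
data_bank = {
    "birds.wav": [0.0, 0.0],
    "rock.wav": [3.0, 4.0],
    "bees.wav": [1.0, 0.0],
}
for distance, name in rank_by_distance([0.0, 0.0], data_bank):
    print(f"{name}: {distance:.2f}")
```

In practice the individual vector components would probably have to be scaled before comparison, since features with large numerical ranges otherwise dominate the distance.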
The technical publication “Multimedia Content Analysis”, Yao Wang et al., IEEE Signal Processing Magazine, November 2000, pp. 12 to 36, discloses a similar concept for characterizing multimedia pieces. Proposed features for classifying the content of a multimedia piece are time domain features or frequency domain features. These include the loudness, the pitch as the fundamental frequency of an audio waveform, spectral features, e.g. the energy content of a band as a fraction of the total energy content, threshold frequencies in the spectral profile, etc. In addition to short-time features, which relate to the cited quantities per block of sampled audio signal values, long-time features, which relate to a longer duration of the audio piece, are also proposed.
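As an illustration of one such short-time spectral feature, the energy content of a band as a fraction of the total energy content of a block might be computed as in the following sketch. It is not taken from the cited publication; the direct (unoptimized) discrete Fourier transform, the function names, the band limits and the test signal are assumptions.

```python
import cmath
import math


def power_spectrum(block):
    """One-sided power spectrum of a real block via a direct DFT.

    Bins other than DC and Nyquist are doubled so that the sum over the
    one-sided spectrum is proportional to the total energy of the block.
    """
    n = len(block)
    spectrum = []
    for k in range(n // 2 + 1):
        x_k = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(block))
        weight = 1.0 if k == 0 or (n % 2 == 0 and k == n // 2) else 2.0
        spectrum.append(weight * abs(x_k) ** 2)
    return spectrum


def band_energy_fraction(block, sample_rate, f_lo, f_hi):
    """Fraction of the block's energy in the band f_lo <= f < f_hi (in Hz)."""
    spectrum = power_spectrum(block)
    total = sum(spectrum)
    if total == 0.0:
        return 0.0  # silent block: no energy in any band
    n = len(block)
    band = sum(p for k, p in enumerate(spectrum) if f_lo <= k * sample_rate / n < f_hi)
    return band / total


# Hypothetical block: 64 samples of a 1 kHz sine sampled at 8 kHz.
block = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
print(band_energy_fraction(block, 8000, 500, 1500))   # nearly all energy in this band
print(band_energy_fraction(block, 8000, 2000, 3000))  # almost none in this band
```

A fast Fourier transform would normally replace the direct DFT; the direct form is used here only to keep the sketch free of external dependencies.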
For characterizing audio pieces various categories are proposed, e.g. animal noises, bell noises, crowd noises, laughter, machine noises, musical instruments, male speech, female speech, telephone noises or water noises.
A problem in selecting the features used is that, on the one hand, the computational outlay for extracting a feature should be moderate in order to achieve a rapid characterization, while on the other hand the feature should be characteristic of the audio signal, so that two different pieces also yield distinguishable feature values.
A further problem is the robustness of the feature. In the cited concepts robustness criteria are not discussed. Suppose an audio piece is characterized immediately after being produced in the recording studio and is provided with an index which represents the feature vector of the piece and which essentially captures the essence of the piece. The probability is then relatively high that this piece will be recognized again if the same undistorted version of the piece is subjected to the same method, i.e. the same features are extracted and the resulting feature vector is compared in the data bank with a plurality of feature vectors of various pieces.
A problem arises, however, if the audio piece is distorted prior to being characterized, so that the signal to be characterized is no longer identical to the original signal but has the same content. Someone who knows a song will still recognize this song when it is impaired by noise, when it is louder or quieter or when it is played at a different pitch than when it was originally recorded. A further distortion might e.g. arise from lossy data compression, e.g. by means of a coding method according to an MPEG standard such as MP3 or AAC.
If a distortion or data compression also causes the feature to be substantially impaired, the essence of the piece is lost as far as the feature is concerned, even though a person can still recognize the content of the piece.