1. Field of the Invention:
The present invention relates to characterizing audio signals with regard to their content, and particularly to a concept for classifying and indexing audio pieces with respect to their content, so that such multimedia data can be searched.
2. Description of the Related Art:
Over the last years, the availability of multimedia data material, in particular of audio data, has increased significantly. This development is due to a series of technical factors. These technical factors comprise, for example, the broad availability of the Internet, the broad availability of powerful computers as well as the broad availability of efficient methods for data compression, i.e. source encoding, of audio data. One example thereof is MPEG 1/2 layer 3, which is also referred to as MP3.
The huge amounts of audiovisual data that are available worldwide on the Internet require concepts that make it possible to evaluate, catalog or administrate these data according to content criteria. There is a demand to search for and find multimedia data in a targeted way by specifying useful criteria.
This requires the use of so-called “content-based” techniques, which extract so-called features from the audiovisual data, which represent important characteristic content properties of the signal of interest. Based on such features or combinations of such features, similarity relations and commonalities between the audio signals can be derived. This process is generally accomplished by comparing or interrelating the extracted feature values from the different signals, which are also referred to as “pieces” herein.
U.S. Pat. No. 5,918,223 discloses a method for content-based analysis, storage, retrieval and segmentation of audio information. An analysis of audio data generates a set of numerical values, which is also referred to as a feature vector, and which can be used to classify audio pieces and to rank the similarity between individual audio pieces, which are typically stored in a multimedia data bank or on the World Wide Web.
In addition, the analysis enables the description of user-defined classes of audio pieces based on an analysis of a set of audio pieces, which are all members of a user-defined class. The system is able to find individual sound portions within a longer sound piece, so that an audio recording can be automatically segmented into a series of shorter audio segments.
As features for the characterization and classification of audio pieces with regard to their content, the loudness of a piece, the bass content of a piece, the pitch, the brightness, the bandwidth and the so-called Mel-frequency cepstral coefficients (MFCCs) are measured at periodic intervals in the audio piece. The values per block or frame are stored and their first derivative is formed. Thereupon, specific statistical quantities, such as the mean value or the standard deviation, are calculated from every one of these features including their first derivatives, to describe a variation over time. This set of statistical quantities forms the feature vector. The feature vector of the audio piece is stored in a data bank, associated with the original file, where a user can access the data bank to fetch respective audio pieces.
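The construction of such a feature vector can be illustrated by the following minimal sketch. It is not taken from the cited patent: the function names, the use of the RMS level as a stand-in for loudness, the non-overlapping framing and the approximation of the first derivative by a frame-to-frame difference are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def frame_rms(samples, frame_len):
    """Per-frame RMS level (assumed loudness proxy) over non-overlapping frames."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [math.sqrt(sum(x * x for x in f) / frame_len) for f in frames]


def first_difference(track):
    """Discrete approximation of the first derivative of a per-frame track."""
    return [b - a for a, b in zip(track, track[1:])]


def summarize(track):
    """Mean and standard deviation of a track and of its first difference.

    The track needs at least two frames so that the difference is non-empty.
    """
    delta = first_difference(track)
    return [mean(track), pstdev(track), mean(delta), pstdev(delta)]


def feature_vector(tracks):
    """Concatenate the summary statistics of several per-frame feature tracks."""
    vector = []
    for track in tracks:
        vector.extend(summarize(track))
    return vector
```

For example, `feature_vector([frame_rms(samples, 1024)])` yields a four-dimensional vector for a single loudness track; further tracks (pitch, brightness, etc.) would each contribute four more components.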
The data bank system is able to quantify the distance in an n-dimensional space between two n-dimensional vectors. It is further possible to generate classes of audio pieces by specifying a set of audio pieces which belong to a class. Exemplary classes are twittering of birds, rock music, etc. The user can search the audio piece data bank by using specific methods. The result of a search is a list of sound files, which are ordered according to their distance from the specified n-dimensional vector. The user can search the data bank with regard to similarity features, with regard to acoustic or psychoacoustic features, with regard to subjective features or with regard to special sounds, such as buzzing of bees.
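A distance-ordered search of this kind can be sketched as follows. The Euclidean metric, the dictionary used as a data bank and the representation of a class by the mean of its member vectors are assumptions chosen for illustration; the cited patent is not quoted here as prescribing any of them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two n-dimensional feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_distance(query, data_bank):
    """Return (name, distance) pairs ordered by increasing distance to the query."""
    scored = [(name, euclidean(query, vec)) for name, vec in data_bank.items()]
    return sorted(scored, key=lambda pair: pair[1])


def class_centroid(member_vectors):
    """Assumed class model: component-wise mean of the member feature vectors."""
    n = len(member_vectors)
    return [sum(column) / n for column in zip(*member_vectors)]
```

A user-defined class such as "twittering of birds" could then be queried by passing `class_centroid(members)` as the query vector to `rank_by_distance`.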
The expert publication “Multimedia Content Analysis”, Yao Wang et al., IEEE Signal Processing Magazine, November 2000, pp. 12 to 36, discloses a similar concept to characterize multimedia pieces. As features for classifying the content of a multimedia piece, time domain features or frequency domain features are suggested. These comprise the volume, the pitch as fundamental frequency of an audio waveform, spectral features, such as the energy content of a band with regard to the total energy content, cut-off frequencies in the spectral curve, etc. Apart from short-time features, which concern the named quantities per block of samples of the audio signal, long-time quantities are suggested as well, which refer to a longer time interval of the audio piece.
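One of the named spectral features, the energy content of a band with regard to the total energy content, can be sketched for a single block of samples as follows. The naive discrete Fourier transform, the one-sided power spectrum and the parameter names are assumptions made to keep the example self-contained; the cited publication is not quoted here as using this exact formulation.

```python
import cmath


def power_spectrum(frame):
    """One-sided power spectrum of a block via a naive DFT (bins 0 .. n/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def band_energy_ratio(frame, sample_rate, f_lo, f_hi):
    """Energy in the band [f_lo, f_hi) divided by the total energy of the block."""
    spectrum = power_spectrum(frame)
    total = sum(spectrum)
    if total == 0:
        return 0.0  # a silent block carries no energy in any band
    n = len(frame)
    band = sum(p for k, p in enumerate(spectrum)
               if f_lo <= k * sample_rate / n < f_hi)
    return band / total
```

The naive transform has quadratic cost and is only meant for short blocks; a fast Fourier transform would be used in practice.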
Different categories are suggested for the characterization of audio pieces, such as animal sounds, bell sounds, sounds of a crowd, laughter, machine sounds, musical instruments, male voice, female voice, telephone sounds or water sounds.
The problem in selecting the features to be used is that the computational effort for extracting a feature should be moderate in order to obtain a fast characterization, while at the same time the feature should be characteristic of the audio piece, such that two different pieces also have distinguishable features.
Another problem is the robustness of the feature. The named concepts do not address robustness criteria. If an audio piece is characterized immediately after its generation in the sound studio and provided with an index, which represents the feature vector of the piece and, so to speak, forms the essence of the piece, the probability of recognizing this piece is quite high when the same undistorted version of this piece is subjected to the same method, which means the same features are extracted and the feature vector is then compared with a plurality of feature vectors of different pieces in the data bank.
This becomes problematic, however, when an audio piece is distorted prior to its characterization, so that the signal to be characterized is no longer identical to the original signal, but has the same content. A person who knows a song, for example, will recognize this song even when it is noisy, when it is louder or softer or when it is played at a different pitch than originally recorded. Another distortion could, for example, also have been caused by a lossy data compression, such as by an encoding method according to an MPEG standard, such as MP3 or AAC.
If a distortion or data compression strongly affects the feature, this would mean that the essence gets lost, while the content of the piece is still recognizable for a person.
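The sensitivity of a feature to a harmless distortion can be illustrated with a simple volume change. In the following sketch, the RMS level serves as an assumed loudness feature and the gain factor of 0.5 is an arbitrary example value; neither is taken from the cited documents.

```python
import math


def rms(samples):
    """Root-mean-square level, used here as a simple loudness feature."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def apply_gain(samples, gain):
    """Model a louder or softer playback by scaling every sample."""
    return [gain * x for x in samples]


original = [0.5, -0.5, 0.5, -0.5]
quieter = apply_gain(original, 0.5)  # same content, played back softer

# The feature value changes although a listener would recognize the same piece.
feature_shift = abs(rms(original) - rms(quieter))
```

Here the loudness feature drops from 0.5 to 0.25, so a comparison based on this feature alone would treat the softer version as a different piece.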
U.S. Pat. No. 5,510,572 discloses an apparatus for analyzing and harmonizing a tune by using results of a tune analysis. A tune in the form of a sequence of notes, as it is played on a keyboard, is read in and separated into tune segments, wherein a tune segment, i.e. a phrase, comprises, e.g., four bars of the tune. A tonality analysis is performed on every phrase to determine the key of the tune in this phrase. For this purpose, the pitch of a note in the phrase is determined and thereupon, a pitch difference is determined between the currently observed note and the previous note. Further, a pitch difference is determined between the current note and the subsequent note. Based on the pitch differences, a previous coupling coefficient and a subsequent coupling coefficient are determined. The coupling coefficient for the current note results from the previous coupling coefficient, the subsequent coupling coefficient and the note length. This process is repeated for every note of the tune in the phrase, for determining the key of the tune or a candidate for the key of the tune. The key of the phrase is used to control a note type classification means for interpreting the significance of every note in a phrase. The key information, which has been obtained by the tonality analysis, is further used to select a transposing module, which transposes a chord sequence, stored in a data bank in a reference key, into the key determined by the tonality analysis for a considered tune phrase.
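Two of the steps described above, forming the pitch differences to the neighboring notes and transposing a stored chord sequence, can be sketched as follows. The representation of pitches as semitone numbers, the sign convention of the differences, the reduction of chords to their root names and all function names are assumptions made for illustration. The coupling coefficients are deliberately not modeled, since their computation is not specified here.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def neighbor_pitch_differences(pitches):
    """For every note, return (difference to previous, difference to subsequent).

    Pitches are assumed to be semitone numbers; None marks a missing neighbor
    at the start or end of the phrase.
    """
    result = []
    for i, pitch in enumerate(pitches):
        prev_diff = pitch - pitches[i - 1] if i > 0 else None
        next_diff = pitches[i + 1] - pitch if i < len(pitches) - 1 else None
        result.append((prev_diff, next_diff))
    return result


def transpose_chords(chord_roots, reference_key, target_key):
    """Shift chord roots from the reference key into the detected target key."""
    shift = (NOTE_NAMES.index(target_key) - NOTE_NAMES.index(reference_key)) % 12
    return [NOTE_NAMES[(NOTE_NAMES.index(root) + shift) % 12]
            for root in chord_roots]
```

For example, a chord sequence stored with the roots C, F and G in the reference key of C would be moved to D, G and A if the tonality analysis determined the key of D for the phrase.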