1. Field of the Invention
The present invention relates to the extraction of characteristic features from audio signals and to the matching of those features against a database of such features. Part or all of the extracted features can be used to separate the audio into different categories.
2. Background of the Invention
Audio has always been an important medium of information exchange. With the increased adoption of faster internet connections, larger computer storage, and Web 2.0 services, a vast amount of audio information is now available. Because of the nature and sheer volume of audio data, searching for and retrieving relevant audio are very challenging tasks. One common method used to facilitate the process is metadata. For example, the title, genre, and singer of a song can be coded as bit sequences and embedded in the header of the song. One common implementation of metadata for music is the ID3 tag for MP3 songs. However, unlike text information, different pieces of audio have different acoustic characteristics. Traditional metadata tagging does not fully represent the content and character of the audio information, and thus is not by itself adequate for searching and retrieving audio information. Hence, better ways to automatically identify and classify audio information based on its content are needed. Furthermore, the same piece of digital audio information can be represented at different sampling frequencies and coded in different compression standards. Extra distortion can be added when the audio information is transmitted through various communication channels. A good set of audio features should not only describe the unique acoustic characteristics of the audio, but should also be robust enough to overcome such distortions.
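As an illustration of header-embedded metadata of the kind mentioned above, the following minimal sketch reads the fixed 10-byte header that begins an ID3v2 tag. The function name parse_id3v2_header is hypothetical and chosen only for this example; the sketch reads only the tag header, not the title or genre frames that follow it, and it is not part of the described invention.

```python
def parse_id3v2_header(data):
    """Return (major, revision, flags, tag_size) or None if no ID3v2 tag.

    Hypothetical helper for illustration. `data` is the first bytes of a file.
    """
    # An ID3v2 tag starts with the ASCII marker "ID3" and a 10-byte header.
    if len(data) < 10 or data[:3] != b"ID3":
        return None
    major, revision, flags = data[3], data[4], data[5]
    # The size field is a 4-byte "synchsafe" integer: 7 useful bits per byte,
    # with the most significant bit of each byte always zero.
    size = 0
    for byte in data[6:10]:
        if byte & 0x80:
            return None  # not a valid synchsafe byte
        size = (size << 7) | byte
    # `size` counts the tag body only and excludes the 10-byte header.
    return major, revision, flags, size


# Usage: a hand-built header for an ID3v2.3 tag whose body is 257 bytes long.
example = b"ID3" + bytes([3, 0, 0]) + bytes([0, 0, 2, 1])
print(parse_id3v2_header(example))
```

Such a header describes only the container of the metadata. Nothing in it reflects the acoustic content of the audio itself, which is the limitation noted above.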
Further, because of personal tastes, a piece of audio information such as music can be liked by some users and disliked by others. When a personalized service is preferred, it is important to retrieve audio information tailored to each user's interests. To achieve this goal, content-based audio analysis and feature extraction are needed. For example, a website selling music online may wish to provide its users with personalized song recommendations. A web software function can monitor an individual user's music downloading activity and analyze the music according to its acoustic content. A set of features is extracted, and a search is performed against a central database according to those features. As a result, a list of the most relevant music, tailored to the user's taste, can be presented.
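The search step described above can be sketched as a nearest-neighbor lookup over feature vectors. This is a minimal sketch under stated assumptions: the Euclidean distance, the function names, the two-element feature vectors, and the song titles are all hypothetical choices made for illustration, not the matching method of the invention.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_relevant(query, database, top_k=3):
    """Return the titles of the top_k database entries closest to `query`.

    `database` maps a title to its feature vector (assumed layout).
    """
    ranked = sorted(database.items(), key=lambda item: euclidean(query, item[1]))
    return [title for title, _ in ranked[:top_k]]


# Usage with made-up feature vectors.
central_database = {
    "song_a": [0.0, 0.0],
    "song_b": [1.0, 1.0],
    "song_c": [5.0, 5.0],
}
user_features = [0.9, 0.9]  # features extracted from a user's download
print(most_relevant(user_features, central_database, top_k=2))
```

A linear scan of this kind is adequate only for small collections; a large central database would typically need an index structure, which is outside the scope of this sketch.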
The process described above can also be used in the area of intellectual property protection and monitoring. Currently, the popular way to protect copyrighted audio information is to embed watermarks and metadata. However, this approach is intrusive and does not correlate with the content of the audio information. It is also prone to alteration. In comparison, features extracted from the audio content are a more complete representation of the audio information. They can reside in the header of the audio rather than being embedded in the audio stream. Working with a database that holds a collection of previously known features, such a system can monitor and identify illegal distribution of copyrighted material over broadcast networks, P2P networks, and the like.
Most content-based audio feature extraction methods are based on frequency-domain analysis. The Mel-frequency cepstral coefficient (MFCC) method, which was originally used for speech recognition, has been widely used as an important feature for identifying audio signals (MFCC has been explained extensively in the prior art). However, because the MFCC method is based on short-term frequency analysis, the accuracy of identification is greatly decreased when noise, transcoding, and time shifting are present.
The MFCC feature is normally classified as a timbral texture feature. Together with metadata, it can be used in audio classification. Other content-based features that have been used to facilitate audio classification include the spectral centroid, the mean and the zero crossings of the audio samples, and linear prediction coefficients (LPC). The spectral centroid is the centroid of the audio signal spectrum; it measures the spectral brightness of a signal. The mean value of the audio represents the loudness of the audio. The zero-crossing count is the number of times the signal crosses the zero level; it indicates the noisiness of the signal. The traditional features described above each capture only partial information about an audio signal. In order to accurately categorize audio content, more features need to be introduced into the classification process. The more features used, the more accurate and reliable the result will be.
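The three simple features named above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the mean is taken over absolute sample values (an assumption, since the plain mean of a zero-centered signal is close to zero), and the spectral centroid uses a naive discrete Fourier transform that is practical only for short frames.

```python
import cmath
import math


def mean_abs_amplitude(samples):
    """Mean of absolute sample values, used here as a rough loudness proxy."""
    return sum(abs(s) for s in samples) / len(samples)


def zero_crossings(samples):
    """Count sign changes between consecutive samples."""
    count = 0
    for prev, cur in zip(samples, samples[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of the spectrum of one frame."""
    n = len(samples)
    weighted = 0.0
    total = 0.0
    # Naive O(n^2) DFT over the non-negative frequency bins.
    for k in range(n // 2 + 1):
        acc = 0j
        for t, s in enumerate(samples):
            acc += s * cmath.exp(-2j * math.pi * k * t / n)
        magnitude = abs(acc)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total > 0 else 0.0


# Usage: one 64-sample frame of a 1000 Hz sine sampled at 8000 Hz.
rate = 8000
frame = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(64)]
print(mean_abs_amplitude(frame))
print(spectral_centroid(frame, rate))
print(zero_crossings([1, -1, 1, -1]))
```

For a pure tone the centroid sits at the tone frequency; a brighter signal with more high-frequency energy would move it upward, which is the sense in which the centroid measures spectral brightness.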
Thus, there is a need for other methods and systems to improve the accuracy of audio identification and classification. Accordingly, the present invention provides an efficient and robust solution to overcome the existing limitations. Since human perception of audio information is closely related to the beat or rhythm of the audio, the present invention extracts acoustic features at the beat onset.