1. Field of the Invention
The present invention generally concerns real-time audio analysis. More specifically, the present invention concerns machine learning, audio signal processing, and sound object recognition and labeling.
2. Description of the Related Art
Analysis of audio and video data invokes the use of “metadata” that describes different elements of media content. Various fields of production and engineering are becoming increasingly reliant and sophisticated on the use of metadata, including music information retrieval (MIR), audio content identification (finger-printing), automatic (reduced) transcription, summarization (thumb-nailing), source separation (de-mixing), multimedia search engines, media data-mining, and content recommender systems.
In an audio-oriented system using metadata, a source audio signal is typically broken into small “windows” of time (e.g., 10-100 milliseconds in duration). A set of “features” is derived by analyzing the different characteristics of each signal window. The set of raw data-derived features is the “feature vector” for an audio selection. This feature vector can vary from a short single instrument note sample, a two-bar loop, a song, or a complete soundtrack. A raw feature vector typically includes time-domain values (sound amplitude measures) and frequency-domain values (sound spectral content).
The particular set of raw feature vectors derived from any audio analysis may greatly vary from one audio metadata application to another. This variance is often dependent upon, and therefore fixed by, post-processing requirements and the run-time environment of a given application. As the feature vector format and contents in many existing software implementations are fixed, it is difficult to adapt an analysis component for new applications. Furthermore, there are challenges to providing a flexible first-pass feature extractor that can be configured to set up a signal analysis processing phase.
In light of these limitations, some systems perform second-stage “higher-level” feature extraction based on the initial analysis. For example, the second-stage analysis may derive information such as tempo, key, or onset detection as well as feature vector statistics, including derivatives/trajectories, smoothing, running averages, Gaussian mixture models (GMMs), perceptual mapping, bark/sone maps, or result data reduction and pruning. These second-stage analysis functions are generally custom-coded for applications making it equally challenging to develop and configure the second-stage feature vector mapping and reduction processes described above for new applications.
An advanced metadata processing system would add a third stage of numeric/symbolic machine-learning, data-mining, or artificial intelligence modules. Such a processing stage might invoke techniques such as support vector machines (SVMs), artificial neural networks (NNs), clusterers, classifiers, rule-based expert systems, and constraint-satisfaction programming. But while the goal of such a processing operation might be to add symbolic labels to the audio stream, either as a whole (as in determining the instrument name of a single-note audio sample, or the finger-print of a song file), or with time-stamped labels and properties for some manner of events discovered in the stream, it is a challenge to integrate multi-level signal processing tools with symbolic machine-learning-level operations into flexible run-time frameworks for new applications.
Frameworks in the literature generally support only a fixed feature vector and one method of data-mining or application processing. These prior art systems are neither run-time configurable or scriptable nor are they easily integrated with a variety of application run-time environments. Audio metadata systems tend to be narrowly focused on one task or one reasoning component, and there is a challenge to provide configurable media metadata extraction.
There is a need in the art for a flexible and extensible framework that allows developers of multimedia applications or devices to perform signal analysis, object recognition, and labeling of live or stored audio data and map the resulting metadata as control signals or configuration information for a corresponding software or hardware implementation.