1. Technical Field
The invention is related to a signal feature extractor, and in particular, to a system and method for using a “distortion discriminant analysis” of a set of training signals to define parameters of a feature extractor for extracting distortion-robust features from signals having one or more dimensions, such as audio signals, images, or video data.
2. Related Art
There are many existing schemes for extracting features from signals having one or more dimensions, such as audio signals, images, or video data. For example, with respect to a one-dimensional signal such as an audio signal or audio file, audio feature extraction has been used as a necessary step for classification, retrieval, and identification tasks involving the audio signal. For identification, the extracted features are compared against a portion of an audio signal to identify either elements within that audio signal or the audio signal as a whole. Such identification schemes are conventionally known as “audio fingerprinting.”
Conventional schemes for producing features for pattern matching in signals having one or more dimensions typically approach the problem of feature design by handcrafting features in the hope that they will be well suited to a particular identification task. For example, current audio classification, segmentation, and retrieval methods use heuristic features such as the mel cepstra, the zero crossing rate, energy measures, spectral component measures, and derivatives of these quantities. Other signal types make use of other heuristic features that are specific to the particular type of signal being analyzed.
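By way of illustration only, two of the heuristic features named above, the zero crossing rate and a short-time energy measure, can be sketched in a few lines. The function names, the frame length, and the hop size are assumptions made for this sketch and are not taken from any particular conventional scheme.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the samples in one frame."""
    return sum(x * x for x in frame) / len(frame)


def frame_features(signal, frame_len, hop):
    """Slide a window over the signal; return (ZCR, energy) per full frame."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        features.append((zero_crossing_rate(frame), short_time_energy(frame)))
    return features


# Hypothetical toy signal: a rapidly alternating frame followed by a flat one.
toy_signal = [1, -1, 1, -1, 1, 1, 1, 1]
per_frame = frame_features(toy_signal, frame_len=4, hop=4)
```

In this toy signal both frames have the same energy and differ only in zero crossing rate, which suggests why such handcrafted features are usually combined rather than used singly.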
For example, one conventional audio classification scheme provides a hierarchical scheme for audio classification and retrieval based on audio content analysis. The scheme consists of three stages. The first stage, called coarse-level audio segmentation and classification, segments audio recordings and classifies them into speech, music, several types of environmental sounds, and silence, based on morphological and statistical analysis of temporal curves of short-time features of audio signals. In the second stage, environmental sounds are further classified into finer classes such as applause, rain, bird sounds, etc. This fine-level classification is based on time-frequency analysis of audio signals and use of a hidden Markov model (HMM) for classification. In the third stage, query-by-example audio retrieval is implemented, whereby similar sounds can be found according to an input audio sample.
Another conventional scheme approaches audio content analysis in the context of video structure parsing. This scheme uses a two-stage audio segmentation and classification process that divides an audio stream into speech, music, environmental sounds, and silence. These basic classes form the basic data set for video structure extraction. The two-stage algorithm identifies and extracts audio features as follows. The first stage of the classification separates speech from non-speech, based on simple features such as high zero-crossing rate ratio, low short-time energy ratio, spectrum flux, and Linear Spectral Pairs (LSP) distance. The second stage of the classification further segments the non-speech class into music, environmental sounds, and silence with a rule-based classification scheme.
Still another conventional scheme provides an audio search engine that can retrieve sound files from a large corpus based on similarity to a query sound. With this scheme, sounds are characterized by “templates” derived from a tree-based vector quantizer trained to maximize mutual information (MMI). Audio similarity is measured by simply comparing templates. The basic operation of the retrieval system involves first accumulating a suitable corpus of audio examples and parameterizing them into feature vectors. The corpus must contain examples of the kinds (classes) of audio to be discriminated, e.g., speech and music, or male and female talkers. Next, a tree-based quantizer is constructed using a “supervised” operation which requires the training data to be manually labeled, i.e., each training example must be associated with a class. The tree automatically partitions the feature space into regions (“cells”) which have maximally different class populations. To generate an audio template for subsequent retrieval, parameterized data is quantized using the tree. To retrieve audio by similarity, a template is constructed for the query audio. Comparing the query template with corpus templates yields a similarity measure for each audio file in the corpus. The audio files can then be sorted by these similarity measures and the results presented as a ranked list.
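The retrieval flow just described (quantize feature vectors into cells, build a template per audio file, compare the query template with the corpus templates, and rank) can be sketched as follows. This is a simplified illustration under stated assumptions: a nearest-centroid quantizer stands in for the MMI-trained tree, a normalized cell histogram serves as the template, and cosine similarity is used for comparison. None of these choices, and none of the function names, is taken from the scheme itself.

```python
import math
from collections import Counter


def quantize(vector, centroids):
    """Index of the nearest centroid (stand-in for a tree cell; assumption)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((v - c) ** 2 for v, c in zip(vector, centroids[i])),
    )


def make_template(vectors, centroids):
    """Normalized histogram of the cells occupied by a file's feature vectors."""
    counts = Counter(quantize(v, centroids) for v in vectors)
    return [counts[i] / len(vectors) for i in range(len(centroids))]


def cosine_similarity(a, b):
    """Cosine of the angle between two templates (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_corpus(query_vectors, corpus, centroids):
    """Return (name, similarity) pairs sorted from most to least similar."""
    query_template = make_template(query_vectors, centroids)
    scored = [
        (name, cosine_similarity(query_template, make_template(vecs, centroids)))
        for name, vecs in corpus.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical two-cell quantizer and two-file corpus.
centroids = [[0.0, 0.0], [10.0, 10.0]]
corpus = {
    "file_a": [[0.0, 1.0], [1.0, 0.0]],
    "file_b": [[9.0, 9.0], [10.0, 11.0]],
}
ranking = rank_corpus([[0.5, 0.5]], corpus, centroids)
```

Because the quantizer in the actual scheme is trained on labeled data, the cells there are chosen to separate classes; the fixed centroids above carry no such guarantee and only illustrate the template comparison and ranking steps.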
Another approach to feature extraction has been applied in the area of speech recognition and speech processing. For example, one conventional scheme provides a method for decomposing a conventional LPC-cepstrum feature space into subspaces which carry information about linguistic and speaker variability. In particular, this scheme uses oriented principal component analysis (OPCA) to estimate a subspace which is relatively speaker independent.
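As general background, OPCA seeks directions along which the variance of a “signal” is large relative to the variance of a “noise,” which leads to a generalized eigenvalue problem C n = λ R n, where C is the signal covariance matrix and R is the noise covariance matrix. The following two-dimensional sketch solves that problem for the leading direction by power iteration on R⁻¹C. The matrices, the function name, and the iteration count are assumptions chosen for illustration; how the cited scheme assigns linguistic and speaker variability to “signal” and “noise” is not reproduced here.

```python
import math


def opca_top_direction(C, R, iterations=50):
    """Leading solution of C n = lambda R n for 2x2 matrices (sketch only).

    C: signal covariance, R: noise covariance (R must be invertible).
    Uses power iteration on inverse(R) * C.
    """
    (a, b), (c, d) = R
    det = a * d - b * c
    R_inv = [[d / det, -b / det], [-c / det, a / det]]
    # M = inverse(R) * C
    M = [
        [sum(R_inv[i][k] * C[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]
    v = [1.0, 1.0]
    for _ in range(iterations):
        w = [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return v


# Hypothetical covariances: the signal varies most along the first axis,
# but the noise along that axis is larger still.
signal_cov = [[4.0, 0.0], [0.0, 1.0]]
noise_cov = [[16.0, 0.0], [0.0, 0.25]]
direction = opca_top_direction(signal_cov, noise_cov)
```

In this example ordinary principal component analysis of the signal alone would favor the first axis, where the signal variance is largest, whereas the oriented criterion favors the second axis, where the ratio of signal variance to noise variance is largest.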
A related OPCA technique builds on the previous scheme by using OPCA for generating speaker identification or verification models using speaker information carried in the speech signal. This scheme is based on a three-step modeling approach. In particular, this scheme first extracts a number of speaker-independent feature vectors which include linguistic information from a target speaker. Next, a set of speaker-dependent feature vectors which include both linguistic and speaker information is extracted from the target speaker. Finally, a functional mapping between the speaker-independent and the speaker-dependent features is computed for transforming the speaker-independent features into speaker-dependent features to be used for speaker identification.
However, while the aforementioned schemes are useful, they have limitations. For example, a feature extractor system designed with heuristic features such as those discussed above is typically not optimal across multiple types of distortion or noise in a signal. In fact, features other than those selected or extracted often give better performance, or are more robust to particular types of noise or distortion. Further, the OPCA-based schemes do not effectively address noise or distortions that extend over wide temporal or spatial windows in the signal being analyzed.
Therefore, what is needed is a system and method for extracting features from a set of representative training data such that the features extracted will be robust to both distortion and noise when used for feature classification, retrieval, or identification tasks involving an input signal.