In data processing, feature extraction is a special form of dimensionality reduction that effectively represents the interest points in a feature vector. When the input data to a process (algorithm) is too large to be processed and it includes some redundancy, the input data may be transformed into a reduced representation set of features, i.e., a feature vector. If done properly, the features set includes the relevant information from the input data to perform the desired task using the reduced representation by the feature vectors. Feature extraction techniques simplify the amount of data required to describe a large set of data accurately.
Feature extraction has been widely used in image processing and optical character recognition which use different algorithms to detect and isolate various features of a dataset, for example, digitized image, video stream, speech, or text. In a typical pattern recognition problem, feature extraction is followed by classification which is a method of identifying to which of a set of categories (classes) a new observation belongs, on the basis of a training set of data containing observations (or instances) for which their category is already known.
In the context of machine learning, supervised classification is considered an instance of learning, where a training set of correctly identified observations (features) is available (training set). On the other hand, the unsupervised procedures such as clustering, comprise of grouping data into categories based on some measure of inherent similarity. For example, the distance between instances, considered as vectors in a multi-dimensional vector space. In machine learning, the observations are often known as instances, the explanatory variables are termed features and the possible categories to be predicted are classes.
A common subclass of classification is probabilistic classification. Algorithms of this nature use statistical inference to find the best class for a given instance. Unlike other algorithms, which simply output a best class, probabilistic algorithms output a probability of the instance being a member of each of the possible classes. Typically, the best class is then selected as the one with the highest probability.
Classification and clustering are examples of the more general problem of pattern recognition, which is the assignment of an output value to a given input value. Other examples are regression, which assigns a real-valued output to each input; or sequence labeling, which assigns a class to each member of a sequence of values, for example, part of speech tagging, which assigns a part of speech to each word in an input sentence.
A training set is a set of data used to discover potentially predictive relationships. In the machine learning field, a training set includes an input feature vector and a corresponding label vector, which are used together with a supervised learning method to train a knowledge database. In statistical modeling, a training set is used to fit a model that can be used to predict a “response value” from one or more “predictors.” The fitting can include both variable selection and parameter estimation. Statistical models used for prediction are often called regression models, of which linear regression and logistic regression are two examples.
Typically, emphasis is placed on avoiding overfitting of the training set, so as to achieve the best possible performance on an independent test set that approximately follows the same probability distribution as the training set. A training set is often used in conjunction with a test set, which is a set of data used to assess the strength and utility of a predictive relationship that is using the learned parameters to predict and compare the accuracy.
Measuring the distance between high-dimensional feature vectors arises in several computer vision tasks and other similar pattern recognition applications. Mahalanobis distance appears naturally in classification tasks where we want samples from the same class to be close and samples from different class to be well separated. Multiple approaches have been proposed for metric learning within this framework, including formulations based on maximizing absolute or relative distances between dissimilar data points, nearest neighbors, largest margin nearest neighbors, and Within Class Covariance Normalization (WCCN). These methods depend on a full rank non-diagonal matrix and computationally intense operations such as matrix inverse, and thus, become intractable for high dimensional data.
In order to avoid the typical complexities of dimensionality, several feature selection techniques have been proposed. The most popular ones include unsupervised techniques such as Principal Component Analysis (PCA) which aims to find a linear feature subspace that minimizes the reconstruction error. Likewise, the supervised dimension reduction techniques, such as Fisher Linear Discriminant Analysis (FLDA) that finds subspace that maximizes inter-class separation and minimizes intra-class separation. Dimensionality Reduction makes some directions degenerate, hence omitting those directions. On the other hand, in feature transformation techniques such as WCCN, the feature space remains same but the transformation results in suppressing (or strengthening) of some directions corresponding to an optimal distance criterion.
Accordingly, there is a need for a computationally efficient feature transformation technique that is less complex, computationally feasible and scales linearly with dimensionality of the vectors.