The present invention relates to the field of machine learning, and specifically to the process of feature extraction.
Feature extraction is a step in machine learning analysis, pattern recognition, image processing, and the like. A feature is a property, usually numeric, of an item in a cohort, such as a property of a subject in a cohort, for example year-of-birth, height, gender, etc. A feature extraction process may start with an initial set of “raw” data (e.g. a database of medical health records, measured data, and/or the like) and derive vectors/sets/matrices of values, called features, to be used by subsequent machine learning analysis steps and the like. Feature extraction may be intended to reduce the data to an informative and non-redundant representation, facilitating the subsequent learning and generalization steps and image processing methods, and in some cases leading to better human interpretation. Constructing and identifying features for a specific machine learning application (e.g. predicting the onset of congestive heart failure) using domain knowledge may be termed feature engineering.
For example, when the input data is collected over time, with a potentially different number of measured attributes (e.g. lab tests) for each sample (e.g. patient), it may be transformed into a fixed-length set of features. This example process is called feature extraction. The extracted features may contain the relevant information from the input data, so that the desired task may be performed using a reduced representation instead of the complete input data, which may be large in some cases.
Feature extraction may be performed for a set of items for analysis, called input data or a “cohort”. For example, a cohort may contain patients that appear in a medical database, a file, a storage medium, and/or the like. As used herein the term storage medium or medium means any form of non-transitory computer-readable storage medium, such as an internal hard disk, an external hard disk, a remote external hard disk, a network attached hard disk, a network-based cloud storage medium, and/or the like. Machine learning analysis may involve several, possibly overlapping cohorts. Examples of such cohorts: cohort 1—all patients; cohort 2—all patients having at least one diagnosis of diabetes; cohort 3—all patients having at least one diagnosis of diabetes excluding pregnant women; and the like.
The output of a feature extraction process is a feature matrix, a concatenation of feature vectors for cohort items.
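As a minimal illustrative sketch, the transformation of variable-length input data into fixed-length feature vectors, and their concatenation into a feature matrix, might look as follows. The record layout, the patient identifiers, and the chosen features (count, mean, minimum, maximum) are hypothetical and serve only to make the example concrete; they are not taken from any particular system.

```python
from statistics import mean

# Hypothetical raw data: each cohort item (patient) has a variable number
# of (day, lab_value) measurements collected over time.
raw_records = {
    "patient_a": [(1, 5.0), (10, 7.0), (20, 6.0)],
    "patient_b": [(3, 4.0)],
    "patient_c": [],
}


def extract_features(measurements):
    """Map a variable-length list of measurements to a fixed-length vector.

    The four features (count, mean, min, max) are illustrative assumptions.
    Missing values are represented by None when there are no measurements.
    """
    values = [value for _, value in measurements]
    if not values:
        return [0, None, None, None]
    return [len(values), mean(values), min(values), max(values)]


def build_feature_matrix(records, cohort):
    """Concatenate the feature vectors of all cohort items, one row per item."""
    return [extract_features(records[item]) for item in cohort]


cohort = ["patient_a", "patient_b", "patient_c"]
feature_matrix = build_feature_matrix(raw_records, cohort)
```

Each row of `feature_matrix` has the same length regardless of how many measurements the corresponding item had, which is the property that subsequent analysis steps typically rely on.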
Feature extraction, and feature engineering in particular, may involve multiple iterations of exploring, modifying, and tuning the set of computed features. For example, in a machine learning analysis to build an accurate prediction model, say for the onset of congestive heart failure, feature extraction may be run multiple times to identify the features that result in more accurate predictions.
Machine learning analysis is often conducted in two phases: a train phase and an apply phase. For example, during a train phase, a prediction model may be fit to the data, and during an apply phase this model may be used for making predictions from new data.
Data analysis may involve investigating the complete cohort or various sub-cohorts. For example, a common technique in machine learning for assessing the accuracy or robustness of a method is to apply the machine learning algorithm multiple times to sub-cohorts derived from the original cohort using sampling. In k-fold cross-validation analysis, for instance, the analysis is applied k times on k different partitions of the original cohort into train and test cohorts. In bootstrapping, the analysis is repeated several times on different samplings of the data, possibly with repetitions. In both cases, the analyzed cohorts are composed of items of the original cohort.
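The derivation of sub-cohorts by k-fold partitioning and by bootstrapping may be sketched as follows. This is an illustrative example using only the Python standard library; the function names `k_fold_partitions` and `bootstrap_samples` and the fixed random seed are assumptions made for the example.

```python
import random


def k_fold_partitions(cohort, k, seed=0):
    """Yield k (train, test) partitions of the cohort.

    Every item appears in exactly one test cohort across the k partitions.
    """
    items = list(cohort)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, test


def bootstrap_samples(cohort, n_samples, seed=0):
    """Yield n_samples sub-cohorts drawn with replacement from the cohort."""
    items = list(cohort)
    rng = random.Random(seed)
    for _ in range(n_samples):
        # Sampling with replacement: an item may appear more than once.
        yield rng.choices(items, k=len(items))


original_cohort = list(range(10))
folds = list(k_fold_partitions(original_cohort, k=5))
samples = list(bootstrap_samples(original_cohort, n_samples=3))
```

In both functions, every generated sub-cohort consists solely of items from the original cohort, so the same items are analyzed repeatedly across runs.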
Memoization is an optimization technique for speeding up computer programs. Briefly, it stores the results of time-consuming function calls and returns the cached results when the function is called again with identical inputs. A naïve memoization that, for every new input, runs the function and subsequently caches its output may be inefficient in terms of time and space. For example, consider a symmetric function f(x), i.e. f(x)=f(−x). A memoization that computes f for both x and −x, and stores the two identical values f(x) and f(−x), is inefficient in terms of time and space. A more efficient memoization makes use of f's symmetry: for negative x, it computes or looks up f(−x) instead, so that a single computation and a single cached value serve both x and −x. Thus, utilizing properties of the memoized function may improve memoization efficiency.
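A memoization that exploits the symmetry of f may be sketched as follows. The decorator name `memoize_symmetric`, the example function `slow_square`, and the call counter are hypothetical and used only for illustration. The sketch assumes that f is numeric and satisfies f(x)=f(−x) for every input.

```python
def memoize_symmetric(f):
    """Memoize a function assumed to satisfy f(x) == f(-x).

    The cache is keyed on abs(x), so x and -x share a single entry and
    f is evaluated at most once per pair.
    """
    cache = {}

    def wrapper(x):
        key = abs(x)
        if key not in cache:
            # By the symmetry assumption, f(key) equals f(x).
            cache[key] = f(key)
        return cache[key]

    wrapper.cache = cache  # exposed for inspection in this example
    return wrapper


calls = {"count": 0}  # counts how many times the underlying function runs


@memoize_symmetric
def slow_square(x):
    calls["count"] += 1
    return x * x


result_positive = slow_square(3)
result_negative = slow_square(-3)  # served from the cache entry for 3
```

With a naïve memoization keyed on x itself, the two calls above would run the function twice and store two identical values. Here the function runs once and one value is stored.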
The foregoing examples of the related art and limitations related therewith are intended to be illustrative and not exclusive. Other limitations of the related art will become apparent to those of skill in the art upon a reading of the specification and a study of the figures.