In machine learning technologies, historic data is often used to train a machine so that it can predict an event on the basis of a recent set of data. In many cases, a huge amount of data is available and is used to train the machine as well as possible. However, using such a long training history may require a large amount of data storage and processing power. Some other predictive technologies directly use a data history of determined events and measured values. Then the whole stored data history available may be used: when a prediction has to be made on the basis of a recent set of data, the whole data history is processed to find similarities between the recent data and the data history, which again requires considerable processing power. There is therefore a need to reduce the amount of historic data to be stored while maintaining its predictive benefits.
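The history-based prediction described above may be sketched as a nearest-neighbour lookup over the full stored history. The following minimal sketch uses hypothetical measurement vectors and outcomes (all data and variable names are illustrative assumptions, not part of any disclosed method); note that every stored vector must be scanned for each prediction, which illustrates the processing burden.

```python
import numpy as np

# Hypothetical stored history: one measurement vector per row, with the
# event outcome that was observed for each vector.
history = np.array([
    [0.9, 1.1, 0.2],
    [1.0, 0.9, 0.1],
    [3.1, 2.9, 4.0],
    [3.0, 3.2, 4.1],
])
outcomes = np.array([0, 0, 1, 1])  # e.g. 1 = "component breaks down"

# Recent set of data for which a prediction has to be made.
recent = np.array([2.9, 3.0, 3.9])

# Scan the whole history for the most similar stored vector (Euclidean
# distance) and predict the outcome recorded for that vector.
distances = np.linalg.norm(history - recent, axis=1)
prediction = outcomes[np.argmin(distances)]
print(prediction)  # 1
```

Because the scan is linear in the size of the history, storage and processing cost grow with every stored record, which motivates reducing the history while keeping its predictive value.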
One approach may be to discard data of variables (e.g. measurements of specific sensors) that are less relevant for predicting the event. This is known as feature selection in traditional machine learning, and many methods exist to accomplish it. Some examples are the following: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Random Forest, and the Least Absolute Shrinkage and Selection Operator (LASSO). The first two methods have in common that they treat the variance as the quantity of interest. LASSO focuses on minimizing the sum of squared errors, which is similar in flavor to the variance. Random Forest looks at the performance loss caused by randomly permuting the data of a variable.
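The variance-driven selection performed by PCA-style methods can be illustrated as follows: rank the variables (columns) of a data matrix by the magnitude of their loading on the first principal component and keep the top-k. This is a minimal sketch with synthetic sensor data; the data, the choice of k, and the ranking rule are illustrative assumptions, not a description of any particular prior-art implementation.

```python
import numpy as np

# Synthetic sensor history: two high-variance, correlated sensors and one
# near-constant sensor (hypothetical data for illustration only).
rng = np.random.default_rng(0)
n = 200
informative = rng.normal(size=n) * 5.0          # high-variance sensor
correlated = informative + rng.normal(size=n)   # correlated copy
noise = rng.normal(size=n) * 0.1                # near-constant sensor
X = np.column_stack([informative, correlated, noise])

# Eigendecomposition of the covariance matrix; the eigenvector with the
# largest eigenvalue is the first principal component.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
pc1 = eigvecs[:, np.argmax(eigvals)]

# Keep the k variables with the largest absolute loading on that component.
k = 2
selected = np.argsort(-np.abs(pc1))[:k]
print(sorted(selected.tolist()))  # the two high-variance sensors
```

As the sketch shows, such a criterion retains whatever carries the most variance, regardless of whether that variance is informative about the event to be predicted, which is the weakness discussed below.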
It has been shown that the above feature selection solutions are suboptimal and that there is still room for improvement, in particular when one wants to select a specific set of features/variables to be used to predict a specific event.
US2007/0122041A1 discloses a computer-implemented method that maximizes candidate solutions to a cardinality-constrained combinatorial optimization problem of sparse linear discriminant analysis, and implements the above-discussed PCA, LDA and derived methods, which are all based on maximizing the variance of the remaining data set using correlation measurements. As the variance is a second-order statistic, it does not take the full information content of the variables into account.
Above it has been mentioned that the predictive technologies are for predicting “an event”. It has to be noted that the term “event” must be read broadly. An “event” may represent a characteristic of a physical entity, for example, “a component of a machine is going to break down or not” or “the energy consumption of the factory is going to be too high or not”. These examples relate to a binary prediction: “something is true or not”. However, the above discussed predictive technologies are not limited to binary characteristics and may also predict characteristics that take a value in a higher-base numeral system, for example, “the energy consumption of this city is going to be low, medium, or high”. The above discussed predictive technologies may also apply to regression use cases in which a scalar value is obtained on the basis of historical data. The above interpretation also applies to the remainder of this document.