Predictive modeling refers to generating a model from a given set of data records containing both input parameters and output parameters, and then using the model to predict the output parameters that correspond to actual input parameters. Predictive modeling techniques are often used to build relationships among variables by using data records collected through experimentation, simulation, physical measurement, or other techniques. Predictive models may be built from data by using various methods for many different families of models, such as decision trees, decision lists, linear equations, and neural networks.
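By way of illustration only, the build-then-predict workflow described above may be sketched for one of the listed model families, a linear equation. The sketch assumes NumPy is available; the function names `fit_linear_model` and `predict` and the sample data records are hypothetical and are not taken from any referenced system.

```python
import numpy as np

def fit_linear_model(inputs, outputs):
    """Fit output = b0 + b1*x1 + ... + bn*xn by ordinary least squares."""
    # Prepend a column of ones so the first coefficient acts as the intercept.
    design = np.column_stack([np.ones(len(inputs)), inputs])
    coeffs, *_ = np.linalg.lstsq(design, outputs, rcond=None)
    return coeffs

def predict(coeffs, inputs):
    """Apply a fitted linear model to new input parameters."""
    design = np.column_stack([np.ones(len(inputs)), inputs])
    return design @ coeffs

# Hypothetical training data records: two input parameters, one output parameter.
train_inputs = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
train_outputs = 1.0 + 2.0 * train_inputs[:, 0] + 3.0 * train_inputs[:, 1]

# Build the model from the training records, then predict for new inputs.
coeffs = fit_linear_model(train_inputs, train_outputs)
new_outputs = predict(coeffs, np.array([[3.0, 2.0]]))
```

Because the sample outputs were generated from an exact linear relationship, the fitted coefficients recover that relationship; data records obtained by measurement would generally yield only an approximate fit.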
The data records used to build a model are known as training data records. In certain situations, the number of data records may be limited by the number of systems that can be used to generate them. In these situations, the number of variables may be greater than the number of available data records, which creates so-called sparse data scenarios. In certain other situations, the training data records may be unable to cover the entire input space of the input parameters, or the training data records may be discrete, such that no single predictive model can represent a uniform relationship between input parameters and output parameters across the entire input space and/or output space. In certain further situations, the training data records may include variables with missing or erroneous values.
Techniques exist for determining and discarding variables with missing or erroneous data. For example, U.S. Pat. No. 6,519,591 (the '591 patent), issued on Feb. 11, 2003 to Cereghini et al., discloses a method for performing cluster analysis inside a relational database management system. The method defines a plurality of tables for the storage of data points and Gaussian mixture parameters and executes a series of SQL statements implementing an Expectation-Maximization clustering algorithm. A distance-based clustering approach identifies those regions in which points are close to each other according to some distance function. The squared Mahalanobis distance function is the basis for implementing Expectation-Maximization clustering in SQL. One stated advantage is that Expectation-Maximization clustering is robust to noisy data and missing information.
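For reference, the squared Mahalanobis distance between a data point x and a cluster with mean m and covariance matrix S is (x - m)^T S^-1 (x - m). The following is a minimal sketch of that distance function only, assuming NumPy; it is not the SQL implementation of the '591 patent, and the function name `squared_mahalanobis` and the sample values are hypothetical.

```python
import numpy as np

def squared_mahalanobis(x, mean, cov):
    """Return (x - mean)^T cov^-1 (x - mean) for a single data point."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    # Solve cov * y = diff rather than forming the explicit inverse of cov.
    return float(diff @ np.linalg.solve(cov, diff))

# Hypothetical cluster parameters: zero mean, diagonal covariance.
mean = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0], [0.0, 1.0]])

# Distance of one data point from the cluster.
d2 = squared_mahalanobis([2.0, 1.0], mean, cov)
```

With a diagonal covariance matrix, the distance reduces to a sum of squared deviations, each scaled by the variance of its variable, so that a deviation along a high-variance variable contributes less than the same deviation along a low-variance variable.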
However, while the system and method of the '591 patent are useful for implementing a distance-based clustering algorithm, and may be robust to noisy data and missing information, the method of the '591 patent cannot repair abnormal input variables. The method also cannot replace a missing data point in an input variable, a capability that would increase the number of input variables that may be used to generate a virtual sensor process model.
The disclosed embodiments are directed to improvements in the existing technology.