1. Field of the Invention
The present invention relates generally to the field of predictive system models. More particularly, the present invention relates to preprocessing of input data so as to correct for different time scales, transforms, missing or bad data, and/or time-delays prior to input to a support vector machine for either training of the support vector machine or operation of the support vector machine.
2. Description of the Related Art
Many predictive systems may be characterized by the use of an internal model which represents a process or system for which predictions are made. Predictive model types may be linear, non-linear, stochastic, or analytical, among others. However, for complex phenomena non-linear models may generally be preferred due to their ability to capture non-linear dependencies among various attributes of the phenomena. Examples of non-linear models may include neural networks and support vector machines (SVMs).
Generally, a model is trained with training data, e.g., historical data, in order to reflect salient attributes and behaviors of the phenomena being modeled. In the training process, sets of training data may be provided as inputs to the model, and the model output may be compared to corresponding sets of desired outputs. The resulting error is often used to adjust weights or coefficients in the model until the model generates the correct output (within some error margin) for each set of training data. The model is considered to be in “training mode” during this process. After training, the model may receive real-world data as inputs, and provide predictive output information which may be used to control the process or system or make decisions regarding the modeled phenomena. It is desirable to allow for pre-processing of input data of predictive models (e.g., non-linear models, including neural networks and support vector machines), particularly in the field of e-commerce.
Predictive models may be used for analysis, control, and decision making in many areas, including electronic commerce (i.e., e-commerce), e-marketplaces, financial (e.g., stocks and/or bonds) markets and systems, data analysis, data mining, process measurement, optimization (e.g., optimized decision making, real-time optimization), quality control, as well as any other field or domain where predictive or classification models may be useful and where the object being modeled may be expressed abstractly. For example, quality control in commerce is increasingly important. The control and reproducibility of quality is the focus of many efforts. For example, in Europe, quality is the focus of the ISO (International Organization for Standardization, Geneva, Switzerland) 9000 standards. These rigorous standards provide for quality assurance in production, installation, final inspection, and testing of processes. They also provide guidelines for quality assurance between a supplier and customer.
A common problem that is encountered in training support vector machines for prediction, forecasting, pattern recognition, sensor validation and/or processing problems is that some of the training/testing patterns may be missing, corrupted, and/or incomplete. Prior systems merely discarded data with the result that some areas of the input space may not have been covered during training of the support vector machine. For example, if the support vector machine is utilized to learn the behavior of a chemical plant as a function of the historical sensor and control settings, these sensor readings are typically sampled electronically, entered by hand from gauge readings, and/or entered by hand from laboratory results. It is a common occurrence in real-world problems that some or all of these readings may be missing at a given time. It is also common that the various values may be sampled on different time intervals. Additionally, any one value may be “bad” in the sense that after the value is entered, it may be determined by some method that a data item was, in fact, incorrect. Hence, if a given set of data has missing values, and that given set of data is plotted in a table, the result may be a partially filled-in table with intermittent missing data or “holes”. These “holes” may correspond to “bad” data or “missing” data.
Conventional support vector machine training and testing methods require complete patterns, and must therefore discard patterns with missing or bad data. The deletion of the bad data in this manner is an inefficient method for training a support vector machine. For example, suppose that a support vector machine has ten inputs and ten outputs, and also suppose that one of the inputs or outputs happens to be missing at the desired time for fifty percent or more of the training patterns. Conventional methods would discard these patterns, leading to no training for those patterns during the training mode and no reliable predicted output during the run mode. The predicted output corresponding to the areas of the input space covered by the discarded patterns may be somewhat ambiguous and/or erroneous. In some situations, there may be as much as a 50% reduction in the overall data after screening bad or missing data. Additionally, experimental results have shown that support vector machine testing performance generally increases with more training data; therefore, throwing away bad or incomplete data may decrease the overall performance of the support vector machine.
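The cost of the conventional approach may be illustrated with a minimal sketch. The data values, the use of `None` to mark a missing or bad reading, and the function name `discard_incomplete` are assumptions made for illustration only and do not come from any particular system.

```python
# Hypothetical training patterns: each row is one pattern, and None marks
# a missing or bad value (an assumed convention for this sketch).
patterns = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [0.9, 2.1, None],
    [1.2, 2.2, 3.2],
]


def discard_incomplete(rows):
    """Keep only the rows that contain no missing values."""
    return [row for row in rows if all(value is not None for value in row)]


complete = discard_incomplete(patterns)
fraction_lost = 1 - len(complete) / len(patterns)
print(f"kept {len(complete)} of {len(patterns)} patterns; "
      f"fraction discarded = {fraction_lost}")
```

In this example a single missing value is enough to remove an entire pattern, so the valid readings in the discarded rows are lost along with the bad ones.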
Another common issue concerning input data for support vector machines relates to situations when the data are retrieved on different time scales. As used herein, the term “time scale” is meant to refer to any aspect of the time-dependency of data. As is well known in the art, input data to a support vector machine are generally required to share the same time scale to be useful. This constraint applies to data sets used to train a support vector machine, i.e., input to the SVM in training mode, and to data sets used as input for run-time operation of a support vector machine, i.e., input to the SVM in run-time mode. Additionally, the time scale of the training data generally must be the same as that of the run-time input data to ensure that the SVM behavior in run-time mode corresponds to the trained behavior learned in training mode.
In one example of input data (for training and/or operation) with differing time scales, one set of data may be taken on an hourly basis and another set of data taken on a quarter hour (i.e., every fifteen minutes) basis. In this case, for three out of every four data records on the quarter hour basis there will be no corresponding data from the hourly set. Thus, the two data sets are differently synchronous, i.e., have different time scales.
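The mismatch described above may be sketched as follows. The timestamps, the sample values, and the helper `time_merge` are hypothetical; the sketch simply lines up a quarter-hour series against an hourly series and counts the records that have no hourly counterpart.

```python
from datetime import datetime, timedelta

start = datetime(2000, 1, 1, 0, 0)

# Hypothetical series keyed by timestamp: two hours of data on each basis.
quarter_hourly = {start + timedelta(minutes=15 * i): float(i) for i in range(8)}
hourly = {start + timedelta(hours=i): 100.0 + i for i in range(2)}


def time_merge(fine, coarse):
    """Build one row per fine-grained timestamp; None marks a missing value."""
    return [(t, fine[t], coarse.get(t)) for t in sorted(fine)]


merged = time_merge(quarter_hourly, hourly)
holes = sum(1 for _, _, coarse_value in merged if coarse_value is None)
print(f"{holes} of {len(merged)} rows have no hourly value")
```

Three out of every four merged rows lack an hourly value, which matches the proportion stated in the text.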
As another example of different time scales for input data sets, in one data set the data sample periods may be non-periodic, producing asynchronous data, while another data set may be periodic or synchronous, e.g., hourly. These two data sets may not be useful together as input to the SVM while their time-dependencies, i.e., their time scales, differ. In another example of data sets with differing time scales, one data set may have a “hole” in the data, as described above, compared to another set, i.e., some data may be missing on one of the data sets. The presence of the hole may be considered to be an asynchronous or anomalous time interval in the data set, and thus may be considered to have an asynchronous or inhomogeneous time scale.
In yet another example of different time scales for input data sets, two data sets may have two different respective time scales, e.g., an hourly basis and a 15 minute basis. The desired time scale for input data to the SVM may have a third basis, e.g., daily.
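One way to bring two such data sets onto a common daily basis is to aggregate each of them by calendar day. The sketch below assumes that a simple daily mean is an acceptable aggregate, which the text does not specify; the function name `daily_mean` and the sample values are likewise hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_mean(series):
    """Average all samples that fall on the same calendar day."""
    buckets = defaultdict(list)
    for timestamp, value in series.items():
        buckets[timestamp.date()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


start = datetime(2000, 1, 1)

# Hypothetical inputs covering the same two days on different time scales.
hourly = {start + timedelta(hours=i): 1.0 for i in range(48)}
quarter_hourly = {start + timedelta(minutes=15 * i): 2.0 for i in range(192)}

daily_from_hourly = daily_mean(hourly)
daily_from_quarter = daily_mean(quarter_hourly)

# After aggregation both data sets share the same daily timestamps.
print(list(daily_from_hourly) == list(daily_from_quarter))
```

Other aggregates (e.g., the last value of the day, or a sum) may be more appropriate depending on what the variable represents.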
While the issues above have been described with respect to time-dependent data, i.e., where the independent variable of the data is time, t, these same issues may arise with different independent variables. In other words, instead of data being dependent upon time, e.g., D(t), the data may be dependent upon some other variable, e.g., D(x).
In addition to data retrieved over different time periods, data may also be taken on different machines in different locations with different operating systems and quite different data formats. It is essential to be able to read all of these different data formats, keeping track of the data values and the timestamps of the data, and to store both the data values and the timestamps for future use. It is a formidable task to retrieve these data, keeping track of the timestamp information, and to read them into an internal data format (e.g., a spreadsheet) so that the data may be time merged.
Inherent delays in a system are another issue which may affect the use of time-dependent data. For example, in a chemical processing system, a flow meter output may provide data at time t0 at a given value. However, a given change in flow resulting in a different reading on the flow meter may not affect the output until a predetermined delay τ has elapsed. In order to predict the output, this flow meter output must be input to the support vector machine at a delay equal to τ. This must also be accounted for in the training of the support vector machine. Thus, the timeline of the data must be reconciled with the timeline of the process. In generating data that account for time delays, it has been postulated that it may be possible to generate a table of data that comprises both original data and delayed data. This may necessitate a significant amount of storage in order to store all of the delayed data and all of the original data, wherein only the delayed data are utilized. Further, in order to change the value of the delay, an entirely new set of input data must be generated from the original set.
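The alignment described above may be sketched as follows, assuming uniformly sampled data and a delay τ expressed as a whole number of samples. The function `delayed_pairs` and the sample values are hypothetical. The sketch pairs the flow reading at time t − τ with the output at time t by index offset, rather than by building a stored table of delayed data.

```python
def delayed_pairs(flow, output, tau):
    """Pair flow[t - tau] with output[t] for every t where both exist.

    tau is the delay in samples (an assumption of this sketch).
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    return [(flow[t - tau], output[t]) for t in range(tau, len(output))]


# Hypothetical readings: the output responds to the flow two samples later.
flow = [10, 11, 12, 13, 14, 15]
output = [0, 0, 20, 22, 24, 26]

pairs = delayed_pairs(flow, output, tau=2)
print(pairs)
```

Because the delay is applied when the pairs are formed, trying a different value of τ in this sketch only requires calling the function again on the original data.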
Thus, improved systems and methods for preprocessing data for training and/or operating a support vector machine are desired.