Computational models, particularly process models, are often built to capture complex interrelationships between input parameters and output parameters. Various techniques, such as neural networks, may be used in such models to establish mathematical relationships between input parameters and output parameters. Once the models are established, they may provide predictions of the output parameters based on the input parameters.
One such modeling task may include the construction of a virtual sensor network for a particular type of machinery. Using modeling applications, an engineer can construct computational models and use them to analyze the behavior of data or to predict new data. In a modeling system, training data are usually used to select and train the model. Training data may be collected by measurements or generated by computer simulations. During the modeling process, a plurality of models may be fitted to the training data, and the model that best represents the training data may be selected. Further, independent parameters of the best model may also be determined based on the training data. For example, a normal distribution may be selected as the best statistical representation of an independent parameter of a model for a given set of training data, and distribution parameters, including a mean and a standard deviation of the normal distribution, may be determined for each independent parameter feeding into the model.
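As an illustration of the normal-distribution example above, the following minimal sketch estimates a mean and a sample standard deviation for each independent parameter in a small training set. The function name `fit_normal_parameters`, the parameter names, and the data values are hypothetical and chosen only for illustration; the sketch does not represent any particular modeling system.

```python
import statistics


def fit_normal_parameters(training_data):
    """Estimate (mean, sample standard deviation) for each input parameter.

    training_data maps a parameter name to a list of observed values.
    Assumes each parameter is adequately described by a normal distribution.
    """
    return {
        name: (statistics.fmean(values), statistics.stdev(values))
        for name, values in training_data.items()
    }


# Hypothetical training data for two independent parameters.
training_data = {
    "engine_speed": [1000.0, 1200.0, 1400.0, 1600.0, 1800.0],
    "fuel_rate": [2.0, 4.0, 6.0, 8.0, 10.0],
}

fitted = fit_normal_parameters(training_data)
for name, (mean, std_dev) in fitted.items():
    print(f"{name}: mean={mean:.2f}, std_dev={std_dev:.2f}")
```

In practice, several candidate distributions may be compared before a normal distribution is selected; this sketch covers only the parameter-estimation step.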
In order to construct a good computational model, the data quality, data coverage, and data structure of the training data are important. Certain analytical tools have been developed that enable engineers to evaluate the data quality and the data structure. For example, data quality issues may be identified by inspecting the traceability certificates of systems producing the data. Data structure issues may be detected with key fields or time records when attempting to combine data from two or more systems. However, detecting data coverage issues remains an elusive task.
For example, training data that are distributed uniformly in the modeling space may lead to efficient and accurate modeling. However, collected or simulated training data are usually denser in certain regions of the modeling space than in other regions. As a result, such a data coverage condition may lead to models that are insufficiently broad to cover the range of expected inputs, or that do not generalize well into underrepresented regions of a solution space. This can have unexpected or unintended consequences for control and diagnostic systems that rely on virtual sensor network technologies. In addition, the modeling process may take an unacceptably long time to converge. Thus, there is a need for modeling systems to evaluate and improve the coverage of the training data before the data are used for the modeling process.
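The uneven density described above can be made concrete with a simple diagnostic. The following sketch, offered only as an illustration and not as the disclosed method, divides a bounded modeling space into a regular grid, counts the training points in each cell, and lists the cells that contain no data. The function name `grid_coverage`, the bounds, the bin count, and the sample points are all assumptions.

```python
from collections import Counter
from itertools import product


def grid_coverage(points, bounds, bins_per_axis):
    """Count training points per grid cell of a bounded modeling space.

    points: list of tuples, one value per independent parameter.
    bounds: list of (low, high) pairs, one per parameter.
    Returns (counts, occupied_fraction, empty_cells).
    """
    counts = Counter()
    for point in points:
        cell = []
        for value, (low, high) in zip(point, bounds):
            index = int((value - low) / (high - low) * bins_per_axis)
            # Clamp so that values on or outside the bounds map to an edge bin.
            cell.append(min(max(index, 0), bins_per_axis - 1))
        counts[tuple(cell)] += 1

    all_cells = list(product(range(bins_per_axis), repeat=len(bounds)))
    empty_cells = [cell for cell in all_cells if cell not in counts]
    occupied_fraction = len(counts) / len(all_cells)
    return counts, occupied_fraction, empty_cells


# Hypothetical training points clustered near one corner of a 2-D space.
points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1), (0.9, 0.9)]
bounds = [(0.0, 1.0), (0.0, 1.0)]

counts, occupied_fraction, empty_cells = grid_coverage(points, bounds, bins_per_axis=2)
print("points per cell:", dict(counts))
print("occupied fraction:", occupied_fraction)
print("empty cells:", empty_cells)
```

A regular grid is practical only for a small number of parameters, because the number of cells grows exponentially with the dimension of the modeling space.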
Systems and methods for processing training data for a statistical classification application are described in U.S. Patent Publication No. 2005/0226495 to Li (“the '495 publication”). According to the method described in the '495 publication, confidence values may be calculated for training data elements to identify the probabilities of the training data elements belonging to identified classes. Further, an interactive scatter plot may be generated using the calculated confidence values, and the user may be able to select training data elements from the scatter plot.
Although the method described in the '495 publication may be effective for selecting training data, it may be problematic. For example, the method described in the '495 publication is only able to remove training data elements that are misclassified in the scatter plot, but fails to improve the overall coverage of training data in the modeling space. Furthermore, the method described in the '495 publication can only select training elements from an existing dataset, but fails to provide suggestions of additional cases for generating or obtaining more training data to correct coverage issues. In addition, the method described in the '495 publication relies on calculating confidence values for each data element and generating a scatter plot for each selection process, and thus may be time-consuming and/or computationally expensive. For example, the method in the '495 publication may have to generate 2N scatter plots for manual evaluation, where N is the number of independent parameters under consideration. Practical problems in this space could have hundreds of candidate parameters, rendering a manual search such as the one described in the '495 publication impractical.
The disclosed system and method for modifying data coverage in modeling systems are directed towards overcoming one or more of the shortcomings set forth above.