This application relates generally to analyzing data using machine learning algorithms to develop prediction models for generalization, and more particularly to cross validation of machine learning algorithms on distributed database systems.
Companies and other enterprises acquire and store large amounts of data and other information relevant to their operations, generally in large distributed databases. Successful companies may acquire, explore, analyze and manipulate the data in order to find facts and insights that characterize the data and that lead to new business opportunities and strategic advantage. Analyzing large amounts of data to gain insight that may be used for generalization and prediction is a complex task.
One approach to characterizing data is to use supervised learning. Supervised learning is a machine-implemented approach to analyzing a set of representative training data to produce an inferred function or model that can be used with a prediction function for generalization or prediction on another set of similar data. The training data is generally a subset of the data set comprising training samples, which a computer executing a supervised learning algorithm analyzes to produce the inferred function or model. Different models may be used with the training and prediction functions, and a metric function measures the performance of each model as the difference between the values predicted by the prediction function using that model and the actual values. The objective is to produce a model that results in the smallest difference between the predicted values and the actual values. However, a supervised learning model typically has parameters that cannot be fitted using the training data through this process, and other methods are needed to fix the values of these parameters.
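By way of illustration only, the relationship among the training, prediction and metric functions described above may be sketched as follows. The function names `train`, `predict` and `metric`, the one-variable least-squares model, and the sample values are all assumptions chosen for the sketch and are not drawn from any particular implementation.

```python
from statistics import mean


def train(samples):
    """Training function: fit y = a*x + b by ordinary least squares.

    `samples` is a list of (x, y) pairs; the returned model is the tuple (a, b).
    """
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in samples)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b


def predict(model, x):
    """Prediction function: apply the fitted model to a new input."""
    a, b = model
    return a * x + b


def metric(model, samples):
    """Metric function: mean squared difference between predicted and actual values."""
    return mean((predict(model, x) - y) ** 2 for x, y in samples)


# Hypothetical data: training samples and a separate set of similar data.
training = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
held_out = [(4.0, 9.0), (5.0, 11.0)]

model = train(training)
error = metric(model, held_out)
```

In this sketch a smaller value of `error` indicates a better model; any other measure of the difference between predicted and actual values could be substituted for the mean squared error.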
Cross-validation is an approach for assessing how the results of a statistical analysis will generalize to an independent data set. It is useful in prediction applications to estimate how accurately a predictive model will perform in practice. Cross-validation comprises partitioning a sample of data into complementary subsets, performing an analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set). To reduce variability, multiple rounds of cross-validation may be performed using different partitions, and the validation results of the multiple rounds averaged.
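As a minimal sketch of the partitioning, validating and averaging steps described above, a k-fold form of cross-validation may be written as follows. The helper names `k_fold_partitions` and `cross_validate`, the interleaved fold assignment, and the toy mean-value model are assumptions made for the illustration; in practice the samples would typically be shuffled before partitioning.

```python
from statistics import mean


def k_fold_partitions(samples, k):
    """Yield (training_set, validation_set) pairs for k rounds.

    Round i holds out every k-th sample starting at index i as the
    validation set and uses the remaining samples as the training set.
    """
    for i in range(k):
        validation = samples[i::k]
        training = [s for j, s in enumerate(samples) if j % k != i]
        yield training, validation


def cross_validate(samples, k, train_fn, metric_fn):
    """Train on each training set, score on its complementary validation set,
    and return the averaged score together with the per-round scores."""
    scores = [
        metric_fn(train_fn(training), validation)
        for training, validation in k_fold_partitions(samples, k)
    ]
    return mean(scores), scores


# Hypothetical toy model: always predicts the mean target of the training set.
def train_mean(samples):
    return mean(y for _, y in samples)


def mse(model, samples):
    """Mean squared difference between the constant prediction and actual values."""
    return mean((model - y) ** 2 for _, y in samples)


data = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
average_score, per_round_scores = cross_validate(data, 2, train_mean, mse)
```

Because `cross_validate` accepts the training and metric functions as arguments, the same loop could be repeated for each candidate set of model parameters and the averaged scores compared.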
Supervised learning and cross-validation require processes such as executing training, prediction and metric functions that query languages such as Structured Query Language (SQL) and the like generally cannot perform, and these processes normally cannot run directly within a database. It is desirable to provide systems and methods that afford a framework that operates within a database to execute such functions directly on stored data and produce measurements of model performance for multiple sets of values and for one or more sets of model parameters. It is to these ends that the present invention is directed.