Machine learning combines techniques from statistics and artificial intelligence to create algorithms that can learn from empirical data and generalize to solve problems in various domains such as natural language processing, financial fraud detection, terrorism threat level detection, human health diagnosis and the like. In recent years, more and more raw data that can potentially be utilized for machine learning models is being collected from a large variety of sources, such as sensors of various kinds, web server logs, social media services, financial transaction records, security cameras, and the like.
Traditionally, expertise in statistics and in artificial intelligence has been a prerequisite for developing and using machine learning models. For many business analysts and even for highly qualified subject matter experts, the difficulty of acquiring such expertise is sometimes too high a barrier to be able to take full advantage of the large amounts of data potentially available to make improved business predictions and decisions. Furthermore, many machine learning techniques can be computationally intensive, and in at least some cases it can be hard to predict exactly how much computing power may be required for various phases of the techniques. Given such unpredictability, it may not always be advisable or viable for business organizations to build out their own machine learning computational facilities.
The quality of the results obtained from machine learning algorithms may depend on how well the empirical data used for training the models captures key relationships among different variables represented in the data, and on how effectively and efficiently these relationships can be identified. Depending on the nature of the problem that is to be solved using machine learning, very large data sets may have to be analyzed in order to be able to make accurate predictions. As part of the typical workflow for developing and using predictive machine learning models, a data set may be split into a training subset and a test subset. A model may be trained to predict the values of a target or output variable using the values of corresponding input variables of the training subset, while the test subset may be used to evaluate the quality of the predictions made for “new” observation records which were not used for training the model. If the values of the target variables happen to be distributed differently in the test subset than they are in the training subset, the evaluation of the model may make the quality of the model appear to be worse than it should. This in turn may lead to more resources being consumed in attempts to re-train and re-test the model to achieve higher quality, which may in some cases substantially increase the overall cost of generating models for use in production mode.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood, that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.