This specification relates to machine learning systems.
Machine learning refers to techniques for using computing systems to train predictive models that use past training examples to predict the outcome of future events that are similarly situated as the training examples. For example, machine learning can be used to train a predictive model, or for brevity, model, that predicts the market value of a house given particular attributes of the house, e.g., square footage, ZIP code, etc. The attributes are referred to as features of the model. A collection of features associated with a single data point used to train the model is referred to as a training example.
Many large enterprises and government institutions utilize machine learning to generate predictions of many different types of phenomena, e.g., oil demand for a particular region, the incidence rate of the flu virus in January, and the likelihood that a prospective borrower is likely to default on a mortgage.
Generating predictive models can be expensive and time consuming. For a particular engineering team within an enterprise to generate a predictive model, the team must design the model and then acquire training data for training the model. Acquiring training data is often expensive because acquiring good examples and labeling the examples is time-consuming and tedious.
The efforts of multiple teams within an organization are often duplicated in designing and generating models. For example, multiple different teams within an organization may require the same type of model, and the multiple teams may each develop their own approach and expend their own effort acquiring and labeling training examples. While this may produce some models that are better than others, the under-performing models are simply subpar models that incurred duplicated effort. In other words, there is no way for these teams to collaborate or for better performing teams to share their work with lesser performing teams.
In addition, teams generally lack the ability to manage the model generation process from one iteration to the next in a disciplined way. Rather, the models are changed and retrained in an ad hoc manner. Retraining is often computationally expensive. Thus, undisciplined retraining of models further needlessly expends system resources.
Furthermore, when to retrain a model is inconsistent from one team to the next. For example, important data that changes the outcome of a housing price model, e.g., the onset of a recession, may cause only some teams to retrain their models when in reality, it would be best for the organization if all teams retrained their models in the face of such changes.
These widespread problems result in an environment in which it is difficult or impossible for management of an organization to determine which models are used by which teams, which teams design better models, and how to help teams that do not design and maintain good models.