Many data scientists lack the tools they need to work efficiently because those tools are not currently available to the public at large. Data scientists leverage large quantities of historical data to develop statistical models that can be used to predict future user behaviors, events, and outcomes. Because these statistical models can be integrated into software services to perform system-critical functions, any hindrance in the development and deployment of statistical model updates can be inconvenient, or even disastrous, for the software services provider and for the customers of the software services.
A long-standing need exists in the data science community for tools that facilitate the development, sharing, and deployment of statistical models. Unfortunately, the available tools that are best suited for developing statistical models are poorly suited for deploying the models into the production environments in which the models will operate. As a remedy, data scientists can use third-party tools to translate source code from a development language into a production language and/or into an executable file for a production environment. However, the use of third-party tools comes with a significant drawback.
The problem with deploying code through third-party tools is that there is limited or no visibility into, or assurance of, whether the translation process is being performed correctly and/or is maintaining the intricate relationships between the data transformations defined by the source code. A similar problem exists with production environments that directly accept the source code in the development language. The risks associated with having a black box translate source code for direct deployment of software services to tens of millions of potential customers are so great that data scientists are typically not permitted to release their models directly to a production environment. As a result, data scientists' models are carefully tested, translated, and/or otherwise handled by software engineers. The drawback of this deployment model is the latency it introduces between model development and model deployment. Further complicating matters, some production computing environments have to be manually configured so that their resources can support the operation of the models.
Because the work of data scientists can impact the functionality of complex software services, any hindrances, difficulties, or redundant steps in developing and deploying model updates can become roadblocks to deploying updates and fixes, or to otherwise resolving problems that impede the proper operation of a software service and that can affect revenue generation.
What is needed is a method and system for developing and deploying data science transformations from a development computing environment into a production computing environment, according to various embodiments.