Machine learning and artificial intelligence are increasingly important tools for automating and improving processes. For example, such tools may be used for fraud detection, operations monitoring, natural language processing and interactions, and environmental analysis. Many of these use cases require fitting (e.g., training) increasingly sophisticated models to large datasets containing millions of data units, each with hundreds of features (e.g., “big data”). However, because such models rely on large datasets, model modifications and data updates require time- and processor-intensive re-training.
In the related art, certain libraries (e.g., MLlib for Apache Spark or the H2O ML platform) effectively implement a limited number of models on large datasets. However, such libraries support only a limited range of model types, which are not calibrated to all problems. Therefore, if such libraries are used outside of this limited range, they may provide false results, which can cause compounding errors. For many problems, richer model structures are required to properly fit particular datasets and provide valuable analysis. Currently, custom models may be developed for particular use cases, but such customization requires extensive fitting to the dataset, which is very expensive (in terms of time and processing power) and not easily parallelized in a distributed computing environment.
Accordingly, there is a need for improved systems and methods to process a wide range of customized models at scale. Certain aspects of the present disclosure provide a distributed computing framework for executing arbitrary models at “big data” scale. Moreover, some aspects of the present disclosure provide tools for flexibly building generative models (e.g., Bayesian generative models).