Enterprises often evaluate various business scenarios to assess and manage the financial, engineering, and operational risks that arise from uncertain data. Because plans based on such risk analyses may involve millions of dollars, accurate and efficient simulation of the business scenarios is needed to establish the validity of possible decisions in a timely manner.
By way of example, consider an analyst who wants to forecast the risk of running out of processing capacity of a cloud infrastructure. For that, the analyst needs to combine various predictive models for CPU core demands and availability. These models are inherently uncertain due to imprecise prediction of future workload, possible downtime, delayed deployment, and so forth.
One tool for combining various predictive models is based upon probabilistic database systems that use probability distributions and models. Some probabilistic database systems allow users to evaluate queries that combine multiple externally defined models through invocations of stochastic black-box functions (also called variable-generation (VG) functions); queries are evaluated over VG-Functions by Monte Carlo sampling.
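By way of illustration only, the following Python sketch shows how a query combining two VG-Functions might be evaluated by Monte Carlo sampling. The function names (demand_vg, capacity_vg, overload_probability), the distributions, and all numeric values are hypothetical and chosen for the example; they are not taken from any particular probabilistic database system.

```python
import random


def demand_vg(rng, day):
    """Hypothetical VG-Function: one random draw of CPU cores demanded on `day`."""
    # Assumed model: demand grows linearly and is normally distributed.
    return rng.gauss(1000 + 5 * day, 100)


def capacity_vg(rng, day):
    """Hypothetical VG-Function: one random draw of CPU cores available on `day`."""
    # Assumed model: 12 racks of 100 cores, each independently up with
    # probability 0.97. `day` is unused here but kept for a uniform signature.
    return 100 * sum(rng.random() < 0.97 for _ in range(12))


def overload_probability(day, n_samples=10_000, seed=0):
    """Monte Carlo estimate of P(demand > capacity) on `day`."""
    rng = random.Random(seed)
    overloads = sum(
        demand_vg(rng, day) > capacity_vg(rng, day) for _ in range(n_samples)
    )
    return overloads / n_samples


if __name__ == "__main__":
    print(overload_probability(day=30))
```

Each sample invokes both black-box functions once, so the cost of a single query grows with the number of samples drawn.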
A challenge faced by probabilistic database-based simulation systems arises when models are parameterized and the system needs to explore a large parameter space to optimize for a given goal. Returning to the above example, a CPU core availability model may accept a set of candidate purchase dates and apply them according to a model for how long it takes to bring the hardware online. The analyst can then identify purchase dates that minimize the cloud's cost of ownership given a bound on the risk of overload. This is essentially a constrained optimization problem, in which each iteration is an entire probabilistic database query.
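The purchase-date example can be sketched as a brute-force search, in which every candidate date triggers a complete Monte Carlo evaluation. This is a minimal sketch under stated assumptions: the horizon, lead-time range, demand model, capacity figures, and cost function are all hypothetical, and exhaustive search is used only to make the structure of the problem visible.

```python
import random

HORIZON = 60         # days simulated (assumed)
LEAD_TIME = (5, 15)  # min/max days from purchase to hardware online (assumed)


def overloaded(rng, purchase_day):
    """Simulate one possible world; True if demand exceeds capacity on any day."""
    online_day = purchase_day + rng.randint(*LEAD_TIME)
    for day in range(HORIZON):
        demand = rng.gauss(1000 + 5 * day, 50)               # assumed demand model
        capacity = 1200 + (400 if day >= online_day else 0)  # assumed capacities
        if demand > capacity:
            return True
    return False


def overload_risk(purchase_day, n_samples=500, seed=0):
    """Monte Carlo estimate of the overload risk for one parameter value."""
    rng = random.Random(seed)
    hits = sum(overloaded(rng, purchase_day) for _ in range(n_samples))
    return hits / n_samples


def ownership_cost(purchase_day):
    """Assumed cost model: the hardware costs one unit per day it is owned."""
    return HORIZON - purchase_day


def best_purchase_day(candidates, risk_bound, n_samples=500):
    """Cheapest candidate whose estimated risk is within the bound, else None."""
    feasible = [
        day for day in candidates
        if overload_risk(day, n_samples) <= risk_bound  # one full query per candidate
    ]
    return min(feasible, key=ownership_cost, default=None)


if __name__ == "__main__":
    print(best_purchase_day(candidates=[0, 20, 40], risk_bound=0.05))
```

Under this assumed cost model, later purchase dates are cheaper but riskier, so the search returns the latest candidate that still satisfies the risk bound.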
A problem with this approach is the repeated (and potentially very costly) invocation of VG-Functions, in that each function is evaluated for most, if not all, possible parameter values. Moreover, a function may need to be evaluated over a range of steps (e.g., if it describes time series data, such as a daily CPU demand model), and the output at each step may depend on prior steps. Therefore, with parameterization, even relatively simple scenarios can take an unacceptable amount of time in many practical situations where a business decision must be made quickly and/or various parameterized what-if scenarios must be evaluated interactively. In sum, probabilistic database-based simulation systems become extremely slow when models are parameterized and the system is asked to explore a large parameter space to optimize for a given goal. Any solution that makes the process of parameter exploration faster is thus desirable.
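The growth in invocations can be made concrete with a small counting sketch. The step-wise demand model, the class name CountedDemandVG, and the parameter values below are hypothetical; the sketch only assumes a naive sweep in which every parameter value is simulated independently.

```python
import random


class CountedDemandVG:
    """Hypothetical step-wise VG-Function that counts its own invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, rng, previous, growth):
        # Assumed model: next day's demand depends on the previous day's demand.
        self.calls += 1
        return previous * (1 + growth) + rng.gauss(0, 10)


def simulate_path(vg, rng, growth, n_steps, start=1000.0):
    """Run one time series; steps are sequential because each needs the prior value."""
    demand = start
    for _ in range(n_steps):
        demand = vg(rng, demand, growth)
    return demand


def naive_sweep(vg, growth_values, n_samples, n_steps, seed=0):
    """Estimate mean final demand for every parameter value, one at a time."""
    rng = random.Random(seed)
    return {
        g: sum(simulate_path(vg, rng, g, n_steps) for _ in range(n_samples)) / n_samples
        for g in growth_values
    }


if __name__ == "__main__":
    vg = CountedDemandVG()
    naive_sweep(vg, growth_values=[0.0, 0.01, 0.02], n_samples=50, n_steps=30)
    # Invocations = parameter values x samples x steps.
    print(vg.calls)
```

In such a sweep the number of invocations is the product of the number of parameter values, the number of samples, and the number of steps, which is why even modest parameter spaces become expensive.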