Embodiments generally relate to providing methods and systems for optimizing the configuration and parameters of a workflow using an evolutionary approach augmented with intelligent learning capabilities using Big Data infrastructure.
The term “Big Data” is generally used to describe a voluminous amount of data, often semi-structured or unstructured, that would take too much time and/or be too costly to load into a traditional database for analysis. Although Big Data does not refer to any specific quantity, the term is often used with regard to terabytes or more of data. Often, the goal of a company when attempting to analyze Big Data is to discover repeatable business patterns.
Recently, Big Data analysis has been associated with the open source technology Apache Hadoop because the analysis of large datasets requires a software framework, such as “Hadoop MapReduce,” that allows developers to write programs to process large amounts of data in a highly parallel manner. Such parallel processing can be distributed among tens, hundreds, or even thousands of computers, and typically involves utilizing workflows that permit users to run a predefined sequence of steps to produce a final result. Each step in the workflow can run specialized algorithms, and each of the algorithms may require configuration, for example, in the form of Boolean, numeric, ordinal, or categorical parameters. Thus, for a workflow that includes many steps in an analysis pipeline, each with many configuration parameters, a large number of unique parameter combinations may exist that could be run, each of which produces a different result in the solution space. Scenarios exist wherein a user (such as a researcher) does not know the optimal combination of input parameters across the many steps in the workflow.
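By way of illustration only, the growth in the number of unique parameter combinations can be sketched as follows. The step names, parameter names, and allowed values are hypothetical and are not taken from any particular workflow; the sketch merely shows that the size of the search space is the product of the number of allowed values of each parameter.

```python
from itertools import product
from math import prod

# Hypothetical two-step workflow; every name and value below is illustrative only.
workflow_params = {
    "tokenize.lowercase": [True, False],                     # Boolean parameter
    "tokenize.min_length": [1, 2, 3, 4],                     # numeric parameter
    "cluster.linkage": ["single", "average", "complete"],    # categorical parameter
    "cluster.depth": ["shallow", "medium", "deep"],          # ordinal parameter
}

def count_combinations(params):
    """Return the number of unique parameter combinations in the search space."""
    return prod(len(values) for values in params.values())

def iter_combinations(params):
    """Yield each unique combination as a dict mapping parameter name to value."""
    names = list(params)
    for values in product(*(params[name] for name in names)):
        yield dict(zip(names, values))

if __name__ == "__main__":
    print(count_combinations(workflow_params))
```

Even this small hypothetical example yields 2 × 4 × 3 × 3 = 72 combinations, and each additional parameter multiplies the total again, which is why exhaustively running every combination may quickly become impractical.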
To address the challenge of identifying an optimal combination of input parameters, a user could design experiments within a Big Data infrastructure to execute a large number of iterations of the same pipeline in parallel (at the same time), with each instance using slightly different input parameters. For example, evolutionary algorithms and/or approaches exist that attempt to optimize the parameters of an analytic, and these are often referred to as “hill-climbing” algorithms. Evolutionary processes rely on random permutations of parameters to generate new solutions to evaluate and, in effect, to stumble upon even better solutions. Such approaches are time consuming, can be expensive, and may not be feasible in some cases because the available computer resources may not allow every possible parameter combination in the search space to be run.
For example, a traditional genetic software process utilizes a “chromosome” (effectively an array) of values, one value per configurable input parameter. Each chromosome therefore represents a complete set of initial conditions of a workflow to be evaluated. Initially, a random population (collection) of chromosomes is constructed and evaluated by executing the complete workflow with those initial conditions, and a quality score is assigned to each chromosome based on the quality of the output of the workflow generated with that chromosome's initial conditions. The top chromosome(s) are automatically copied to the next generation of the population, so as not to lose the best solution(s). The rest of the population is sampled and randomly mutated (or pairs are sampled and merged, or “crossed over”) to produce a population of new initial conditions to evaluate. In this sampling process, chromosomes with better scores are more likely to be selected for the crossover or mutation operations. In a conventional genetic software optimization process, the previous population of chromosomes is discarded after each generation of the evolutionary computation, causing the system to “forget” the vast majority of parameter combinations tried.
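The conventional process described above can be sketched, in a non-limiting manner, as follows. This is a minimal illustration under stated assumptions and not a definitive implementation: the search space, the `toy_score` function (which stands in for executing the complete workflow and scoring its output), the use of tournament selection to favor better-scoring chromosomes, and all default values are hypothetical choices made for the example.

```python
import random

# Hypothetical search space: one list of allowed values per configurable parameter.
SEARCH_SPACE = [
    [True, False],
    [1, 2, 3, 4, 5],
    ["low", "medium", "high"],
    ["a", "b", "c", "d"],
]

def random_chromosome(rng):
    """Build a chromosome: one randomly chosen value per parameter."""
    return [rng.choice(options) for options in SEARCH_SPACE]

def toy_score(chromosome):
    """Assumed stand-in for running the full workflow and scoring its output."""
    target = [True, 3, "high", "b"]  # arbitrary illustrative optimum
    return sum(1 for gene, want in zip(chromosome, target) if gene == want)

def select(population, scores, rng, k=3):
    """Tournament selection: better-scoring chromosomes are more likely to win."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]

def mutate(chromosome, rng):
    """Return a copy with one randomly chosen parameter re-drawn."""
    child = list(chromosome)
    i = rng.randrange(len(child))
    child[i] = rng.choice(SEARCH_SPACE[i])
    return child

def crossover(a, b, rng):
    """Single-point crossover: head of one parent joined to tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def evolve(generations=20, pop_size=12, elite=2, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(generations):
        scores = [toy_score(c) for c in population]
        ranked = sorted(range(pop_size), key=lambda i: scores[i], reverse=True)
        best_per_generation.append(scores[ranked[0]])
        # Copy the top chromosome(s) unchanged so the best solution is not lost.
        next_population = [list(population[i]) for i in ranked[:elite]]
        while len(next_population) < pop_size:
            if rng.random() < 0.5:
                child = mutate(select(population, scores, rng), rng)
            else:
                child = crossover(select(population, scores, rng),
                                  select(population, scores, rng), rng)
            next_population.append(child)
        # The previous population is discarded here, so the process "forgets"
        # every parameter combination that was not carried forward.
        population = next_population
    scores = [toy_score(c) for c in population]
    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], best_per_generation

if __name__ == "__main__":
    best_chromosome, history = evolve()
    print(best_chromosome, history)
```

Because the top chromosomes are copied forward unchanged and the toy score is deterministic, the best score recorded in `history` never decreases from one generation to the next. The reassignment of `population` at the end of each generation is the point at which a conventional process loses its record of previously tried combinations.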
The present inventors therefore recognized opportunities for providing methods and systems that implement an intelligent evolutionary process for optimizing workflows, in which the process learns which parameter combinations are most likely to produce satisfactory outputs, while also recognizing and noting which parameter changes have a positive, negative, or limited impact on the outcome, so that workflows can be optimized to produce the desired results.