In the handling of large data sets (frequently referred to as “big data”), the work of preparing data sets for analysis and/or for presentation in reports and/or visualizations can consume more time and/or more processing resources than either the analyses or the generation of the presentations themselves. As the size and number of data sets continue to increase, the correspondingly increasing variety of uses for data sets brings about a growing variety of data preparation operations that may need to be performed, and each data preparation operation takes ever longer to perform. As a result, bottlenecks may occur in the preparation of data sets that may greatly delay the availability of properly prepared data sets for subsequent analysis and/or presentation operations.
These bottlenecks have rendered increasingly infeasible such past practices as regularly performing a selected battery of data preparation operations on every data set, regardless of which data preparation operations are actually needed. Accordingly, the task of determining which data preparation operations actually need to be performed on each data set has become increasingly important.
Unfortunately, the increasing size of data sets also increases the difficulty of relying on personnel to manually select the data preparation operations that are to be performed on each data set. Manually inspecting even a large enough portion of a data set to identify which data preparation operations are needed becomes increasingly difficult and requires ever more time per data set. Additionally, the increasing variety of data preparation operations that may need to be performed to accommodate an increasing variety of uses for data sets can become overwhelming.