Data scientists and, more broadly, “knowledge workers” spend much of their time preparing data for analysis. By some estimates, about 80% of processing resources typically go into data preparation and only 20% into analytics. Thus, there is a trend in the industry to simplify data preparation and integrate it into analytic tools, so that knowledge workers can shape, cleanse, and enrich data interactively in the same environment they use to explore the data and run analytics.
In such an environment, the data to be analyzed can be large (e.g., the user might want to analyze a data set containing several million records). State-of-the-art data preparation technology cannot support an interactive experience for data sets of this size; the cost of an environment that could do so would be prohibitive. Thus, interactive data preparation is performed on comparatively small samples (e.g., 10,000 records).