Data wrangling is the process of converting or mapping data from a raw form into another format so that it is readily consumable for analytics, for example by cleaning unstructured data into a columnar format. During data wrangling, the user may want to split a date-time value into two separate columns, format the date in a specific way, or remove the time portion of the value to save space. Another example is merging log file data with user metadata so that the background of the user who executed an action can be understood.
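The two wrangling examples above (splitting a date-time value and merging log data with user metadata) can be illustrated with a minimal Python sketch. The field names (`user_id`, `action`, `timestamp`, `department`), the sample rows, and the `wrangle` helper are all hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
from datetime import datetime

# Hypothetical raw log rows and user metadata, invented for illustration.
raw_logs = [
    {"user_id": "u1", "action": "login", "timestamp": "2015-03-01 08:15:30"},
    {"user_id": "u2", "action": "upload", "timestamp": "2015-03-01 09:02:11"},
]
user_metadata = {
    "u1": {"department": "finance"},
    "u2": {"department": "engineering"},
}


def wrangle(row, users):
    """Split the timestamp into date and time columns and attach user metadata."""
    parsed = datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S")
    user = users.get(row["user_id"], {})  # unknown users yield no metadata
    return {
        "user_id": row["user_id"],
        "action": row["action"],
        "date": parsed.strftime("%m/%d/%Y"),  # date reformatted in a specific way
        "time": parsed.strftime("%H:%M:%S"),  # drop this column to save space
        "department": user.get("department"),
    }


wrangled = [wrangle(row, user_metadata) for row in raw_logs]
```

Each output row carries the date and time as separate columns alongside the merged user attribute, so a later step could drop the `time` column without reparsing the original value.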
For a large dataset on the scale of petabytes, the problem is how to create a smart, representative sample of the dataset that takes into account the trade-off between time and quality. A sample dataset that is a subset of the real dataset is needed because it is not physically possible to store the entire dataset on a single desktop machine. A self-service user ideally does not want to wait days for a sample dataset to be produced before beginning to create wrangling operations. It is also important to obtain a high-quality, representative sample to perform operations on so that the user does not waste time on multiple iterations of the scheduled job. For example, if the user samples only the first file in a directory, and that file holds the logs from the first day of the month and contains no logged errors, the wrangling operations may be created with logic errors: the format of a logged error value would be unexpected when the job runs on the full dataset, and the wrong wrangled output would be generated.
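To make the first-file pitfall concrete, the following sketch contrasts taking the head of the first file with drawing a uniform sample across every file. It uses reservoir sampling (Algorithm R), a standard technique substituted here purely for illustration and not a method described in this document. The file names, log line formats, and the in-memory `log_files` dictionary are hypothetical stand-ins for a real directory of logs.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable
    of unknown length, using one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for index, record in enumerate(records):
        if index < k:
            sample.append(record)
        else:
            # Keep the new record with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = record
    return sample


def iter_all_files(files):
    """Yield (file_name, line) pairs from every file, not only the first."""
    for name, lines in files.items():
        for line in lines:
            yield name, line


# Hypothetical in-memory stand-in for a directory of daily log files.
log_files = {
    "day01.log": ["INFO ok"] * 1000,
    "day02.log": ["INFO ok"] * 900 + ["ERROR code=500 msg=timeout"] * 100,
}

# Naive approach: the head of the first file never sees an error line.
first_file_only = log_files["day01.log"][:200]

# Sampling across all files draws from both days.
across_files = reservoir_sample(iter_all_files(log_files), k=200, seed=7)
```

The trade-off between time and quality is visible here: the reservoir approach needs only memory proportional to the sample size, but it still has to read every record once, which may be too slow at petabyte scale, whereas the first-file approach is fast but can miss entire classes of values such as logged errors.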
There is also the problem of how to effectively communicate to the user that the wrangling operations and the visualizations are executed on sampled data, as opposed to the complete dataset. For example, it may be detrimental for a data analyst to start sharing charts based on sampled data with colleagues, or for a data scientist to start implementing predictive algorithms, when in both scenarios the data does not include the full dataset. However, the user might still want to use the same analytics tool to get a feel for how the visualization would look with the real data.