Many data management interfaces and tools have been developed to help data scientists analyze data sets. For instance, graphing tools can be used to visually represent relative magnitudes of data stored in tabular form.
Some data sets are very complex and are stored in disparate formats and locations. Sometimes, due to human error, the data has also been entered incorrectly or inconsistently. These complexities and inconsistencies can make it difficult to intuitively process and understand the correlations that exist between the underlying data. Accordingly, it is sometimes necessary to transform the data into a more unified and comprehensible form before it can be properly analyzed.
Data scientists transform the data with discrete tasks. These tasks, which are also referred to as transforms, can include simple algorithms such as multiplication or addition. Other tasks are more complicated. For instance, some tasks are used to parse complex strings of data or to split, normalize, merge, reformat or perform other complex transformations on the data.
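By way of illustration only, such tasks may be modeled as simple functions that are applied to tabular data in sequence. The following is a hypothetical sketch (the function names and record layout are illustrative assumptions, not part of any particular data management software), showing one simple arithmetic transform and one more complex string-parsing transform:

```python
# Illustrative sketch: transforms modeled as functions applied to rows in sequence.
# The names and record layout here are hypothetical, not a specific product's API.

def multiply(rows, column, factor):
    """Simple arithmetic transform: scale every value in a column."""
    return [{**row, column: row[column] * factor} for row in rows]

def split_column(rows, column, sep, new_columns):
    """More complex transform: parse a string column into several new columns."""
    out = []
    for row in rows:
        parts = row[column].split(sep)
        out.append({**row, **dict(zip(new_columns, parts))})
    return out

# Applying two tasks in sequence forms a small transformation pipeline.
data = [{"price": 10, "name": "Ada Lovelace"}]
data = multiply(data, "price", 2)
data = split_column(data, "name", " ", ["first", "last"])
```

In this sketch, the output of each task becomes the input to the next, which is why the choice and ordering of tasks matters.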
It is common for data scientists to apply a plurality of tasks to complex data sets before the underlying data is ultimately transformed into the desired form for final analysis. However, it can be a difficult and time-consuming process to identify the appropriate tasks to be applied to the data, as well as to identify the specific order for applying desired tasks. In particular, the data scientists may not be familiar with all of the different transforms that are available for use with their data management software. Sometimes, it can also be difficult to know how one task might negatively impact another task in a transformation pipeline.
For instance, if a data scientist is trying to merge two tables having similar data, but the data in corresponding columns is not in the exact same format (e.g., addresses being presented in different formats), the scientist might invoke a normalization transform to facilitate the merge. However, this normalization could have an unintended consequence of reducing the data to a lowest common denominator (e.g., a format that eliminates the zip code for some of the addresses, if other addresses are already missing a zip code). As a result of this process, certain content might be omitted that would otherwise be required to perform a subsequent task, such as a task for graphing sales associated with the different addresses at a regional granularity (e.g., based on zip code).
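The scenario above can be sketched in code. In this hypothetical example (the normalization logic shown is one assumed implementation, reducing every address to the fields shared by all rows), merging a table whose addresses carry zip codes with a table whose addresses do not causes the zip code to be dropped everywhere, which silently breaks a later task that groups sales by zip code:

```python
# Hypothetical normalization transform: keep only the address fields
# that every row shares (the "lowest common denominator").
def normalize_addresses(rows):
    shared_keys = set.intersection(*(set(r["address"]) for r in rows))
    return [
        {**r, "address": {k: r["address"][k] for k in shared_keys}}
        for r in rows
    ]

# Table A has zip codes; table B does not.
table_a = [{"sale": 100, "address": {"street": "1 Main St", "zip": "02139"}}]
table_b = [{"sale": 250, "address": {"street": "9 Elm Ave"}}]

merged = normalize_addresses(table_a + table_b)

# A subsequent task that graphs sales by zip code now finds no zip field at all.
has_zip = all("zip" in r["address"] for r in merged)
```

Here `has_zip` is false for every row, including the rows that originally contained a zip code, so the downstream graphing task can no longer be performed.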
The foregoing example is only a simple illustration of how one task might have an undesired consequence on another task. Other examples include changing data from one type to another type that may not be compatible (e.g., changing dates to percentages or strings to integers) and which may render the data incomprehensible for subsequent processes.
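A type-conversion failure of this kind can likewise be sketched. In the following hypothetical example (the transform and its fallback behavior are assumptions for illustration), converting a string column to integers silently discards any value that cannot be parsed, leaving data that later tasks cannot interpret:

```python
# Hypothetical type-conversion transform: values that cannot be converted
# are replaced with None, so their content is lost to subsequent tasks.
def convert_column(rows, column, target_type):
    converted = []
    for row in rows:
        try:
            converted.append({**row, column: target_type(row[column])})
        except ValueError:
            converted.append({**row, column: None})  # incompatible value lost
    return converted

rows = [{"qty": "12"}, {"qty": "about ten"}]  # second value is not numeric
result = convert_column(rows, "qty", int)
# result[1]["qty"] is now None: incomprehensible to any later arithmetic task.
```

A subsequent task that sums or graphs the `qty` column would then fail or produce misleading results, even though the original data was meaningful.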
For very complex data sets, data scientists are often required to iteratively apply different tasks in different combinations to determine whether each task and combination of tasks is appropriate and/or compatible. It will be appreciated that this iteration can consume significant amounts of time and computer processing. This waste of computing resources is even more pronounced considering that it is often necessary to redundantly perform the same processes for designing the same or similar sequences of tasks to be applied to different domains, e.g., to different data sets or through different applications.
Accordingly, there continues to be an ongoing need for improved systems and tools for facilitating the identification and application of tasks to be used for performing desired transformations on data sets.