Data management interfaces and tools have been developed to help data scientists analyze data. For instance, graphing tools can be used to visually represent underlying data that is stored in a variety of formats and locations.
Some data sets are very complex and/or contain errors. These complexities and inconsistencies can make it difficult to intuitively process and understand the correlations that exist between the underlying data. Accordingly, it is sometimes necessary to transform the data into a more unified and comprehensible form before it can be properly analyzed.
Data scientists transform the data with tasks, which are also referred to as transforms or data transforms. Some tasks include simple data transforms, such as the multiplication or addition of the data. Other tasks are more complicated. For instance, some tasks are used to parse, split, normalize, merge, reformat and/or perform other complex transformations on the data.
It is common for transformation pipelines to be used to process complex data sets. These transformation pipelines include a plurality of tasks that are sequenced for execution in a predetermined order. Many of the tasks are dependent upon the particular attributes and types of data being transformed, as well as the outputs and attributes of related tasks in the transformation pipeline.
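By way of illustration only, the dependency of one task on the output of another can be sketched as follows. This is a minimal sketch and not a definitive implementation; the names `parse_numbers`, `double` and `run_pipeline` are hypothetical and are introduced solely for this example.

```python
def parse_numbers(rows):
    """Hypothetical task: convert a list of numeric strings into floats."""
    return [float(row) for row in rows]


def double(values):
    """Hypothetical task: multiply each value by two (expects numbers)."""
    return [value * 2 for value in values]


def run_pipeline(tasks, data):
    """Apply each task in its sequenced order, feeding each output forward."""
    for task in tasks:
        data = task(data)
    return data


raw = ["1.5", "2", "3"]

# Intended order: parse first, then transform the parsed numbers.
result = run_pipeline([parse_numbers, double], raw)
print(result)  # [3.0, 4.0, 6.0]

# Reversed order: double() repeats each string ("1.5" becomes "1.51.5"),
# so the later parse step fails on the malformed input.
try:
    run_pipeline([double, parse_numbers], raw)
    reversed_ok = True
except ValueError:
    reversed_ok = False
print(reversed_ok)  # False
```

In this sketch, the second task is only meaningful when it receives the numeric output of the first task, so the sequenced order determines whether the pipeline runs at all.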
During the initial assembly of the transformation pipeline and/or during subsequent analysis of the transformation pipeline, a data scientist might modify the transformation pipeline to change the functionality and/or efficiency of the transformation pipeline. These modifications include resequencing of tasks by adding tasks, deleting tasks and changing the sequenced order of the tasks, each of which can have a significant impact on the overall functionality of the transformation pipeline. A simple example to this point will now be provided.
In this example, a transformation pipeline is created for identifying a total number of unique customers that are listed in two tables. The total number of unique customers will ultimately be determined by counting the total entries (e.g., number of rows or unique entries containing customer names) that remain after merging the tables. In this example, the transformation pipeline includes a first task for normalizing the data (e.g., customer names), a second task for merging the tables and entries of customer names, and a third task for removing the duplicates. The last task is a summation or count of the rows or customer entries that remain in the normalized and merged data set.
If the tasks for merging and normalizing are swapped, the final count of unique customers might not be affected, inasmuch as the subsequent step will remove all of the duplicates that are created during the merge and/or normalization processes, regardless of whether the merge or normalization process is executed first. However, if the task for removing duplicates were sequenced before the merge or normalization tasks, the result might be very different, particularly if the final count includes new duplicates that are created during the subsequent normalizing and/or merging processes.
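The effect of resequencing in the foregoing example can be sketched as follows. This is a minimal sketch under stated assumptions: the sample customer names, the representation of the data as a list of tables, and the function names `normalize`, `merge`, `remove_duplicates`, `count_rows` and `run` are all hypothetical, and normalization is assumed to mean trimming whitespace and lowercasing.

```python
def normalize(tables):
    """Assumed normalization: trim whitespace and lowercase each name."""
    return [[name.strip().lower() for name in table] for table in tables]


def merge(tables):
    """Combine all tables into a single table of customer names."""
    return [[name for table in tables for name in table]]


def remove_duplicates(tables):
    """Drop repeated names within each table, keeping first occurrences."""
    return [list(dict.fromkeys(table)) for table in tables]


def count_rows(tables):
    """Count the customer entries that remain across all tables."""
    return sum(len(table) for table in tables)


def run(tasks, tables):
    """Apply each task in its sequenced order."""
    for task in tasks:
        tables = task(tables)
    return tables


# Hypothetical sample data: the same customers spelled inconsistently.
table_a = ["Alice Smith", "Bob Jones"]
table_b = ["alice smith ", "Carol White", "BOB JONES"]
tables = [table_a, table_b]

original = run([normalize, merge, remove_duplicates, count_rows], tables)
swapped = run([merge, normalize, remove_duplicates, count_rows], tables)
dedupe_first = run([remove_duplicates, normalize, merge, count_rows], tables)

print(original)      # 3
print(swapped)       # 3
print(dedupe_first)  # 5
```

With this sample data, swapping the merge and normalization tasks leaves the count at three, whereas removing duplicates first yields five, because the duplicates only appear after the names are normalized and the tables are merged.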
It will be appreciated that the foregoing example is only a very simple illustration and could, therefore, be easily resolved by a skilled data scientist who understands the best sequence for such a simple set of tasks. However, for more complex transformation pipelines, the data scientist might have to experiment with many different combinations and sequences of tasks in order to determine which sequence and combination of tasks is the most appropriate for a desired result. Unfortunately, any time the transformation pipeline is modified, there is a risk of unintended consequences, such as the creation of an incompatibility between one task and another task in the transformation pipeline and/or between a task and the target data. Accordingly, it is often necessary for the data scientist to tinker with the selection and sequence of tasks during multiple trial-and-error sessions before the data scientist can verify that the transformation pipeline has been properly reconfigured. During this process, several alternate variations of the transformation pipeline are executed on the target data set, wasting significant time and computational resources. The foregoing can be particularly problematic because the data scientist may not be able to discover any underlying compatibility and/or functionality problems of the transformation pipeline until after the lengthy processing of the data set is complete and/or the data is later analyzed in subsequent data analysis processes.
The difficulty in diagnosing compatibility issues that arise during the resequencing of transformation pipelines, which can result in the waste of time and computing processes, represents a technical problem the industry is attempting to resolve.