Large data sets may vary in size and organizational structure. As online and electronic transactions grow in popularity, the volume of data collected from them continues to grow, and big data sets are larger than ever. For example, a single table may hold billions of records (also referred to as rows) and hundreds of thousands of columns. In some instances, this data may be collected in a raw, unstructured, and undescriptive format. Traditional relational databases, however, may not be capable of handling tables of the size that big data creates.
As a result, the massive amounts of data in big data sets may be stored in numerous different data storage formats in various locations to serve diverse application and use case parameters. Many of these data storage formats use a Map/Reduce framework to transform input data into output variables. An output variable may pass through several layers of transformations before reaching the desired output format. Retracing the layers of transformations for a given variable may be difficult and time consuming. Some of the output data may contain and/or be derived from personally identifying information. Furthermore, duplicative output data may be generated, and such duplication may be difficult to detect and prevent.
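For illustration only, the layered transformations described above might be sketched as follows. This is a minimal in-memory sketch, not the framework of any particular storage format: the `map_reduce` helper, the sample `raw_transactions` data, and the two layers are all hypothetical and are assumed here only to show how an output variable can end up several steps removed from its input.

```python
from collections import defaultdict
from functools import reduce


def map_reduce(records, mapper, reducer):
    """Hypothetical helper: map each record to (key, value) pairs,
    group the values by key, then reduce each group to one value."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Assumed raw input: (customer_id, purchase_amount_in_cents) pairs.
raw_transactions = [("c1", 1200), ("c2", 500), ("c1", 800), ("c3", 500)]

# Layer 1: total spend per customer.
spend_per_customer = map_reduce(
    raw_transactions,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda a, b: a + b,
)

# Layer 2: number of customers sharing each spend total.
# The customer identifiers are dropped at this step.
customers_per_total = map_reduce(
    spend_per_customer.items(),
    mapper=lambda item: [(item[1], 1)],
    reducer=lambda a, b: a + b,
)

print(spend_per_customer)
print(customers_per_total)
```

In this sketch, the second-layer output no longer carries customer identifiers, and nothing records which input records or intermediate variables produced each value. Retracing an output variable therefore requires re-reading the transformation code layer by layer, which is the difficulty noted above.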