Large data sets may exist in various sizes and organizational structures. As online and electronic transactions have become increasingly popular, the volume of collected data continues to grow, with big data comprising ever larger data sets. For example, billions of records (also referred to as rows) and hundreds of thousands of columns worth of data may populate a single table. In some instances, the data may be collected in a raw, unstructured, and undescriptive format. Moreover, traditional relational databases may not be capable of handling tables of the size that big data creates.
As a result, the massive amounts of data in big data sets may be stored in numerous different data storage formats in various locations to service diverse application parameters and use case parameters. Data variables resulting from complex data transformations (e.g., model scores, risk metrics, etc.) may be central to deriving valuable insight from data-driven operation pipelines. Many of the systems employing these storage formats use transformations to convert input data into output variables, and the transformations are typically hard-coded into the systems. As a result, retroactively determining the evolution of an individual variable may be difficult, as retracing the layers of transformations applied to that variable can be time consuming. Some of the output data may also contain and/or be derived from personally identifying information; access to such data may be restricted, and layers of derivation may make tracking such data difficult. Furthermore, duplicative output data is frequently generated, consuming processing and storage resources, yet the duplication may be difficult to detect and prevent.
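To illustrate the lineage problem described above, consider a minimal Python sketch of a hard-coded transformation pipeline. The column names, formulas, and weights here are hypothetical and serve only to show how layered, inlined transformations obscure a variable's derivation:

```python
# Hypothetical hard-coded pipeline: each stage inlines its formula,
# so no record of which inputs and steps produced an output variable
# survives outside the source code itself.

def normalize_balance(raw_balance: float, max_balance: float = 100_000.0) -> float:
    # Stage 1: scale a raw input into [0, 1] (hypothetical formula).
    return min(raw_balance / max_balance, 1.0)

def utilization_factor(balance_norm: float, credit_limit: float) -> float:
    # Stage 2: derived from the output of stage 1 (hypothetical formula).
    return balance_norm * 100_000.0 / max(credit_limit, 1.0)

def risk_score(utilization: float, delinquencies: int) -> float:
    # Stage 3: the "output variable" that downstream consumers see.
    # Its dependence on raw_balance is now two layers removed and is
    # discoverable only by reading every intermediate function.
    return 0.7 * utilization + 0.3 * min(delinquencies / 10.0, 1.0)

# A consumer of `score` receives a single number with no attached lineage:
balance_norm = normalize_balance(raw_balance=42_500.0)
utilization = utilization_factor(balance_norm, credit_limit=60_000.0)
score = risk_score(utilization, delinquencies=2)
print(f"risk score: {score:.3f}")
```

Because each formula is inlined, answering the question of which raw fields and which transformations produced `score` requires reading the code for every stage. The same opacity applies to determining whether personally identifying information flows into a derived variable, or whether two pipelines are redundantly computing the same output.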