Large data sets may exist in various sizes and organizational structures. With big data comprising data sets larger than ever, the volume of data collected as a result of the increased popularity of online and electronic transactions continues to grow. For example, billions of records (also referred to as rows) and hundreds of thousands of columns' worth of data may populate a single table. In some instances, the large volume of data may be collected in a raw, unstructured, and undescriptive format. Traditional relational databases, however, may not be capable of sufficiently handling the size of the tables that big data creates.
As a result, the massive amounts of data in big data sets may be stored in numerous different types of data storage. Sensitive data may be copied and stored in various locations across the different types of data storage for various use cases. Tracking the sensitive data may be difficult because users may copy and distribute data. Typically, output data is mapped or derived from raw data, so the mapped values may change as the raw source data changes. A change in the raw data may result in sensitive data, such as personally identifying information (PII), appearing unexpectedly in output columns or files. Varying source data contents may thus impede tracking sensitive data in a big data environment.
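The way a change in raw data can surface PII in an output column may be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the text above: the `derive_output` mapping, the `notes` and `summary` field names, the sample rows, and the two regular expressions, which would be far too narrow for real PII detection.

```python
import re

# Hypothetical patterns; real PII detection would need much broader coverage.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def derive_output(raw_rows):
    """Hypothetical mapping: copy the free-text 'notes' field into a 'summary' column."""
    return [{"id": row["id"], "summary": row["notes"].strip()} for row in raw_rows]


def scan_columns(rows):
    """Return {column: sorted PII kinds} for every column with at least one match."""
    findings = {}
    for row in rows:
        for column, value in row.items():
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(str(value)):
                    findings.setdefault(column, set()).add(kind)
    return {column: sorted(kinds) for column, kinds in findings.items()}


# The mapping is identical in both runs; only the raw source contents differ.
raw_before = [{"id": 1, "notes": "customer asked about shipping"}]
raw_after = [{"id": 1, "notes": "customer SSN is 123-45-6789, email jo@example.com"}]

print(scan_columns(derive_output(raw_before)))
print(scan_columns(derive_output(raw_after)))
```

In this sketch the first scan reports no findings, while the second flags the `summary` column for both pattern kinds, even though the mapping itself never changed.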