Computing systems are used to store and retrieve data. Due to the proliferation of data sources and the connected nature of computing systems, tracking where data originates and how it was processed has become extremely difficult. The data generated by computing systems is typically stored in a database in a format defined by the processing application. For example, a computing system used to process health records will store the data in a format that allows a health practitioner to retrieve and process patient data. Many computing systems are designed to process health records, but they do not necessarily store the records in the same format. When data is collected from each source system, it is often critical to the collector that the data abide by format and business rules. For example, pharmaceutical data must comply with standards before it can be submitted to the FDA. Determining compliance requires not only verifying the format but also knowing how the data was processed and where it came from. Recording the data sources and processing steps for data as it moves through the data supply chain is called data tracking.
Because multiple systems may be involved in producing, processing and sharing data, data tracking has become a necessity. Sharing data requires that the storage and exchange format be standardized. Standards exist to expedite analysis or allow sharing. Standards also exist to protect privacy and ensure safety. Personal or financial data must be tracked to comply with privacy laws. Financial data must be tracked to comply with accounting rules and regulations.
Although standards are necessary for sharing data, they can create challenges for data tracking. Computing systems that need to share data are not necessarily produced by the same vendor. If multiple entities are involved in data sharing, each entity can create its own standard. Even when multiple entities agree on a standard, multiple revisions are necessary as the standard is refined. Furthermore, in computing systems designed for scientific research, the discovery nature of science necessitates the creation of new domains to be added to the standard. Thus, creating standardized data is often a multistep process with multiple versions of data at each step. As the number of steps increases, data tracking becomes increasingly difficult.
In order to track data, many approaches and tools have been utilized. Datasets can be manually converted and transferred to comply with standards and regulations. Data describing the source of data, the type of processing, and the data standards utilized is called metadata. Metadata is critical to data tracking, but in most current systems it is manually recorded. When metadata is manually entered, either before or after the data is transferred and converted, it can get out of sync with the datasets. In other words, the metadata may not actually reflect what was performed on the data. Accurate data tracking requires knowledge of what metadata corresponds to a particular dataset.
Extraction, Transformation and Loading (ETL) programs are used to convert data. ETL programs are either manually written using a programming language or created using an ETL creation tool. An ETL creation tool is well suited to automatically creating ETL programs that conform to common conversion patterns. The ETL tool user selects pre-built conversion building blocks and manually fills in specific parameters. For example, a user may select a building block that writes data into a database. The user manually fills in the data source connection parameters and specifies how data will be mapped from a source dataset to a target dataset. Although the ETL program is then created automatically, it is still up to the ETL programmer to manually record which ETL program was used on every resultant dataset. Datasets may go through a series of cascading conversions and validations. Each step of the series requires a different ETL program. For data tracking purposes, the ETL program must be recorded and related to the dataset. This is especially important if a dataset fails validation at a final step. In order to determine which conversion step introduced invalid data, a mechanism to retrieve the ETL program and the resultant datasets is necessary. For example, if a dataset has invalid data, the previous dataset and the latest conversion must be examined. This is problematic because each conversion step may have multiple ETL versions. The multiple versions may be due to data irregularities or to variations in the business rules implemented. In current practice, the ETL program associated with each step is manually recorded. Due to the complexity of manual recording processes, datasets can get out of sync with the ETL program.
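To make concrete what must be recorded at each step of a cascading conversion, the following is a minimal sketch in Python. It is an illustration only, not a description of any existing ETL tool: the names `fingerprint`, `ConversionRecord`, `run_step`, and `trace`, the version labels, and the sample patient row are all hypothetical. The sketch identifies each dataset by a content hash and logs which ETL program version turned which input into which output, so that a dataset that fails validation can be walked back through the steps that produced it.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


def fingerprint(rows: List[Row]) -> str:
    """Return a stable SHA-256 digest identifying a dataset's contents."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class ConversionRecord:
    step: str         # hypothetical name of the conversion step
    etl_version: str  # hypothetical version label of the ETL program
    input_id: str     # fingerprint of the dataset fed into the step
    output_id: str    # fingerprint of the dataset the step produced


def run_step(step: str, etl_version: str, convert: Callable[[Row], Row],
             rows: List[Row], log: List[ConversionRecord]) -> List[Row]:
    """Apply one conversion and log which program produced which dataset."""
    output = [convert(row) for row in rows]
    log.append(ConversionRecord(step, etl_version,
                                fingerprint(rows), fingerprint(output)))
    return output


def trace(output_id: str, log: List[ConversionRecord]) -> List[ConversionRecord]:
    """Walk back from a dataset to the steps that produced it, latest first."""
    by_output = {rec.output_id: rec for rec in log}
    chain: List[ConversionRecord] = []
    seen = set()  # guards against a step whose output equals its input
    while output_id in by_output and output_id not in seen:
        seen.add(output_id)
        rec = by_output[output_id]
        chain.append(rec)
        output_id = rec.input_id
    return chain


# Two cascading conversions over a made-up patient row.
log: List[ConversionRecord] = []
raw = [{"patient": "p1", "sbp": "120"}]
typed = run_step("to_int", "v2",
                 lambda r: {**r, "sbp": int(r["sbp"])}, raw, log)
final = run_step("rename", "v1",
                 lambda r: {"patient_id": r["patient"], "systolic": r["sbp"]},
                 typed, log)

for rec in trace(fingerprint(final), log):
    print(rec.step, rec.etl_version)
```

Because the log entry is written by the same call that performs the conversion, the record cannot drift from the dataset in the way a manually kept record can. Note that this sketch still tracks whole datasets only, not individual rows or elements.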
Currently, systems exist that allow a user to manually capture metadata describing data conversion activities. These systems are often referred to as metadata management or semantic management systems. They allow users to manually enter data describing how datasets are going to be converted or how the datasets were converted. When using these systems, it is up to the user to ensure that the conversion programs convert data according to the metadata that was entered. For instance, the metadata may indicate that a data element in a source dataset is to be extracted, undergo a format conversion, and then be loaded into a data element in a target dataset. It is up to the ETL programmer to create programs that ensure the data element is extracted, converted, and loaded according to the metadata entered into the metadata management system.
Systems exist to manage changes to computer programs. These systems are referred to as software revision control or software configuration management systems. They manage changes to programs by applying a new version number to a program whenever it is changed. Revision control systems can be used to track revisions of conversion programs. The problem is that these systems are designed to track revisions in computer programs, not computer data. They were not designed to associate a conversion program with its resultant data.
Methodologies exist that associate metadata with a dataset. Using a system that tracks data workflow, a workflow step could be created that automatically logs metadata information associated with a dataset conversion. This methodology cannot record metadata at a row or element level. For example, it can store information regarding the processing of an entire dataset, but it cannot record information regarding the processing of an individual data element such as a patient's blood pressure. Therefore, a system does not exist that automatically applies revision management to data conversion metadata and associated data down to the row and element level.
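The difference between dataset-level and element-level recording can be illustrated with a short Python sketch. This is an assumption-laden example of what element-level metadata might look like, not a description of any existing system: the names `ElementProvenance` and `convert_with_provenance`, the mapping layout, the rule names, and the sample blood pressure value are all hypothetical. For every converted element, the sketch records the row, the source field, the rule applied, and the values before and after conversion.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ElementProvenance:
    row_index: int     # which row the element belongs to
    target_field: str  # element name in the target dataset
    source_field: str  # element name in the source dataset
    rule: str          # hypothetical name of the conversion rule applied
    source_value: Any
    target_value: Any


# target field -> (source field, rule name, conversion function)
Mapping = Dict[str, Tuple[str, str, Callable[[Any], Any]]]


def convert_with_provenance(
    rows: List[Dict[str, Any]], mapping: Mapping
) -> Tuple[List[Dict[str, Any]], List[ElementProvenance]]:
    """Convert rows and emit one provenance record per converted element."""
    converted: List[Dict[str, Any]] = []
    provenance: List[ElementProvenance] = []
    for index, row in enumerate(rows):
        out: Dict[str, Any] = {}
        for target, (source, rule, convert) in mapping.items():
            value = convert(row[source])
            out[target] = value
            provenance.append(
                ElementProvenance(index, target, source, rule, row[source], value)
            )
        converted.append(out)
    return converted, provenance


# A made-up source row holding blood pressure as "systolic/diastolic".
source_rows = [{"id": "p1", "bp": "120/80"}]
mapping: Mapping = {
    "patient_id": ("id", "copy", lambda v: v),
    "systolic_mmHg": ("bp", "split_systolic", lambda v: int(v.split("/")[0])),
}

target_rows, records = convert_with_provenance(source_rows, mapping)
for rec in records:
    print(rec.row_index, rec.target_field, rec.rule,
          rec.source_value, "->", rec.target_value)
```

A dataset-level workflow log would hold a single entry for this conversion. Here, the systolic value of one patient can be traced to the exact source element and rule that produced it.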