1. Technical Field
The present invention relates to a method and system for de-identifying data and, more particularly, to a computer-implemented method wherein a de-identification protocol is selectively mapped to a business rule at runtime via an ETL tool.
2. Discussion of the Related Art
Across various industries, data (e.g., data related to customers, patients, or suppliers) is shared outside secure corporate boundaries. Various initiatives (e.g., outsourcing tasks, performing tasks off-shore, etc.) have created opportunities for this data to become exposed to unauthorized parties, thereby placing data confidentiality and network security at risk. In many cases, these unauthorized parties do not need the true data values to perform their job functions. Examples of data requiring de-identification include, but are not limited to, names, addresses, network identifiers, social security numbers, and financial data.
Conventional data de-identification techniques are developed manually and implemented independently, in an ad hoc and subjective manner, for each application. Because sensitive fields and information cannot be consumed directly by batch or real-time processes, such processes, for example Extract/Transform/Load (ETL), are stand-alone processes in which live data is sourced in batch or in real time. Thus, data requiring de-identification located within a data source is initially discovered and profiled by a discovery tool. The discovery tool output is manually reviewed by a developer, who then defines the data de-identification parameters to apply to the discovered data based on the developer's understanding of the business rules. Specifically, an ETL developer manually identifies various field types and then maps a single, default de-identification technique to each field type, enabling the ETL process to de-identify fields of that type. The resulting de-identified data is subsequently delivered to the target environment. As a result, the default de-identification technique is effectively built into the ETL tool at design time. Should the business rule change, or should different targets have different de-identification requirements, the built-in technique may not be sufficient to de-identify the data.
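The conventional, design-time approach described above can be illustrated with a minimal sketch. All names here (the field names, the field types, the three techniques, and the function `load_to_target`) are hypothetical and chosen only for illustration; they are not taken from any particular ETL tool. The sketch shows a single default technique hard-coded per field type, so the delivered data is the same regardless of the target environment.

```python
import hashlib

# Hypothetical de-identification techniques (illustrative assumptions only).
def mask_all_but_last4(value: str) -> str:
    """Replace every character except the last four with 'X'."""
    return "X" * max(len(value) - 4, 0) + value[-4:]

def hash_value(value: str) -> str:
    """Replace the value with a truncated SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

def redact(value: str) -> str:
    """Replace the value with a fixed placeholder."""
    return "REDACTED"

# Field types assigned manually by the developer after reviewing the
# discovery tool output (hypothetical field names).
FIELD_TYPES = {
    "customer_name": "name",
    "customer_ssn": "ssn",
    "street": "address",
}

# Design-time mapping: exactly one default technique per field type,
# fixed when the ETL job is built.
DEFAULT_TECHNIQUE = {
    "name": hash_value,
    "ssn": mask_all_but_last4,
    "address": redact,
}

def deidentify_record(record: dict) -> dict:
    """Apply the built-in default technique to each recognized field."""
    result = {}
    for field, value in record.items():
        technique = DEFAULT_TECHNIQUE.get(FIELD_TYPES.get(field))
        # Fields with no assigned type pass through unchanged.
        result[field] = technique(value) if technique else value
    return result

def load_to_target(records: list, target_name: str) -> list:
    """Deliver de-identified records to a target.

    target_name is deliberately unused: because the mapping is fixed at
    design time, every target receives the same treatment.
    """
    return [deidentify_record(r) for r in records]
```

In this sketch, changing a business rule or supporting a target with different requirements means editing `DEFAULT_TECHNIQUE` and rebuilding the job, which is the limitation the paragraph above identifies.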