In a world increasingly shaped by readily available analytics, data science (DS), machine learning (ML) and artificial intelligence (AI) techniques, value and competitive differentiation often stem from the data that is available for processing. In many domains, for reasons including but not limited to business, operational, legal, regulatory, security and privacy concerns, it is desirable to guarantee certain invariants during data processing.
For example, the HIPAA Privacy Rule refers to protected health information (PHI) as “individually identifiable health information.” Entities that handle PHI are subject to a number of business and operational restrictions. In order to avoid such restrictions, it may be desirable to assert that, if data inputs are not PHI, then at no point during processing will PHI be created. There are similar examples in other domains involving potentially sensitive information including but not limited to identifiable information (II) (a superset of the traditionally narrow personally identifiable information (PII)) as well as consumption data (be that of physical or virtual goods, services or content), location data, communications data, social graph information, government records, vehicle telematics, blockchain-related information, etc.
When processing potentially sensitive data, there are a number of conventional approaches to ensuring the absence of criticality, involving a combination of three techniques. (1) Data sanitization: ahead of data processing, the data is pre-processed to establish various properties, such as a certain level of k-anonymity or the absence of certain types of fields in the data. This may be combined with a safe harbor-type attestation by the entities involved in processing the data. (2) Expert determination: an expert uses statistical or scientific principles to ascertain with a high degree of certainty that criticality has not been achieved and/or will not be achieved during data processing. (3) Externalizing critical operations: operations that involve criticality are executed elsewhere, often at a separate business entity.
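As one illustration of the pre-processing check in technique (1), the following is a minimal sketch of verifying k-anonymity over a set of quasi-identifier fields. The record layout, field names and the choice of quasi-identifiers are hypothetical and would vary by domain:

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    appears in at least k records."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return all(count >= k for count in groups.values())

# Hypothetical records; ZIP code and age band serve as quasi-identifiers.
records = [
    {"zip": "10001", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "10001", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "10002", "age_band": "40-49", "diagnosis": "C"},
]

# The third record forms a group of size 1, so 2-anonymity fails.
print(is_k_anonymous(records, ["zip", "age_band"], k=2))  # → False
```

A check of this kind could gate a processing pipeline, refusing to proceed unless the asserted property holds for the input data.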
Each approach has its drawbacks. For example, data sanitization is typically performed in preparation for multiple operations on the data. As a result, the data is typically over-sanitized via omission, redaction, randomization, coding and related techniques. This over-sanitization degrades the quality of the output of data processing operations such as training machine learning and AI models. Expert determination focuses on the data being processed as well as the systems, controls and workflows used to process it. It is usually the case that the expert(s) analyze a sample of the data to reach a determination. When the data materially changes in either breadth or depth, a new expert determination is required. This introduces cost and friction, as new data may not be readily usable until a new determination is reached. Externalizing critical operations adds cost and complexity.
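To make the over-sanitization drawback concrete, the following sketch applies omission, redaction and generalization to a record. The field names and coarsening rules are illustrative assumptions, not drawn from any particular standard; note how much per-record detail is lost regardless of which downstream operation actually needed it:

```python
def sanitize(record):
    """Over-sanitize a record: coarsen quasi-identifiers and omit
    direct identifiers and free text entirely (illustrative rules)."""
    return {
        # Generalize the 5-digit ZIP code to its first three digits.
        "zip": record["zip"][:3] + "**",
        # Bucket exact age into a decade band; suppress ages 90 and over.
        "age_band": "90+" if record["age"] >= 90 else f"{record['age'] // 10 * 10}s",
        # Direct identifiers (e.g. name) and free-text notes are omitted.
    }

print(sanitize({"name": "Jane Doe", "zip": "10001", "age": 34, "notes": "..."}))
# → {'zip': '100**', 'age_band': '30s'}
```

Because the same coarse output must serve every subsequent operation, a model trained on such records can no longer exploit the exact ZIP codes, ages or notes, even where retaining them would not have created criticality.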
What is needed are techniques and supporting systems that avoid criticality and avoid these and other drawbacks inherent in current approaches.