Transmitting files that include sensitive personal data or confidential information in addition to innocuous or non-sensitive data is common in many sectors, including business, communications, education, and healthcare. For example, records of financial transactions sent between vendors and banking institutions may comprise personally identifiable information and account information along with details of a transaction, such as a payment amount. Medical records sent between healthcare providers and insurance companies may comprise sensitive health information along with general billing codes and procedures. Frequently, files comprising sensitive data are sent via networks and stored in databases residing on a cloud for future retrieval. Unauthorized access to such sensitive data is a concern. In addition, given the high volume of files that must be transmitted, the efficiency and performance of systems that process these files are a concern.
One approach to address the problem of unauthorized access involves removing or replacing sensitive data in files before transmitting them to a final destination, that is, “stripping” data from files. For example, there may be a need to remove or replace voter information, social security numbers, names, addresses, dates of birth, account information, or a variety of other personal identifiers. Removal refers to generating a file without the strings of characters comprising sensitive information, such as a de-identified file that contains no personal identifiers. Replacement refers to transforming strings of characters containing the sensitive information into another form that is not sensitive. Replacement methods include encryption and aggregation. In aggregation, specific data, such as an exact street address, are replaced by generalized data, such as a postal ZIP code. Thus, removal and replacement techniques generate files stripped of sensitive information. Such files are referred to as stripped files. Files may be stripped to different degrees. That is, stripped files may be partially or completely stripped of sensitive data.
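As an illustration only, the distinction between removal and replacement by aggregation can be sketched in a few lines of Python. The record layout, the field names, and the functions strip_by_removal and strip_by_aggregation are hypothetical and are not taken from any particular system. The sketch assumes that a record is a flat mapping of field names to strings and that the street address ends in a five-digit ZIP code.

```python
import re

# Hypothetical field names, chosen only for illustration.
SENSITIVE_FIELDS = {"name", "ssn", "street_address"}


def strip_by_removal(record):
    """Return a copy of the record without any sensitive fields."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def strip_by_aggregation(record):
    """Return a copy in which the exact address is replaced by its ZIP code.

    Assumes the address string ends in a five-digit ZIP code, optionally
    followed by a four-digit extension.
    """
    stripped = strip_by_removal(record)
    address = record.get("street_address", "")
    match = re.search(r"\b(\d{5})(?:-\d{4})?\s*$", address)
    if match:
        # Keep only the generalized value, never the full address.
        stripped["zip_code"] = match.group(1)
    return stripped


record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "street_address": "12 Example St, Springfield, IL 62704",
    "billing_code": "A100",
    "amount": "125.00",
}

removed = strip_by_removal(record)
aggregated = strip_by_aggregation(record)
```

In this sketch, removed retains only the billing code and the amount, while aggregated additionally carries the generalized ZIP code in place of the exact address. The original record is left unmodified, so a real system would still need to dispose of it securely.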
Indeed, regulations commonly impose requirements to remove or replace sensitive data and to store only stripped files. Frequently, this means immediate, real-time data processing to strip sensitive data from a stream of received files. For example, financial industry users may need to meet the Payment Card Industry Data Security Standard (PCI DSS) when storing data originally received in near-continuous streams of transactions between merchants and banking institutions. Further, healthcare providers may need to meet Health Insurance Portability and Accountability Act (HIPAA) standards when transferring patient records between providers or between providers and insurance agencies. These exemplary applications of data stripping raise both security and performance concerns.
Often, one or more dedicated servers follow protocols to process data and route files between end users. The dedicated servers may receive a near-continuous stream of files comprising sensitive data from one end user and strip the sensitive data before transmitting the stripped files to another end user.
Use of traditional, server-based systems for stripping sensitive information can present a security challenge to an organization. Memory blocks on the server comprise sensitive data, and file pointers on the server may indicate the memory addresses of blocks of sensitive data. File pointers and memory blocks may persist at each step of a data stripping process, resulting in a chain of file pointers that unauthorized users may follow from the file stripped of sensitive data back to the original file comprising sensitive data. In addition, traditional systems relying on servers may process large quantities of sensitive data on a single server. For example, servers that process credit card transactions and send information between vendors and banking institutions may receive thousands of files comprising sensitive account information about millions of accounts each day. If those servers are compromised, a significant amount of sensitive data may be at risk.
In addition, traditional server-based data processing methods to strip sensitive data from files suffer from limitations in scalability and efficiency. During a surge of received files, server-based data processing may face challenges with process scheduling. That is, server-based data processing may be unable to effectively assign priority of execution, manage load balancing, allocate memory use, predict resource availability, or work within time constraints. During inactive periods in which few files are received, server-based data processing methods may face inefficiencies and unnecessary costs associated with idle capacity. Thus, server-based methods require developers to allocate resources for variable workloads in advance based on a set of potentially inaccurate assumptions.
In view of the shortcomings and problems with traditional methods of stripping sensitive data, an improved system and method for secure, real-time file stripping is desired.