Software applications that work in conjunction with often large, complex databases are widely known. For purposes of this disclosure, such an application is referred to as a database-centric application (DCA). It is not uncommon today for developers of DCAs to outsource the testing of a DCA to a testing entity that can perform the necessary testing more efficiently than the developer. In these situations, the testing entity is provided with an executable or object code version of the DCA along with a copy of a database comprising “real world” data for use in testing the DCA. So long as the information in the database used for this purpose does not include confidential information, the process can work well. Even in those instances where the database comprises confidential information, a sufficient level of trust in the relationship between the DCA developer and the testing entity may exist such that the developer is willing to share the otherwise confidential database with the testing entity. However, the recent development of more stringent privacy laws and regulations has made it increasingly difficult for developers to share confidential information. The resulting lack of meaningful test data makes testing of the DCA difficult at best.
One solution to this problem is to create databases of “synthetic” data having the same schema as the original (confidential) database, but having fake data values therein. While this may be useful in some circumstances, such fake data typically fails to appreciate the meaning of data values or capture otherwise complex semantic relationships between data values. As a result, testing based on synthetic data is seldom, if ever, as effective (e.g., in the sense of providing more complete testing coverage of the DCA and/or in finding bugs within the DCA) as testing with real-world data.
Another solution is for the developer to maintain so-called “clean rooms” that are physically secure, on-premises environments used to store the confidential databases and run the DCA under test. In this approach, personnel from the testing entity are brought to the developer's clean room to execute the necessary testing while being subjected to intense monitoring. This approach is cumbersome and often undermines the very efficiencies that motivated use of the testing entity in the first place.
A more common approach is to anonymize confidential databases in order to protect private information prior to providing them to a testing entity. An example of this is illustrated in FIG. 1. In particular, FIG. 1 illustrates a simple database 102 having a number of attributes (i.e., Name, Age, Zip Code, Nationality and Disease) and various records comprising specific data values in accordance with this schema. In general, each attribute in a schema may be classified as a confidential attribute (e.g., Disease), an identifier attribute (e.g., Name) or a quasi-identifier attribute (e.g., Age, Zip Code, Nationality). A confidential attribute encompasses data that is considered sensitive and not to be publicly associated with a particular person or entity, whereas an identifier attribute encompasses data that is sufficient by itself to identify a particular person or entity. A quasi-identifier attribute encompasses data that by itself is insufficient to identify a person or entity but that is otherwise useful for such identification when combined with other quasi-identifier attributes or external (typically publicly available) data. For example, in FIG. 1, if it is known that there is only one Palaun living in the 51000 zip code, exposure of the database 102 even with identifier attributes (e.g., Name) completely suppressed would allow one to infer that Ann Able is afflicted with dyspepsia.
In a typical embodiment, an anonymizer 104, capable of generalizing or suppressing data within the various records, operates upon the database 102 to provide a fully anonymized database. Techniques for anonymizing data, such as k-anonymity (in which identifiers/quasi-identifiers in each record are altered to ensure that each record is indistinguishable from at least k−1 other records) or L-diversity (in which effectiveness of external data is diminished by “distributing” sensitive data across groups of otherwise anonymized people/entities) are well known in the art. For example, as illustrated in FIG. 1, the database 102 has been subjected to k-anonymization where k=2. In this case, the identifier attribute (Name) has been suppressed entirely, whereas the quasi-identifier attributes (Age, Zip Code, Nationality) have been altered in some fashion to generalize or otherwise mask the quasi-identifier attribute data.
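By way of illustration only, the k-anonymity property described above may be sketched in Python as follows. The records, the attribute names, and the generalization rules (ten-year age bands, truncated zip codes, Nationality replaced by Human) are hypothetical assumptions chosen for this sketch; they are not the contents of FIG. 1 and do not represent the operation of any particular anonymizer.

```python
from collections import Counter

# Hypothetical attribute names following the schema described above.
QUASI_IDENTIFIERS = ("age", "zip_code", "nationality")


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in the records is shared by at least k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def generalize(record):
    """Illustrative generalization: ten-year age band, truncated zip
    code, and Nationality collapsed to a single value."""
    decade = record["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",
        "zip_code": record["zip_code"][:2] + "***",
        "nationality": "Human",
        "disease": record["disease"],
    }


# Hypothetical records with the identifier attribute (Name) already suppressed.
original = [
    {"age": 34, "zip_code": "51000", "nationality": "Palaun", "disease": "dyspepsia"},
    {"age": 36, "zip_code": "51010", "nationality": "American", "disease": "flu"},
    {"age": 47, "zip_code": "62000", "nationality": "Japanese", "disease": "asthma"},
    {"age": 49, "zip_code": "62010", "nationality": "American", "disease": "flu"},
]

anonymized = [generalize(record) for record in original]

print(is_k_anonymous(original, QUASI_IDENTIFIERS, 2))
print(is_k_anonymous(anonymized, QUASI_IDENTIFIERS, 2))
```

In this sketch, each original record has a unique combination of quasi-identifier values, so the check fails for k=2, whereas after generalization each record shares its quasi-identifier values with one other record.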
While such anonymization can ensure confidentiality of the data, it suffers from similar issues as synthetic data in that the meaning and relationships in the original data can be lost entirely, thereby significantly reducing the effectiveness of the anonymized data during testing. For example, and again with reference to FIG. 1, code that is designed to look for known data values in the Nationality attribute (e.g., American, Japanese, Palaun, etc.) is likely to generate an inordinate number of errors (or exceptions, as they are commonly called) due to the fact that this data has been generalized to Human, which is not a valid value for the code under test. Furthermore, the code that would normally be exercised during testing with the actual Nationality values would go untested as a result, thereby reducing the test coverage of the code.
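The effect described above may be illustrated with the following minimal Python sketch. The function names, the set of valid Nationality values, and the records are hypothetical and stand in for code under test that expects known Nationality values; the sketch does not represent any particular DCA.

```python
# Hypothetical set of values the code under test knows how to handle.
KNOWN_NATIONALITIES = {"American", "Japanese", "Palaun"}


def handle_record(record, covered):
    """Stand-in for DCA logic that branches on the Nationality value."""
    nationality = record["nationality"]
    if nationality not in KNOWN_NATIONALITIES:
        raise ValueError(f"unexpected nationality: {nationality!r}")
    # Record which value-specific branch was exercised.
    covered.add(nationality)


def run_test(records):
    """Run every record through the code under test and report the
    branches exercised and the number of exceptions raised."""
    covered, errors = set(), 0
    for record in records:
        try:
            handle_record(record, covered)
        except ValueError:
            errors += 1
    return covered, errors


# Hypothetical test data before and after generalization of Nationality.
original = [
    {"nationality": "American"},
    {"nationality": "Japanese"},
    {"nationality": "Palaun"},
]
anonymized = [{"nationality": "Human"} for _ in original]

print(run_test(original))
print(run_test(anonymized))
```

With the original values, every value-specific branch is exercised and no exception is raised. With the generalized values, every record raises an exception and none of the value-specific branches is exercised.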
A more sophisticated approach is selective anonymization, where quasi-identifiers are identified such that anonymization techniques are applied to at least some of the quasi-identifiers. By appropriately selecting the quasi-identifiers to be anonymized, balance may be achieved between the conflicting goals of ensuring confidentiality while retaining usefulness of the data for testing purposes. For this approach to be effective, however, detailed knowledge of how specific quasi-identifiers are used in a given DCA is needed in order to ensure maximum test coverage while simultaneously ensuring the desired level of confidentiality, which knowledge is not always readily attainable. To this end, different data anonymization approaches use different heuristics regarding how to select attributes as quasi-identifiers. For example, a popular heuristic for the well-known Datafly k-anonymity algorithm is to select attributes that have a large number of distinct values, whereas the Mondrian algorithm advocates selection of attributes with the biggest range of values. While useful, these heuristics fail to capture how DCAs actually use values of attributes in order to maintain test coverage.
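The two selection heuristics mentioned above may be sketched in Python as follows. This sketch covers only the attribute-selection step as characterized above; it is not an implementation of the Datafly or Mondrian algorithms, and the function names, the records, and the treatment of zip codes as integers are assumptions made for illustration.

```python
def most_distinct_attribute(records, attributes):
    """Select the attribute having the largest number of distinct values."""
    return max(attributes, key=lambda a: len({record[a] for record in records}))


def widest_range_attribute(records, numeric_attributes):
    """Select the numeric attribute whose values span the biggest range."""
    def width(attribute):
        values = [record[attribute] for record in records]
        return max(values) - min(values)

    return max(numeric_attributes, key=width)


# Hypothetical records; zip codes are stored as integers for simplicity.
records = [
    {"age": 34, "zip_code": 51000},
    {"age": 36, "zip_code": 51000},
    {"age": 47, "zip_code": 62000},
    {"age": 49, "zip_code": 62000},
]
attributes = ("age", "zip_code")

print(most_distinct_attribute(records, attributes))
print(widest_range_attribute(records, attributes))
```

For these hypothetical records the two heuristics select different attributes (age has four distinct values, whereas zip_code spans the wider numeric range), and neither selection takes into account how a DCA actually uses the values of the selected attribute.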
Thus, it would be advantageous to provide techniques that improve upon the current state of the art with regard to the anonymization of databases for testing of DCAs.