Typical software products employed by individuals or enterprises today are very complex, and bugs or performance issues are often encountered only after the software has been in use for some time. Once a bug is encountered, the user will report it to the provider of the software product, e.g. to the provider's support department, in order for the bug to be fixed.
In order to reproduce the encountered bug, the software provider often has to operate on the actual data that was processed by the software product when the bug occurred, because the issue to be reproduced depends heavily on the structure and validity of the user's data running through the buggy software product. However, users are often unwilling to send their data to the software provider's support department if the data is confidential and must not be seen by the software provider.
In the prior art, certain products are known which might be used in this context. For example, Oracle's so-called “Data Pump” enables a user to plug a self-written function into Oracle's database system to modify the data (Data Pump Data Remapping). Another example is the product ARTS business architect of applicant, which provides a report functionality (internal JavaScript functions based on a public API) to make the data of a user's ARTS installation anonymous. As a further example, the user might export his data from his software installation, for example as an XML file, which could then be transformed with the help of XSLT transformations. While the above approaches could be used by the user to obscure the actual data before it is sent to the software provider, i.e. to anonymize the confidential data, these approaches are not very flexible with respect to changing requirements and also involve a lot of effort, since the way in which the data should be modified must in some cases be hard-coded by the user in a self-written function and relies heavily on the user's database schema.
Furthermore, US 2011/0060905 A1 discloses systems and methods for providing anonymized user profile data. In this disclosure, confidential user data, such as names and addresses, are anonymized in order to be usable for personalized advertising. While the anonymized data might be helpful for tailoring advertisements to the user, it is obscured in such a way that it is not usable in the context of the present invention, since it does not allow bugs that occurred in a user's software product to be reproduced.
It is therefore the technical problem underlying the present invention to provide an approach for anonymizing data in such a manner that confidential parts thereof remain securely protected, while the anonymized data can still be investigated in a meaningful manner, thereby at least partly overcoming the above explained disadvantages of the prior art.
According to one aspect of the invention, this problem is solved by a computer-implemented method of anonymizing data of a database. In the embodiment of claim 1, the method comprises the following steps:
    a. exporting at least one data record from the database, wherein the data record has a structure and comprises content; and
    b. anonymizing at least part of the content to produce at least one anonymized data record;
    c. wherein the anonymized data record has the same structure as the data record read from the database.
Within the scope of the present invention, the term “anonymizing” is to be understood in the sense of converting a given piece of data into a form from which the original content of the data cannot be derived.
Accordingly, the above embodiment defines an approach for anonymizing data in a particularly intelligent manner, namely such that the data, although anonymized, can still be investigated and analyzed in a meaningful manner. This is because the method preserves the structure of the original data while anonymizing the data content.
As a simple example, consider a data record in the database whose structure defines two data fields: name and address. The content of the name field is “John Doe” and the content of the address field is “Elm Street”. The anonymizing process of the present invention produces an anonymized data record in which the name “John Doe” is anonymized e.g. to “ABC” and the address “Elm Street” is anonymized e.g. to “XYZ”. Nevertheless, the present invention preserves the structure of the original data record, i.e. it is still possible to identify that the anonymized data record comprises a name field and an address field. In this way, the present invention departs from known approaches, such as that disclosed in the above-cited US 2011/0060905 A1, in which a name/address tuple is anonymized into a single encrypted identifier, i.e. the structure of the original data is lost during the anonymizing process.
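The name/address example above can be sketched in a few lines of Python. This is a minimal illustration only, not the claimed implementation: the function name anonymize_record and the use of a truncated SHA-256 digest as the replacement value are assumptions made for the example.

```python
import hashlib


def anonymize_record(record: dict) -> dict:
    """Return a record with the same fields but unreadable content.

    Hypothetical helper: a truncated SHA-256 digest stands in for
    whatever anonymizing function an embodiment actually uses.
    """
    return {
        field: hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]
        for field, value in record.items()
    }


original = {"name": "John Doe", "address": "Elm Street"}
anonymized = anonymize_record(original)

# The field names (the structure) survive; the content does not.
print(list(anonymized))
print(anonymized)
```

A recipient of the anonymized record can still see that it has a name field and an address field, which is the property that distinguishes this approach from collapsing the tuple into a single identifier.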
In one aspect of the present invention, the step of anonymizing is performed during the step of exporting, so that no confidential content is stored outside of the database during the exporting process. Accordingly, the anonymizing functionality is encapsulated within the export functionality, which has two advantages. First, the anonymizing algorithm cannot be changed or manipulated from the outside. Second, the confidential data does not leave the database in such a manner that it would be (persistently or temporarily) stored outside of the database.
Accordingly, this aspect provides a particularly high degree of security and data confidentiality.
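One way to picture anonymizing during the export is a generator that masks each row as it is read, so that only anonymized values are ever handed to the caller. The sketch below uses Python's sqlite3 module with an in-memory database purely for illustration. The names export_anonymized and _mask, the table layout and the masking function are all assumptions, and a real embodiment would run this logic inside the database's own export component.

```python
import hashlib
import sqlite3
from typing import Iterator


def _mask(value) -> str:
    # Hypothetical masking function; any one-way transformation would do.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def export_anonymized(conn: sqlite3.Connection, table: str) -> Iterator[dict]:
    """Yield anonymized records one at a time while reading from the database.

    The table name is interpolated directly, so it must come from trusted code.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        # The clear-text row is masked here, before anything is yielded.
        yield {column: _mask(value) for column, value in zip(columns, row)}


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, address TEXT)")
conn.execute("INSERT INTO customers VALUES ('John Doe', 'Elm Street')")

exported = list(export_anonymized(conn, "customers"))
print(exported)
```

Because masking happens inside the export function, the caller has no hook for swapping in a different algorithm and never receives an intermediate clear-text copy.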
In another aspect of the invention, the step of anonymizing may comprise generating a random encryption key, anonymizing at least part of the content to produce at least one anonymized data record using the random encryption key, and deleting the random encryption key. Accordingly, the means for anonymizing the content (the encryption key) is generated exclusively for each particular run of the anonymizing process and destroyed immediately afterwards. This ensures that the anonymized data cannot be decrypted in order to derive the original data.
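The generate, use and delete life cycle of the key might look as follows. This sketch substitutes a keyed hash (HMAC-SHA-256 from Python's standard library) for the unspecified encryption scheme, and the function name anonymize_run is hypothetical. Overwriting the key buffer is only a best-effort deletion, since Python gives no guarantee that no other copy of the key bytes remains in memory.

```python
import hashlib
import hmac
import secrets


def anonymize_run(records: list) -> list:
    """Anonymize all records of one run with a key that exists only for this run."""
    key = bytearray(secrets.token_bytes(32))  # fresh random key per run
    try:
        return [
            {
                field: hmac.new(
                    key, str(value).encode("utf-8"), hashlib.sha256
                ).hexdigest()
                for field, value in record.items()
            }
            for record in records
        ]
    finally:
        # Best-effort deletion: overwrite the key bytes. The key is never
        # returned or stored anywhere else by this function.
        for i in range(len(key)):
            key[i] = 0


records = [{"name": "John Doe", "address": "Elm Street"}]
first_run = anonymize_run(records)
second_run = anonymize_run(records)

# Different runs use different keys, so their outputs do not match.
print(first_run[0]["name"] == second_run[0]["name"])
```

Within one run the same key is applied to every value, so equal inputs still map to equal outputs; across runs the outputs cannot be correlated because the earlier key no longer exists.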
Moreover, the method may comprise the further step of selecting which part of the content is to be anonymized, wherein the step of anonymizing comprises anonymizing only the selected content. Accordingly, not all of the content of a given data record is necessarily anonymized, but the part to be anonymized may be selected (e.g. by a user). In particular if the data record comprises a mix of confidential and uncritical content, this aspect improves the performance of the anonymizing process, since only the content that actually requires protection needs to be anonymized. This in turn saves processing resources of the underlying system executing the anonymizing process.
Preferably, the step of anonymizing is performed in a deterministic manner, so that the anonymizing of a given part of the content always results in the same anonymized content. This is an important characteristic of some embodiments of the present invention and ensures that relationships between the data fields of the data records are preserved during the anonymizing process, as will be explained in more detail in the detailed description. To achieve the above-described deterministic behavior, the step of anonymizing may be performed using a cryptographic hash function, preferably the Secure Hash Algorithm (SHA). Alternatively or additionally, the step of anonymizing may be performed using a random anonymizing process and using a cache to remember already created anonymized content, which will be explained in more detail further below.
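Both routes to deterministic behavior named above can be sketched side by side. The example assumes SHA-256 as the concrete member of the SHA family and uses a plain dictionary as the cache; the names anonymize_sha and CachedRandomAnonymizer are hypothetical. Note that an unkeyed hash of low-entropy values such as names can be guessed by hashing candidate values, so a practical embodiment would likely combine it with a secret.

```python
import hashlib
import secrets


def anonymize_sha(value: str) -> str:
    """Deterministic by construction: equal input gives equal digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


class CachedRandomAnonymizer:
    """Random replacement values made repeatable by remembering them."""

    def __init__(self) -> None:
        self._cache = {}

    def anonymize(self, value: str) -> str:
        if value not in self._cache:
            # First occurrence: draw a random token and remember it.
            self._cache[value] = secrets.token_hex(8)
        return self._cache[value]


# Two records that refer to the same customer.
orders = [
    {"order": "A-1", "customer": "John Doe"},
    {"order": "A-2", "customer": "John Doe"},
]

cached = CachedRandomAnonymizer()
by_hash = [anonymize_sha(order["customer"]) for order in orders]
by_cache = [cached.anonymize(order["customer"]) for order in orders]

# In both variants the two orders still point at the same (anonymized) customer.
print(by_hash[0] == by_hash[1], by_cache[0] == by_cache[1])
```

Either way, a relationship such as "these two orders belong to the same customer" remains visible in the anonymized data, which is what makes the data useful for reproducing a bug.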
According to yet another aspect of the invention, the content to be anonymized adheres to at least one data type, and the step of anonymizing preserves the validity of the anonymized content in accordance with the at least one data type. For example, if a data field of the original data record stores email addresses, it is ensured that the anonymized data record, in which the content of the email address is anonymized, still indicates that the anonymized content relates to an email address.
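For the email case, preserving type validity could mean anonymizing the local part and the domain separately and reassembling them around the "@" sign. The following is a sketch under assumptions: the function name anonymize_email, the truncated SHA-256 digests and the reserved ".example" top-level domain are choices made for illustration, and the regular expression checks only the rough shape of an address, not full standards compliance.

```python
import hashlib
import re

# Rough shape check only: something@something.something
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def anonymize_email(email: str) -> str:
    """Replace both parts of an address while keeping it shaped like an address."""
    local, _, domain = email.partition("@")
    masked_local = hashlib.sha256(local.encode("utf-8")).hexdigest()[:10]
    masked_domain = hashlib.sha256(domain.encode("utf-8")).hexdigest()[:10]
    # ".example" is a reserved top-level domain, so the result cannot
    # collide with a real mailbox.
    return f"{masked_local}@{masked_domain}.example"


masked = anonymize_email("john.doe@corp.example.com")
print(masked)
print(bool(EMAIL_SHAPE.match(masked)))
```

Software that validates the field as an email address would still accept the anonymized value, so a bug that depends on that validation path stays reproducible.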
The step of anonymizing may be performed using one or more predefined transformation rules, which might be provided in the form of code annotations and/or in the form of a configuration file, in particular an XML file (see the detailed description below).
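A configuration-driven variant might read its transformation rules from XML and apply them field by field. The XML layout shown here (an anonymization element containing rule elements with field and transform attributes) is invented for this sketch and is not the schema of the detailed description; the transform names and helper functions are likewise hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical rule file: which field gets which transformation.
RULES_XML = """
<anonymization>
  <rule field="name" transform="hash"/>
  <rule field="email" transform="email"/>
</anonymization>
"""


def _hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def _email(value: str) -> str:
    local, _, domain = value.partition("@")
    return f"{_hash(local)}@{_hash(domain)}.example"


# Registry of available transformations (names are assumptions).
TRANSFORMS = {"hash": _hash, "email": _email}


def load_rules(xml_text: str) -> dict:
    """Map each configured field name to its transformation function."""
    root = ET.fromstring(xml_text)
    return {
        rule.get("field"): TRANSFORMS[rule.get("transform")]
        for rule in root.findall("rule")
    }


def apply_rules(record: dict, rules: dict) -> dict:
    """Anonymize only the fields that have a rule; copy the rest unchanged."""
    return {
        field: rules[field](value) if field in rules else value
        for field, value in record.items()
    }


rules = load_rules(RULES_XML)
record = {"name": "John Doe", "email": "john.doe@corp.example.com", "country": "DE"}
result = apply_rules(record, rules)
print(result)
```

Keeping the rules in a file means that a change in requirements, such as a newly confidential field, is handled by editing the configuration rather than rewriting code.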
The above aspects of the present invention are particularly advantageous if the at least one data record comprises confidential data, wherein the corresponding anonymized data record is usable for being investigated while preserving the confidentiality of the confidential data. As will be explained in more detail further below, the concepts of the present invention may in this way be used e.g. for a software product provider to analyze and investigate bugs in the software product without being able to see the actual (confidential) data of the user.
The present invention also relates to a system for anonymizing data of a database, wherein the system comprises an exporter component, adapted for exporting at least one data record from the database, wherein the data record has a structure and comprises content, and an anonymizer component, adapted for anonymizing at least part of the content to produce at least one anonymized data record, wherein the anonymized data record has the same structure as the data record read from the database. Further advantageous modifications of embodiments of the system of the invention are defined in further dependent claims. Lastly, the present invention might also be provided in the form of a computer program comprising instructions for implementing any of the methods disclosed herein.