1. Field
The present disclosure relates to data processing, particularly to a database, and more particularly to a system and method for generating a test workload for a database.
2. Description of the Related Art
Currently, a huge number of applications run on data stored in database management systems. With the increasing complexity of software, ensuring the high quality of these applications has become a critical issue. It is not just the applications that are becoming more complex, but also the database management systems themselves. These database management systems often store mission-critical data that is accessed and updated by many different applications and thousands of users. An outage of the database management system often has serious consequences for the core business: it can cost a company millions of dollars and lead to a loss of trust among customers and business partners, as well as legal implications. In the worst case, it can even put a company out of business. It is therefore essential and mission-critical to provide sufficient database testing methodologies. Easy access to such database testing methodologies is critical not only for companies that are database end-users, but also for application middleware vendors and the database vendors themselves.
A critical point of database-centric testing is the availability of “real-world” workloads, since the complexity of today's applications and environments makes it extremely difficult to artificially generate data and access patterns that sufficiently represent a production system.
There is also an increasing need to make such real-world workloads and the corresponding testing tools more available to third parties. For example, a company might outsource its application database testing to a service provider, an arrangement also known as “Test as a Service”. This raises data confidentiality and security concerns. A company is most likely unwilling to expose its core business data to non-trusted parties. Even inside the company, data access is heavily restricted: people who develop and test applications usually do not have access to the real business data.
Several solutions for collecting and/or generating workloads already exist today, but they are limited in their usage, either due to data confidentiality issues or because of insufficient modeling of the real-world environment. In addition, many of those solutions lack the ability to correlate database data and statements, which disqualifies them for a real end-to-end testing scenario.
Application-driven testing requires re-creating the production system, which is usually complex, in a test environment. This approach requires a huge investment of time and resources to rebuild the environment. Such a solution is not very portable, since production environments are unique and hard to rebuild. In addition, test data generation can be an issue, since the data may still contain confidential information if derived from the production system. Finally, the simulation of the real-world workload is a problem, since appropriate application drivers need to be hand-crafted.
Another approach is the so-called “Capture and Replay”. Here, the communication between the application and the database is intercepted and recorded. Later on, the recorded statement flow can be replayed on a database image taken when the recording started. However, this solution requires that the data store used for the replay be in exactly the same state (including all the data) as during the capture phase. This limits the portability and availability of the solution, both because of confidentiality issues and because of the “locked-in” environment, which makes it hard to test migration and scale-up/down scenarios. It is also difficult to deviate from the recorded flow. In addition, such capture and replay solutions will not work on artificially generated data.
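The dependence of replay on the captured database state can be illustrated with a minimal Python sketch. It is not taken from any particular capture and replay product: the `RecordingConnection` wrapper, the `replay` and `make_db` helpers, and the `orders` table are hypothetical names introduced only for illustration, and an in-memory SQLite database stands in for the production system.

```python
import sqlite3


class RecordingConnection:
    """Hypothetical wrapper that records each statement sent to the database."""

    def __init__(self, conn):
        self.conn = conn
        self.log = []  # recorded statement flow: (sql, params) pairs

    def execute(self, sql, params=()):
        self.log.append((sql, tuple(params)))
        return self.conn.execute(sql, params)


def replay(log, conn):
    """Re-issue the recorded statements in order and collect their results."""
    return [conn.execute(sql, params).fetchall() for sql, params in log]


def make_db(rows):
    """Build an in-memory database from an 'image' (a list of rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


# Capture phase: the image is taken when recording starts.
image = [(1, 10), (2, 20)]
prod = RecordingConnection(make_db(image))
captured = [
    prod.execute("UPDATE orders SET amount = amount + 5 WHERE id = ?", (2,)).fetchall(),
    prod.execute("SELECT amount FROM orders WHERE id = ?", (2,)).fetchall(),
]

# Replay on the same image reproduces the captured results.
same_state = replay(prod.log, make_db(image))

# Replay on different (for example, artificially generated) data does not:
# the recorded keys no longer match any rows.
other_state = replay(prod.log, make_db([(7, 70)]))
```

In this sketch the replay against the original image returns the same results as the capture phase, while the replay against unrelated data returns empty results, because the recorded statements carry literal key values that exist only in the captured state.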
Currently, there are several different approaches to creating database test data, as follows:
1) Random test data generation. The generated data usually makes no sense. As a result, it is almost impossible to design a good test workload based on the data to achieve test goals, from either a functional or a performance perspective.
2) Data masking. Many products offer data masking functions. However, data masking is usually used only within a customer site, because with a large number of database objects it is not practical to ensure that all private information is protected, and it is not considered safe to share the masked data with third parties outside the customer's security and privacy boundaries. In addition, it requires time and resources to learn and work with the masking tool to generate the appropriate masks. Moreover, data masking may break the ordering of the data, so that some queries in the original workload cannot be used for testing directly.
3) Profiling and populating script. Testers write their own data generation scripts based on predefined test strategies. For example, testers generate a profile recording table sizes or the number of distinct values per column by running profiling scripts, and then insert data into the target system according to the profile. This approach requires testers to understand the application very well and to spend a significant amount of time on the test data generation project.
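The profiling and populating approach in item 3) can be sketched as follows. This is a minimal illustration under stated assumptions rather than an actual tester's script: the `profile_table` and `populate` functions, the `customers` table, and the choice of synthetic integer values are all hypothetical, and SQLite is used only so that the example is self-contained.

```python
import sqlite3


def profile_table(conn, table, columns):
    """Record the table size and the number of distinct values per column."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = {
        col: conn.execute(f"SELECT COUNT(DISTINCT {col}) FROM {table}").fetchone()[0]
        for col in columns
    }
    return {"table": table, "rows": rows, "distinct": distinct}


def populate(conn, profile):
    """Create the table in the target and fill it to match the profile.

    Assumption: every column is filled with synthetic integers, so only the
    row count and distinct-value counts are preserved, not types or meaning.
    """
    table = profile["table"]
    cols = list(profile["distinct"])
    col_defs = ", ".join(f"{c} INTEGER" for c in cols)
    conn.execute(f"CREATE TABLE {table} ({col_defs})")
    placeholders = ", ".join("?" for _ in cols)
    for i in range(profile["rows"]):
        # Cycle through n values per column so exactly n distinct values appear;
        # a column with no distinct non-NULL values is filled with NULL.
        row = [
            (i % profile["distinct"][c]) if profile["distinct"][c] else None
            for c in cols
        ]
        conn.execute(f"INSERT INTO {table} VALUES ({placeholders})", row)


# Source system (stands in for production).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
source.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "north"), (2, "south"), (3, "east"), (4, "north"), (5, "south"), (6, "east")],
)

profile = profile_table(source, "customers", ["id", "region"])

# Target system: populated from the profile only, without the source data.
target = sqlite3.connect(":memory:")
populate(target, profile)
target_profile = profile_table(target, "customers", ["id", "region"])
```

The target ends up with the same row count and distinct-value counts as the source, but even this small sketch needs table-specific knowledge (which columns to profile, what values are meaningful), which hints at the effort the approach demands for a real application.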