For testing extract, transform and load (ETL) processes or benchmarking new products such as Structured Query Language (SQL) engines and ETL tools, the availability of simulated high-volume data is critical for thorough functional and performance testing. SQL engines need to be tested under various conditions involving high-volume data that is representative of a production query workload.
In many cases, production data contains sensitive data, namely personally identifiable information (PII). PII generally represents information that can be used to identify a customer, and may include name, address, email address, phone number, social security number, account information, etc. Because of security risks, entities are required to protect PII. For example, PII is available only to individuals with access, and only on a limited basis, e.g., need to know. Developers tasked with data analytics in big data environments are generally not part of the group with access to customer PII.
Most current tools provide functionality to mask sensitive data with blanks or a predefined pattern (e.g., XXXXXX) before the data is copied into lower test environments. However, masking sensitive data results in non-testable scenarios because the masked data attributes cannot be tested in those environments. In addition, masking data attributes destroys meaningful relationships between data entities, so those relationships cannot be tested either.
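The loss of relationships under masking can be illustrated with a minimal sketch. It assumes two hypothetical record sets, customers and orders, that are related through an email attribute, and a hypothetical mask function that overwrites the attribute with the fixed pattern XXXXXX. None of these names come from any particular masking tool; they are chosen only for illustration.

```python
# Illustrative sketch only: the record layouts, field names, and the
# mask() helper are hypothetical and not taken from any specific tool.

MASK_PATTERN = "XXXXXX"


def mask(_value: str) -> str:
    """Replace a sensitive value with a fixed, predefined pattern."""
    return MASK_PATTERN


def join_on_email(customers, orders):
    """Pair each order with every customer sharing its email value."""
    by_email = {}
    for customer in customers:
        by_email.setdefault(customer["email"], []).append(customer["customer_id"])
    return [
        (order["order_id"], customer_id)
        for order in orders
        for customer_id in by_email.get(order["email"], [])
    ]


customers = [
    {"customer_id": 1, "email": "ann@example.com"},
    {"customer_id": 2, "email": "bob@example.com"},
]
orders = [
    {"order_id": 100, "email": "ann@example.com"},
    {"order_id": 101, "email": "bob@example.com"},
]

# Before masking, each order resolves to exactly one customer.
original_pairs = join_on_email(customers, orders)

# After masking, every email is identical, so the join can no longer
# tell customers apart and each order matches every customer.
masked_customers = [{**c, "email": mask(c["email"])} for c in customers]
masked_orders = [{**o, "email": mask(o["email"])} for o in orders]
masked_pairs = join_on_email(masked_customers, masked_orders)

print(original_pairs)
print(masked_pairs)
```

In this sketch the unmasked join yields one pair per order, while the masked join yields every order paired with every customer, so a test that depends on the customer-to-order relationship has nothing meaningful to verify.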
Accordingly, masked production data is oftentimes unusable for testing purposes. The availability of high-quality, high-volume test data has been a real challenge for most development teams, and its absence slows down development and quality assurance processes.
Because of the lack of high-volume, high-quality test data, developers are unable to perform stress testing in a pre-production environment. This creates a risk in production environments because the performance of new releases is unknown.
These and other drawbacks exist.