Field of the Invention
The present invention relates to test data generation for large data sets and, more particularly, to the generation of test data for database testing.
Description of the Related Art
A necessary step in the introduction of any new technology in a database system is to test its behavior across a wide range of operating conditions. This often involves selecting a set of test databases, generating representative query workloads, and executing these workloads on the test databases to evaluate the effect of the new technology. The importance of testing and benchmarking has long been recognized in the database community and there are several standard benchmarks developed for various settings.
While these standard benchmarks serve as useful reference points, there is often a need to generate test databases that satisfy certain properties such as table size, column domains, skew on columns, and correlation between columns. To this end, database developers traditionally generate synthetic data that satisfies the required properties in order to adequately test the integrity and functionality of a database.
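By way of illustration only, the following minimal sketch shows how synthetic rows might be generated to satisfy a chosen table size, column domain, skew, and correlation between two columns. The function name generate_rows, its parameters, and the Zipf-like weighting are assumptions made for this example and are not drawn from any particular benchmark or tool.

```python
import random


def generate_rows(row_count, domain_size, skew, seed=0):
    """Return row_count pairs (a, b) for a hypothetical two-column table.

    Column a takes values in 1..domain_size with a Zipf-like skew.
    Column b is derived from a, so the two columns are correlated.
    """
    rng = random.Random(seed)  # fixed seed keeps the test data reproducible
    domain = list(range(1, domain_size + 1))
    # Value k is drawn with weight 1 / k**skew; a larger skew favors small k.
    weights = [1.0 / (k ** skew) for k in domain]
    col_a = rng.choices(domain, weights=weights, k=row_count)
    # b = 10 * a + noise in 0..9, so b determines a exactly (b // 10 == a).
    col_b = [a * 10 + rng.randint(0, 9) for a in col_a]
    return list(zip(col_a, col_b))


rows = generate_rows(row_count=1000, domain_size=20, skew=1.5)
```

Fixing the seed is a design choice that makes a generated test database repeatable from run to run; a real generator would typically expose many more properties than the four sketched here.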
Of note, modern information systems work with extra-large data sets. Thus, despite the sophistication and expected integrity of a database application and the quality of a set of test data created by a developer to test the database application, the proper operation of the database application cannot be assured under real-life circumstances. To approximate real-life circumstances, testing with an extra-large data set is a best practice in testing a database prior to deployment. Yet, access to a data set large enough for use in testing all facets of a database application is not the norm. Rather, customarily, the data for the large data set must be generated in an automated fashion.
Test data generators perform just this function. Generally, a test data generator can be viewed as a utility that generates, at a minimum, raw data and, in more sophisticated implementations, raw data, tables, views, and procedures for database testing, performance testing, quality assurance testing, load testing, or usability testing. Integral to the generation of any test data set, however, is the creation of a fact table and a number of dimension tables. As is well known, a fact table in the field of data warehousing consists of the measurements, metrics, or facts of a business process. The fact table is often located at the center of a star schema or a snowflake schema, surrounded by dimension tables, and provides the additive values that act as independent variables by which dimensional attributes are analyzed. Dimension tables, in turn, contain attributes or fields used to constrain and group data when performing data warehousing queries.
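The relationship between a fact table and its dimension tables can be sketched, under stated assumptions, with a small in-memory star schema. The table names (fact_sales, dim_product, dim_store), the column names, and the sample rows below are hypothetical and serve only to show additive values in a fact table being grouped by dimension attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, region   TEXT);
    -- The fact table holds the additive measure (amount) plus one
    -- foreign key per surrounding dimension table.
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "toys"), (2, "books")])
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(1, "east"), (2, "west")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 1, 10.0), (2, 1, 2, 5.0),
                  (3, 2, 1, 7.5), (4, 2, 2, 2.5)])


def sales_by(dimension_table, key, attribute):
    """Sum the fact measure grouped by one attribute of a dimension table.

    The identifiers are trusted constants in this sketch; they are
    interpolated directly and must never come from user input.
    """
    query = (
        f"SELECT d.{attribute}, SUM(f.amount) "
        f"FROM fact_sales AS f JOIN {dimension_table} AS d "
        f"ON f.{key} = d.{key} "
        f"GROUP BY d.{attribute} ORDER BY d.{attribute}"
    )
    return conn.execute(query).fetchall()


by_category = sales_by("dim_product", "product_id", "category")
by_region = sales_by("dim_store", "store_id", "region")
```

Here the same fact rows are constrained and grouped in two different ways simply by joining to a different dimension table, which is the role dimension tables play in data warehousing queries.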
In generating data for the different columns of a fact table, random data is selected according to a sequence. In this regard, because the column or columns of the fact table forming a primary key into the fact table must be unique, the sequence used in auto-populating the record fields of those columns must not have a cardinality that is too small, since such a sequence repeats and so produces duplicate values. The cardinality of a sequence is the number of values in the sequence before the sequence repeats: for example, the sequence A, B, C, A, B, C, A, B, C has a cardinality of three, and the sequence X, Y, X, Y, X, Y has a cardinality of two. The same problem exists for the column or columns of the fact table forming a foreign key into a dimension table, and for the column or columns of the fact table used in a table join with other tables.
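The duplication problem described above can be illustrated with a short sketch. The helper names cardinality, populate_key_column, and has_duplicates are assumptions introduced for this example only; the sketch measures the cardinality of a sample sequence and shows that cycling a sequence whose cardinality is smaller than the number of rows yields duplicate key values.

```python
from itertools import cycle, islice


def cardinality(sequence):
    """Return the number of values in a finite sample before it repeats.

    This is the length of the shortest prefix that, when cycled,
    reproduces the whole sample.
    """
    n = len(sequence)
    for period in range(1, n + 1):
        if all(sequence[i] == sequence[i % period] for i in range(n)):
            return period
    return n


def populate_key_column(values, row_count):
    """Fill a column by cycling through values until row_count rows exist."""
    return list(islice(cycle(values), row_count))


def has_duplicates(column):
    """Report whether any value occurs more than once in the column."""
    return len(set(column)) < len(column)


# Cardinality three, nine rows: the key column repeats and is not unique.
too_small = populate_key_column(["A", "B", "C"], 9)
# Cardinality nine, nine rows: every key value is distinct.
large_enough = populate_key_column(range(9), 9)
```

In this sketch a primary key column stays unique only when the cardinality of the generating sequence is at least the number of rows to be populated.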