As computing and communications technologies have advanced, marketplaces have globalized and business and governmental enterprises alike have expanded. The geographic limitations on sharing data and information across large distances are becoming a distant memory. Moreover, with these technological advances, computing applications that were once slated for use on stand-alone machines are being deployed across what are often large computing environment networks and platforms. As more data and computing applications become shared, there arises a need to monitor and control the systems that house data to ensure that they are functioning properly and to protect against unwanted downtime, which could translate into lost revenues.
An enterprise's computing environment might contain hundreds of server computers and, possibly, thousands of client computers, all in communication to share applications and application data. Such a computing environment might also support vast data stores for storing application data. Today's data stores, or databases, are designed to operate on a single stand-alone machine or among several computing machines (e.g., computer database servers) and cooperate with the computing environment to accept data for storage and/or to provide data to a requesting application. Given the importance of an enterprise's data, significant efforts have been and are being made to ensure that the database and the applications that use it operate in an optimal manner. One approach is to test the database and/or applications using a pre-defined benchmark. The benchmark, among other things, measures the capacity and operational efficiency of the database and the corresponding, cooperating computing applications. Benchmarking a system having a database and application(s) may require the ability to generate repeatable synthetic data to populate the database prior to the test and then to selectively regenerate that data during benchmark testing.
Currently, there exist a number of techniques for generating synthetic data for use in benchmarking and in other testing activities such as quality assurance testing. One approach is to employ a random number generator function to produce random numbers, letters, and/or strings from which a set of data is generated. By comparison, a deterministic generator function is one that generates the identical set of data each time it is executed with a given input.
The drawback with existing practices, however, is that for a data set having N elements, the entire data set must be generated each time a single data set element requires regeneration. For example, it might be the case that a data set of ten million customer names is generated according to current practices. Each customer name consists of a first name and a last name. A deterministic generator function may be used to generate the ten million names by randomly picking a first name and a last name from lists of first and last names. To regenerate the name of the one millionth customer (or the four millionth customer, etc.), current practices require the regeneration of all prior names (i.e., names 1 through 999,999) so that the random number generator is positioned at exactly the same point in its sequence. Such a practice is extremely inefficient, to the point of being impractical.
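The regeneration cost described above can be illustrated with a minimal sketch. It assumes a seeded pseudo-random generator from Python's standard library stands in for the deterministic generator function; the name lists, the seed, and the function names `generate_names` and `regenerate_name` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical name lists; a real data set would draw from much larger lists.
FIRST_NAMES = ["Ana", "Ben", "Chloe", "Dev"]
LAST_NAMES = ["Ito", "Jones", "Khan", "Lopez"]


def generate_names(seed, count):
    """Generate `count` names; the same seed always yields the same list."""
    rng = random.Random(seed)
    return [f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
            for _ in range(count)]


def regenerate_name(seed, index):
    """Regenerate the name at 1-based `index` by replaying the generator.

    Every earlier draw must be repeated (and discarded) so that the
    generator reaches the same point in its sequence, which makes the
    cost of regenerating one element proportional to its position.
    """
    rng = random.Random(seed)
    for _ in range(index - 1):
        rng.choice(FIRST_NAMES)  # discarded first-name draw
        rng.choice(LAST_NAMES)   # discarded last-name draw
    return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
```

In this sketch, regenerating the one millionth name performs 999,999 discarded pairs of draws before the desired name is produced, which mirrors the inefficiency noted above.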
Consequently, current practices rely on the generation of synthetic data that follows highly predictable patterns. This results in data with regular, observable patterns, which compromises the realism of the test. For example, in the TPC-C benchmark (an example of current practices), last names are generated by concatenating three syllables chosen from a set of ten. The ten are “BAR”, “OUGHT”, “ABLE”, “PRI”, “PRES”, “ESE”, “ANTI”, “CALLY”, “ATION”, and “EING”. With ten syllables used three times, there are 1000 unique combinations, which can be mapped to the values 000 to 999. For example, 000 maps to “BARBARBAR”, and 321 maps to “PRIABLEOUGHT”. Such a method provides data that is easily reproduced but is very unrealistic.
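The syllable mapping described above can be sketched as follows. The function name `last_name` is hypothetical; the syllable order and the two sample mappings are taken from the description above, with each decimal digit of the value selecting one syllable.

```python
# The ten syllables, indexed by the decimal digits 0 through 9.
SYLLABLES = ["BAR", "OUGHT", "ABLE", "PRI", "PRES",
             "ESE", "ANTI", "CALLY", "ATION", "EING"]


def last_name(value):
    """Map a value from 0 to 999 to a three-syllable last name."""
    if not 0 <= value <= 999:
        raise ValueError("value must be between 0 and 999")
    digits = f"{value:03d}"  # zero-pad so that 7 becomes "007"
    return "".join(SYLLABLES[int(d)] for d in digits)


print(last_name(0))    # BARBARBAR
print(last_name(321))  # PRIABLEOUGHT
```

Because any name can be computed directly from its value, regeneration is trivial, but every name is visibly assembled from the same ten fragments.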
From the foregoing, it is appreciated that there exists a need for systems and methods that overcome the shortcomings of the prior art.