The invention pertains to digital data processing and, more particularly, to methods and apparatus for the generation of test data. The invention has application, by way of non-limiting example, in test harnesses, unit test models, integration testing, black/white box testing, data modeling, and so forth.
Test data generation is the process of generating random or specific data used to test the validity of software business logic. In general, during a testing procedure, the generated data is consumed by code that contains the specific business logic to be tested. Using the generated test data, the code containing the business logic is expected to produce a predictable result. Once the code containing the business logic has completed its instructions using the generated test data, one or more assertions are made and compared against the predictable result. If the assertions return true, the business logic code is deemed correct. If the assertions return false, the business logic code needs correcting.
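The generate-then-assert cycle described above can be sketched in a few lines. The business rule below (a discount calculation) and all of its names are illustrative assumptions chosen for this example, not part of the original text:

```python
# Minimal sketch of the generate -> execute -> assert cycle.
# apply_discount is a hypothetical piece of business logic under test.

def apply_discount(order_total):
    """Business logic under test: 10% discount on orders of 100 or more."""
    if order_total >= 100:
        return round(order_total * 0.9, 2)
    return order_total

# Generated test data, each input paired with the predictable result
# the business logic is expected to produce for it.
test_cases = [
    (50.00, 50.00),    # below threshold: no discount expected
    (100.00, 90.00),   # at threshold: discount expected
    (200.00, 180.00),  # above threshold: discount expected
]

for generated_input, predictable_result in test_cases:
    actual = apply_discount(generated_input)
    # If every assertion holds, the business logic is deemed correct;
    # if any assertion fails, the business logic needs correcting.
    assert actual == predictable_result, (generated_input, actual)
```

The key property is that each generated input is paired in advance with the result it should produce, so correctness reduces to a true/false assertion.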
There are three primary categories of testing from which most other categories of testing derive: unit, integration, and functional. All three primary categories require test data in order to assert that the business logic of the code is producing a predictable result.
Unit testing generally tests very small, specific units of code and normally requires that small sets of test data be generated in order to test assertions on the specific unit of code. However, the specific unit of code may be dependent on other units of code having test data generated and executed prior to testing the given unit of code.
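A unit test over a small generated data set might look like the following sketch. The unit under test, `parse_quantity`, and its validation rules are hypothetical names chosen for illustration:

```python
import unittest

def parse_quantity(text):
    """Small, specific unit under test: parse a quantity, rejecting negatives."""
    value = int(text)
    if value < 0:
        raise ValueError("quantity cannot be negative")
    return value

class ParseQuantityTest(unittest.TestCase):
    def test_valid_quantities(self):
        # A small generated data set is enough for this one unit of code.
        for text, expected in [("0", 0), ("7", 7), ("42", 42)]:
            self.assertEqual(parse_quantity(text), expected)

    def test_negative_quantity_rejected(self):
        # Negative input is expected to raise, and the test asserts that it does.
        with self.assertRaises(ValueError):
            parse_quantity("-1")

if __name__ == "__main__":
    unittest.main()
```

Note how little test data the unit requires; the difficulty the text describes arises when this unit depends on other units whose test data must be generated and executed first.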
Integration testing generally tests larger units of code that are integrated together to produce a predictable data result from combined business logic. Integration tests generally require larger and more complex test data to be generated in order to assert that a combined set of business logic is producing predictable data results.
Functional testing generally tests how a specific portion of a software application interface behaves given a specific set of data. Functional tests may require small to large sets of test data in order to assert that a specific portion of the application interface is behaving as predicted.
In general, test data generation for unit, integration, and functional testing is done in one of three primary ways: manually, programmatically, or via data pruning.
Manually generated test data is created by entering test data into a file by hand via a computer keyboard, typically into a CSV file or an Excel spreadsheet. Manually entering test data is, in general, inefficient, prone to human error, time-consuming, and limited in the amount and complexity of test data that can be created, thus limiting the amount and complexity of code that can be effectively tested.
Programmatically generated test data is created by executing a software program written to produce a random or specific set of test data. Programs written to produce random or specific sets of test data are only marginally better at producing test data than manual entry. While a given test data generation program may produce larger sets of test data, writing and maintaining code to generate a specific set of test data, while less error-prone, is in general just as time-consuming and limited in complexity as producing test data manually.
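A typical program of this kind might look like the sketch below: it emits a random but repeatable set of records. The record layout, field names, and value ranges are illustrative assumptions; the maintenance burden the text describes comes from keeping such code in step with the schema it must mimic:

```python
import random
import string

def generate_customers(count, seed=0):
    """Programmatically generate a random set of customer test records."""
    rng = random.Random(seed)  # fixed seed keeps the "random" data repeatable
    records = []
    for i in range(count):
        records.append({
            "id": i + 1,
            "name": "".join(rng.choices(string.ascii_lowercase, k=8)),
            "balance": round(rng.uniform(0, 1000), 2),
        })
    return records

# Larger sets than manual entry allows, but every new field or rule
# in the schema means another change to this generator.
rows = generate_customers(100)
```

Seeding the generator is what makes random output usable for assertions: the same seed reproduces the same data set on every run.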
Data pruning is the process of generating test data by pruning a subset of data from a production data set. Production data is the data produced by live software applications in a production environment. There are many challenges that arise from pruning a production data set, including:
- Confidential data (e.g., social security numbers, credit card numbers) must be encrypted, modified, replaced, or removed from the data set. Even when software programs are used to prune confidential data, the process is often time-consuming and may not catch and prune all confidential data from a given data set.
- Negative or conditional testing with production data is very difficult at best to accomplish, because production data, in general, is not predictable.
- Code and business logic may not be testable before said code is introduced into the production environment, because production data has not yet been produced to test the given business logic.
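The first challenge above, scrubbing confidential fields from pruned records, can be sketched as follows. The record layout, field names, and patterns are illustrative assumptions; as the text notes, a real pruner must handle far more cases and may still miss some:

```python
import re

# Pattern for one kind of confidential value (a US social security number).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def prune_record(record):
    """Return a copy of a production record with confidential fields scrubbed."""
    pruned = dict(record)
    pruned.pop("credit_card_number", None)  # remove this field outright
    if "ssn" in pruned:
        # Replace rather than remove, preserving the record's shape.
        pruned["ssn"] = SSN_PATTERN.sub("XXX-XX-XXXX", pruned["ssn"])
    return pruned

production_row = {
    "name": "A. Customer",
    "ssn": "123-45-6789",
    "credit_card_number": "4111111111111111",
}
test_row = prune_record(production_row)
```

Note that even a correct scrubber does nothing for the other two challenges: the surviving data remains unpredictable, so negative and conditional assertions are still hard to write against it.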
In summary and in general, manually generated test data, programmatic generation of specific test data, and pruning of production data to yield clean test data all have limited, finite capacity to produce and maintain complex test data. These limitations directly and adversely affect the usefulness and validity of unit, integration, and functional testing, and of other derivations of these testing paradigms.