Computerized devices and systems control almost every aspect of our lives, both as individuals and as a society. Many such systems gather or use significant amounts of data about products, processes, individuals, and other entities. The data is typically organized to model relevant aspects of reality, in a manner that supports the processes requiring this information. The data is often stored in the form of a database, wherein the term database may refer to the way users view the data collection, or to the logical and physical materialization of the data in files, computer memory, or computerized storage.
In some situations, a deadlock may be faced, wherein the development, and particularly the testing and proofing, of an application requires the existence of sufficient data, without which certain functionalities cannot be tested. However, generating the data required for testing, and populating a database with such data, may require the existence of the application itself. Furthermore, the data contents, structure and requirements may be non-final and may evolve throughout the development of the application.
Some methods provide for generating data for testing an application. One method relates to manually fabricating data. However, such an operation may require significant manual labor and may thus be inefficient and infeasible for obtaining a large corpus of data. Furthermore, fabricated data may be non-realistic, inconsistent or meaningless, or at least may have distributions or other properties which are significantly different from those of real-life data based on real scenarios and populations.
In some cases, data may exist but may be inaccessible to an application developer, due to laws, privacy protection regulations, or other limitations such as organizational policy. For example, sensitive health or financial data, even if it exists, may be restricted such that it cannot be shared with application developers or QA staff members, whether such personnel belong to the organization maintaining the data or are external to the organization.
If data exists but is inaccessible due to privacy limitations, using masking or scrambling to hide sensitive details may not always suffice. For example, data may be exposed when transferred to another location, or some sensitive data may leak due to mistakes, bugs or malicious actions. In other cases, if the total volume of the data that is available is relatively small, masking some identifying details may not be enough to conceal the identity of subjects or other entities.
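The limitation of masking on a small data set can be illustrated with a minimal sketch. The records, field names, and the choice of quasi-identifiers below are all hypothetical and invented purely for illustration; the sketch masks the name field and then counts how many records remain unique on the remaining attributes, and are therefore still potentially attributable to a single subject.

```python
from collections import Counter

# Hypothetical toy records; every value here is invented for illustration.
RECORDS = [
    {"name": "Alice", "zip": "10001", "birth_year": 1980, "diagnosis": "A"},
    {"name": "Bob", "zip": "10001", "birth_year": 1980, "diagnosis": "B"},
    {"name": "Carol", "zip": "10002", "birth_year": 1975, "diagnosis": "C"},
    {"name": "Dave", "zip": "10003", "birth_year": 1990, "diagnosis": "A"},
]

# Assumed quasi-identifiers: fields that are not masked but may still
# single out a subject when combined.
QUASI_IDENTIFIERS = ("zip", "birth_year")


def mask(record):
    """Return a copy of the record with the directly identifying field hidden."""
    masked = dict(record)
    masked["name"] = "***"
    return masked


def unique_after_masking(records, quasi_identifiers=QUASI_IDENTIFIERS):
    """Return masked records whose quasi-identifier combination occurs only once."""
    masked = [mask(r) for r in records]
    keys = [tuple(m[q] for q in quasi_identifiers) for m in masked]
    counts = Counter(keys)
    return [m for m, k in zip(masked, keys) if counts[k] == 1]


exposed = unique_after_masking(RECORDS)
print(len(exposed), "of", len(RECORDS), "masked records remain unique")
```

In this toy data set, two of the four masked records share no quasi-identifier combination with any other record, so hiding the name alone does not conceal which subject they describe.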
Other data generation methods may relate to automatic generation of constraint-based random data. However, such methods may be infeasible or inefficient for large applications with a multiplicity of constraints.
Yet other methods relate to random data generation, which may provide irrelevant and useless data that neither represents real-world data nor complies with the relevant constraints.
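The difference between purely random generation and constraint-based generation can be sketched as follows. This is a simplified illustration only: the record fields, the three constraints, and the use of rejection sampling as the constraint-handling strategy are assumptions made for the example and are not taken from any particular method. The sketch draws random candidate records, discards those that violate a constraint, and reports how many candidates were needed.

```python
import random

# Hypothetical constraints on a hypothetical "order" record.
CONSTRAINTS = [
    lambda r: r["quantity"] >= 1,
    lambda r: r["ship_day"] >= r["order_day"],
    lambda r: r["discount"] <= 0.5 or r["quantity"] >= 10,
]


def random_record(rng):
    """Draw a record with no regard for the constraints (pure random generation)."""
    return {
        "quantity": rng.randint(-5, 20),
        "order_day": rng.randint(1, 365),
        "ship_day": rng.randint(1, 365),
        "discount": rng.random(),
    }


def is_valid(record, constraints=CONSTRAINTS):
    """Check a record against every constraint."""
    return all(check(record) for check in constraints)


def generate_valid(n, rng, max_attempts=100_000):
    """Rejection sampling: keep drawing random records until n satisfy all constraints."""
    valid, attempts = [], 0
    while len(valid) < n and attempts < max_attempts:
        attempts += 1
        candidate = random_record(rng)
        if is_valid(candidate):
            valid.append(candidate)
    return valid, attempts


rng = random.Random(0)  # fixed seed so that runs are repeatable
records, attempts = generate_valid(100, rng)
print(f"{len(records)} valid records from {attempts} random candidates")
```

Each additional constraint lowers the fraction of random candidates that are accepted, which gives an intuition for why this naive strategy may become inefficient for applications with many interdependent constraints.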
Any of the above-mentioned methods may be employed. However, even if useful data is generated, it may not be easily extended, updated or improved when more data is required, when the requirements change, or when the real data to be used by the application changes.