Sampling or subsetting is a data management technique to extract a representative sample of a primary or production dataset for the purpose of deploying the representative sample on a smaller test, development, or research cluster. Mechanisms exist to sample a single dataset, such as a table or a file; however, applying those mechanisms to multiple datasets, such as all tables in a database, does not guarantee a “consistent” sample, i.e., a sample that honors the referential constraints between the datasets.
As an example, consider a typical data model involving two tables, named Customers and Orders, where the Customers table stores personal information for known customers, while the Orders table stores information about customer orders over time. A relationship between the tables is established by means of a customer_id attribute that is defined in the Customers table and referenced in the Orders table for every order placed by the customer. Thus, the customer_id attribute acts as a referential constraint for the Orders table.
Given these two tables, simply taking a 10% sample of each of the two tables to obtain a sampled dataset will not guarantee that for every row in the sampled Orders table there will be a corresponding row in the sampled Customers table within the sampled dataset. Hence, the sampled dataset cannot be said to be consistent.