Some embodiments of the present disclosure are directed to an improved approach for implementing referentially-complete data subsetting using relational databases.
The term “data subsetting” refers to the process of creating a partial copy of a given database that comprises only a subset of the data in the given database, yet contains the portions of the data needed for the intended purposes of the smaller copy. The motivations for creating the smaller copy are varied, including at least (for example) when a company is split to two separate operating units and only portions of the given database are applicable to one or the other of the operating units. Another motivation is (for example) when ensuring that a given “production” database can fit into/onto a smaller target computing platform.
Generally, it is intended that the partial copy, even though smaller, nevertheless be referentially complete within itself. The referential completeness of the database ensures that there are no dangling references in the data, e.g., applications that rely on referential completeness of the database perform as intended using the partial copy.
Unfortunately, prior attempts at data subsetting have suffered from many deficiencies and/or side-effects. The legacy examples below highlight these deficiencies:                Subset Size Cutoff/Limits: In this legacy technique, a subset is created by assuming an arbitrary percentage of the original number of rows. This technique does not take into account data relationship cardinality. Frequently, because of variations in data distribution, it is not possible to estimate the impact of picking up an arbitrarily-limited number or percentage of rows in a table.        Blind Subtraction: Legacy deletion techniques merely make a copy of the database and delete rows. This technique can be very slow, especially when deleting rows requires undo and/or redo logging operations. Also, this process requires the setup of a shadow production copy which can consume the same amount of space as the production copy itself.        Remote Copy Creation: A subset execution model is characterized by copying the rows to a test system over JDBC (Java Database Connectivity) or other database connections. This method requires that data be moved a few rows at a time and it does not minimize redo and undo logging.        
Moreover, the aforementioned technologies do not necessarily perform in a manner so as to create a referentially-complete data subset. Therefore, there is a need for an improved approach.