1. Field
The present invention relates to a method, system, and article of manufacture for discovering transformations applied to a source table to generate a target table.
2. Description of the Related Art
Database users may perform various operations known as transformations on source data tables to produce target data tables. For instance, a business may transform certain financial or customer information into target data by taking the sum or averages of data in a column. Transformations may include string and arithmetic functions, simple column mappings in which the value of a source column is copied unmodified to a target column, aggregates, and user-defined functions.
FIGS. 1a-1j provide examples of transformations known in the art. FIGS. 1a-1h provide examples of value-based correlation, or association transformations that associate values in the source data with values in the target data by means of their actual values. FIG. 1a shows a simple mapping where values of one column from the source dataset are copied without modifications to another column in the target data set. FIG. 1b shows an example of a key lookup transformation where values of one column from the source data set are copied in another column in the target data set via a systematic modification of the value. Each value in the source column is transformed into another value in the target column, and the correspondence between the values may be provided in a lookup table. The example shows that each “0” in the source is converted into a “No” in the target, and each “1” in the source is converted into a “Yes” in the target. FIG. 1c provides an example of a pivot transformation where data stored in several columns of one row in the source data set is converted into a 2-column, multi-row format in the target dataset. In FIG. 1c, each row in the source table contains income values for each quarter for a specific ID. The same data is coded in the target table in a different form. Each row in the target contains the income for a specific ID and a specific quarter, such that the data contained in the first row of the source table is stored in the first 4 rows of the target table.
FIG. 1d illustrates a reverse-pivot transformation, which is the opposite of a pivot transformation, such that data stored in the same column over several rows in the source table is stored in one row and several columns in the target. FIG. 1e illustrates an example of a string concatenation transformation where values of two or more columns of the source data set are concatenated into one single column value in the target data set. FIG. 1f illustrates an example of a string split transformation, which is the opposite of a concatenation, such that a single column value in the source dataset is split into several column values in the target. FIG. 1g illustrates an example of a substring transformation where a subset of a string value in a source column is copied in a column of the target data set. FIG. 1h illustrates an example of a scalar function in which the value of one column in the target data set is computed from the result of a scalar or user-defined function on the values of one or more columns in the source data set.
FIG. 1i illustrates an example of an aggregation type transformation where the column values from several rows from the source data set can be grouped by a key and aggregated value into a single row in the target data set by using an aggregation function. Examples of aggregate functions include sum, mean, maximum, minimum, standard deviation or other statistics that can be computed for a group of values.
FIG. 1j illustrates an example of an arithmetic type transformation, where the arithmetic transformation computes the numeric value in the target column from a polynomial of several numeric column values in the source data set.
An organization or enterprise may hire programmers to write, compile and test the programs that transform source data sets to target data sets. However, over time information on the transformations that were used to produce target data sets may disappear from the organization for a variety of reasons, including poor documentation, loss of the source (uncompiled) version of the software, loss of the developers who wrote the software, or lack of available skills in the programming language (e.g., COBOL). This leaves the enterprise in a precarious position of not being able to maintain, upgrade or migrate critical software programs unless they can recreate the transformations that relate the source data to the target data.
Reconstructing an understanding of the transformations may be undertaken by a person reviewing the source program including the transformation or source and target data. If the source program including the transformation is available, the enterprise can hire an expert to review the source, create documentation, and/or design and implement a migration to a more maintainable software platform (e.g., COBOL to Java). If the source program including the transformation is not available, then typically a person visually inspects the source and target data values and tries to derive the transformations via trial and error. In either case, the cost in terms of time and money is significant, and the manual nature of the process introduces possibilities for misinterpretation and error.
There is a need in the art for improved techniques for determining transformations used to produce target data sets from source data sets.