1. Field of the Invention
The invention relates generally to transformation of data read from a data source so that it is in the proper form to be stored in a data sink and more particularly to transformation of data that is organized in one manner in the data source so that it can be stored in a data sink where it is organized in another manner.
2. Description of Related Art
One of the most valuable assets that any organization has is the data that it accumulates as a result of its operations. Nowadays, an organization stores its data in databases maintained by database systems. An organization of any size at all will have a number of different databases. The reason for this is that each database has an organization peculiar to itself and different uses of data require different organizations of the data. For example, a database which a company uses when dealing with its customers is termed an operational data base. Operational databases are designed to perform transactions: that is, they record when an order is received from a customer, when the order is shipped, when the company is paid, and so on. The operational database is thus organized in a fashion which permits such transactions to be performed as rapidly and reliably as possible.
Another use of databases in an organization is to store data for analysis. The ultimate source of much of this data is the operational databases, but if the data is to be useful for analysis, it must be organized in the database in a fashion which permits the analyst to easily find out what he or she wants to know. One class of such analytical databases is termed data warehouses; the data warehouse contains data from all of the organization""s operations and is organized in a fashion which permits global analysis. Another class of such analytical databases is termed data marts; a data mart contains information that is useful for a particular functional area of the organization. The data in a data mart may come from the data warehouse or the operational data or other sources, but it is organized in the way that is most useful for the functional area that uses the data.
As is apparent from the foregoing, many organizations now have a number of differently-organized databases in which much of the information is the same. It has thus become necessary for these organizations to move data from a source database in which it is organized in one way into one or more sink databases in which it is organized in different ways. Because the data in the source database is organized differently from the data in the sink database, it is heterogeneous to the sink database. In the course of the move, the data must be transformed so that it is in the proper form for the sink database and is therefore no longer heterogeneous to it.
Data may be heterogeneous to a given database in two ways: it may be schematically heterogeneous and it may be semantically heterogeneous. Data is schematically heterogeneous when it is organized according to a different model from the one used in the given database, for example, when it is organized to be quickly accessible by customer account number rather than by an interval of time. Data is semantically heterogeneous when the data values have different semantics and structural representations from the ones used in the given database. For example, one database may represent monetary amounts as integer numbers of pennies, while another may represent them as fixed decimal numbers.
Schematic heterogeneity results in two kinds of conflicts between the source and sink databases: naming conflicts and structural conflicts. Naming conflicts include synonym and homonym problems with names of both tables and columns within tables; structural conflicts result when the structure in which a given piece of information appears in the source database is different from the way in which the given piece of information appears in the sink database. For example, the information which may need to be copied from the source to the sink may be a list of date-stock price pairs for the stock of Oracle Corporation. In the source database, the list may be in a table called xe2x80x9cOraclexe2x80x9d that has two columns: one for the dates and the other for the stock prices. In the sink database, the list of oracle stock prices may be part of a stock prices table that includes stock prices of other companies, with columns for the date, the name of the stock, and the stock price. In another sink database, the stock prices table may have columns for the date and for each company, with the price for a given company on a given date being in the company""s column in the row in which the given date appears in the date column.
Semantic heterogeneity results when the data values in the source have different representations in the sink. The different representations may simply be different names, for example, xe2x80x9cOraclexe2x80x9d represents Oracle Corporation in the source, while xe2x80x9cOracle Corp. represents it in the sink, they may be different measurement scales, for instance prices in U.S. $ in the source and prices in Japanese ¥ in the sink, or decimal stock prices in the source and fractional stock prices in the sink, or different measurements, for example, latest stock closing price in one database versus latest trade price in the other, or different granularity of measurement (for example, monthly salary versus yearly salary), and classic data type conflicts. For example, a serial number may be represented as an integer in one database and as a character string in the other. Even where the representations are the same, there may be size differences. One database may use a 10-character string for the serial number, while another may use a 16-character string.
As will be apparent from the foregoing, data transformation is a necessary and complex part of copying information from one database to another. One way to do data transformation is to write a special program to perform the transformation. The problem with this approach is that the person writing the program must be an expert in the source and sink database systems, in the structures of the of the source and sink databases, in the semantics of the data being transformed, and in the programming language used for the special program. Writing transformation programs is thus strictly a matter for data processing experts, with all of the concomitant expense and delay.
The need for data processing experts was reduced by the development of a data transformation system by Sagent Technology, Inc. This system is described in the paper, Colin J. White, Building a corporate information system: the role of the datamart, which is dated February, 1997 and was available in December, 1999 at the Sagent Technology Web site, www.sagent com. FIG. 1 gives an overview of this system 101: System 101 permits a user to copy data stored in data source 111 to data sink 113 while doing the necessary transformation. The user defines the transformation using a set of pre-defined data transforms that are supplied with system 101. Included in these transforms are a number of source read transforms 115 for reading data from different kinds of data sources 111, a number of intermediate transforms 117, which define different transformation operations for dealing with inhomogeneities between the data in source database 111 and the data as it is stored at 120 in data sink 113, and a number of write transforms 119 for writing data to different kinds of data sinks 113.
To make the transformation required to copy data from source 111 to sink 113, the user specifies a data flow plan 107 which is a sequence of the transforms. The first is the source read transform 115 required for the data source. For example, if the data source is a relational data base, source read transform 115 is a simple SELECT query (no WHERE clauses) running on a real or virtual table defined in source database 111. Then come as many of the intermediate transforms as it appears to the user are required to perform the transformation. Each intermediate transform receives data from its predecessor transform and provides the data, transformed as specified by the transform, to the successor transform. Finally, when the data is in the form required for data sink 113, the sink write transform required for data sink 113 does the writing. The intermediate transforms include join and union transforms, which perform join and union operations on the outputs of two predecessor transforms, and split transforms, which split the input they receive into output for two successor transforms.
In the Sagent system, the user specifies the sequence of transforms by means of a graphical user interface 105 in which each transform is represented by an icon. To make a data flow plan, the user simply arranges the icons in the order in which the transforms are to occur. Data flow plan maker 104 takes the sequence of icons specified by the user and makes from it data flow plan 107, which is code that can be executed by flow plan execution engine 109. Flow plan execution engine 109 performs the transforms specified in flow plan 107 in the order in which they are specified in the flow plan, beginning with the read operation specified in read transform 115 and ending with the write to sink operation specified in write transform I19.
While the Sagent system""s technique of specifying a transformation as a sequence of predefined transforms makes it possible for people who understand only where the data is in the source, what transforms are necessary, and where the transformed data is to go in the sink to copy data from a source 111 to a sink 113, the data flow, plans 107 produced using the technique have a number of problems:
They take no advantage whatever of the powerful data processing capabilities of modern database systems. This has a number of costs:
The database systems have been specifically designed to do the operations that are typically specified in the transforms. In many cases, data source 111 or data sink 113 can do much of the work involved in a transformation much more efficiently than flow plan execution engine 109 can.
Since all of the transforms are done in flow plan execution 109, all of the data must be moved from data source 111 to execution engine 109. Engine 109 and data source 111 are typically on different platforms, so the move must be done via a slow cross-platform link.
data is passed from one transform to the next long after it is required for the transformation.
Each transform has a significant amount of overhead, but the Sagent system makes no attempt to reduce the number of transforms.
Flow plan execution engine 109 will process transforms in parallel, but data dependencies among transforms prevent system 101 from taking full advantage of the performance improvements offered by parallel processing.
The efficiency of the flow plan depends completely on the capability of the user who makes it. There may be a better way to do the transformation than the user chooses, but the system simply does what the user specifies.
It is an object of the invention disclosed herein to solve the foregoing problems and to thereby provide a system for transforming which is as easy to use as system 101, but more efficient.
The invention disclosed herein provides solutions to the problems of presently-available data transformation systems by providing a data transformation system with a data flow optimizer that optimizes the data flow plan to make it more efficient. The optimized data flow plan produced by the optimizer is equivalent to the original data flow plan but has fewer transforms than the original data flow plan.
The optimizations performed by the optimizer include read/write optimization and merge optimization. Read/write optimization is used when the data source or sink is a database system or other system which can perform operations on the data it contains. In this kind of optimization, the optimizer finds transforms in the original data flow plan that specify operations that can be performed in the source or sink and then replaces these transforms and the simple read or write transform of the original data flow plan with a read or write transform that specifies that operations equivalent to those of the replaced transforms be done in the data source or sink. Advantages of read/write optimization in addition to reducing the number of transforms include using the often highly efficient facilities of the source or sink system to perform the operation and reducing the amount of data that is moved from the source to be processed by the data transformation system.
Merge optimization combines intermediate transforms into a single merge transform that specifies operations equivalent to those specified in the combined transforms. The optimizer further constructs the merge transform in such a fashion that operations in the merge transform that are independent of each other can be performed in parallel. Merge optimization thus not only reduces the number of transforms, but also increases the amount of processing that can be done in parallel and reduces the amount of data that is moved through the transformation. A given data flow plan may have merge optimization, read/write optimization, or both applied to it. In one aspect of the invention, the original data flow plan continues to be displayed in the user interface after optimization, i.e., the optimization is transparent to the user.
In another aspect of the invention, the optimizer includes a data flow plan analyzer that examines the transforms of the data flow plan to determine whether they are optimizable. If the analyzer finds that a transform is optimizable, it passes properties of the transform to a transform optimizer. The transform optimizer collects the properties for optimizable transforms until the analyzer finds the next transform that is not optimizable. Then the transform optimizer produces an optimized read transform, an optimized write transform, or a merge transform from the sequence of transforms, depending on the kind of optimization being done and replaces the sequence of transforms with the optimized transform. The analyzer may increase the amount of optimization that can be done by reordering the transforms of the original data flow plan to produce. longer sequences of optimizable transforms.