A relational database is a collection of related data that can be represented by two-dimensional tables of columns and rows wherein information can be derived by performing set operations on the tables, such as join, sort, merge, and so on. The data stored in a relational database is typically accessed by way of a user-defined query that is constructed in a query language such as Structured Query Language (SQL).
Often it is useful to extract data from one or more sources, transform the data into some more useful form, and then load the results to a separate destination. A data warehouse, for example, is a central repository for all or significant parts of the data that an entity's various business systems collect and store (often in separate databases), the purpose of the data warehouse being to support data mining, decision support systems (DSS), and other data actions. Data from various sources is selectively extracted and organized on the data warehouse database for use by analytical applications and user queries. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access.
In the context of a data warehousing, and more generally for managing databases, extract-transform-load (ETL) refers to three separate functions of obtaining, processing, and storing data. The extract function reads data from a specified source database and extracts a desired subset of data. The transform function works with the acquired data—using rules or lookup tables, or creating combinations with other data—to convert it to the desired state as defined by the specific ETL tool. The load function is used to write the resulting data (either all of the subset or just the changes) to a destination database. Various and diverse ETL tools can be used for many purposes, including populating a data warehouse, converting a database of a specific type into a database of another type, or migrating data from one database to another.
In general, ETL tools operate to perform the aforementioned simple three-step process: (a) the ETL tool extracts the data from the source; (b) the ETL tool transforms the data according to its predefined functionality; and (c) the ETL tool loads the data to the destination. However, while basic transformations can be achieved with simple ETL tools, complex transformations require custom development of new ETL tools with specific and complex functionality—an approach that is resource intensive. While simple ETL tools might have broader usability and thus naturally lend themselves to widespread reuse, complex ETL tools do not lend themselves to reusability due to their high customization and narrow utility (and thus the frequent need to custom develop complex ETL tools when they are needed).
U.S. patent application Ser. No. 10/391,726, entitled “SYSTEMS AND METHODS FOR SCHEDULING DATA FLOW EXECUTION BASED ON AN ARBITRARY GRAPH DESCRIBING THE DESIRED DATA FLOW”, filed Mar. 18, 2003 is directed toward database technology that provides users with a means for developing complex transformation functionality that is more efficient than custom development of complex ETL tools. That application discloses a system and method for scheduling data flow execution based on an arbitrary graph describing the desired flow of data from at least one source to at least one destination. The data transformation system (DTS) in one embodiment of the that application comprises a capability to receive data from a data source, a data destination and a capability to store transformed data therein, and a data transformation pipeline (DTP) that constructs complex end-to-end data transformation functionality (data flow executions or DFEs) by pipelining data flowing from one or more sources to one or more destinations through various interconnected nodes (that, when instantiated, become components in the pipeline) for transforming the data as it flows by (where the term transforming is used herein to broadly describe the universe of interactions that can be conducted to, with, by, or on data). Each component in the pipeline possesses specific predefined data transformation functionality, and the logical connections between components define the data flow pathway in an operational sense.
The data transformation pipeline (DTP) enables a user to develop complex end-to-end data transformation functionality (the DFEs) by graphically describing and representing, via a graphical user interface (GUI), a desired data flow from one or more sources to one or more destinations through various interconnected nodes (a graph). Each node in the graph selected by the user and incorporated in the graph represents specific predefined data transformation functionality (each a component), and connections between the nodes (the components) define the data flow pathway.
After the user inputs a graph, the DTP's scheduler traverses the graph and translates the graph into lists of specific work items comprised of a relatively small set of functionality necessary to efficiently obtain data from an external source, route data from transformation process to transformation process (component to component) as reflected in the graph, and then release the resultant data to an external target destination. Despite its name, the scheduler does not schedule work items into time slots, but instead it forms work lists and then manages the operation of the work items in the lists. As such, the scheduler work items comprise the following elements of functionality (each discussed in more detail herein):                obtaining data from a data source        providing data to a component        enabling the split of data along two or more paths        enabling the merger data from two or more paths into a single path        passing data to a thread        waiting for and receiving data from a thread        
DTS also provides a multitude of components with defined inputs and outputs, whereby the user can graphically construct complex data transformations to combine the functionality of the components to achieve the desire end results. These components, similar to a plurality of ETL tools but lacking the individual functionality of ETL tools to extract and load data (as these tasks are handled by the scheduler in the DTP subsystem), provide black box transformation functionality—that is, components can be developed on a variety of platforms (Java, ActiveX, etc.), but the development platform is irrelevant to the DTP as it (and the user) are only concerned about the inputs, outputs, and transformation functionality.
Adding to the efficiency of the system, the DTP also utilizes a unique memory management scheme whereby data extracted from an external source is placed in a memory buffer where it is then manipulated by the components without the need for copying. This technology is discussed in U.S. patent application Ser. No. 10/681,610, entitled “SYSTEMS AND METHODS FOR TRANSFORMING DATA IN BUFFER MEMORY WITHOUT UNNECESSARILY COPYING DATA TO ADDITIONAL MEMORY LOCATIONS”, filed Oct. 8, 2003.