Data warehousing databases were developed to meet a growing demand for management information and analysis that could not be met by operational database systems. While operational systems are optimized for simplicity and speed of modification (e.g., online transaction processing) through heavy use of database normalization and an entity-relationship model, the data warehouse is optimized for reporting and analysis (e.g., online analytical processing). As a result, operational systems were unable to meet the need for management information for such reasons as the increased processing load of reporting, which negatively impacted the response time of the operational systems. Moreover, development of reports in operational systems often required writing specific computer programs, which was slow and expensive.
The critical factor leading to the use of a data warehouse is desire for a data analyst to be able to perform complex queries and analyses (e.g., data mining) on the information without slowing down the operational systems. There are many advantages to using a data warehouse such as the enhanced end-user access to a wide variety of data and increased ability of business decision makers to obtain various kinds of trend reports (e.g., the item with the most sales in a geographical area for the last two years). Advantageously, a data warehouse can be a significant enabler of commercial business applications, most notably customer relationship management applications.
A critical process in any data warehousing scenario is the Data Extraction, Transformation, and Load (ETL) Process. ETL involves such tasks as extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into the data warehouse. While an ETL process can be created using almost any programming language, creating them from scratch is quite complex, requiring high levels of technical expertise, with little opportunity to reuse existing code bases. Increasingly, companies have been seeking ETL tools to help in the creation of ETL processes. One example of such a tool is MICROSOFT's SQL Server Integration Services (SSIS), which provides a platform to build data integration and workflow applications.
In SSIS, the core ETL functions are performed within ‘Data Flow Tasks’. A Data Flow Task is an SSIS executable element responsible for transferring and transforming data between heterogeneous data sources. It is the most important among SSIS executable elements and it has its own complex logical structure with a specialized object model. A Data Flow consists of components and paths that connect the two. Together, paths and components build the data flow execution graph that controls the flow of data. The state of this graph is controlled by a module called data flow (or Pipeline) layout.
Accordingly, data flows in SSIS are built using components that define the sources that data comes from, the destinations that the data gets loaded to, and the transformations applied to data during the transfer. To be used in a dataflow, these source, destination, and transformation components have to be configured by defining the associated metadata. A complete data flow describes the flow of data by specifying one or more ‘source’ points that start the flow, operations downstream of the source points that operate on the data or direct the flow (e.g., merging or splitting the flow), and one or more destination points that ‘sink’ the data at the ends of the flow.
However, like previous data warehousing platforms, the existing SSIS data flow architecture does not offer the possibility of reusing parts of previously configured data flow components except through a crude copy and paste functionality. For example, the copy and paste functionality alone is not sufficient to address the reusability problem, because it does not offer any capability to manage pieces of reusable logic. Furthermore, even with the ability to copy and paste portions of reusable logic, additional remapping of metadata and manual fix-ups are still required, which does not mitigate the expense and required level of expertise to rapidly develop new data flows. Additionally, complex data flows could easily consist of hundreds of components. Currently, those components have to be placed within a single data flow and there is no way to group or modularize them.
Another problem with existing architectures is the inability to address data flow scenarios where only partial portions of the flow are known, such as in the case of a reusable logic portion for example. Such reusable logic portions can have one or more source or destination points that are unknown or are unavailable, can have one or more operations within the flow that are unknown, or any combination of these conditions. Such incomplete data flows cannot be executed on their own. For example, there is no convenient and consistent way to provide standalone execution of such incomplete data flows without significant effort to provide test input sources and output destinations or provide manual fix-ups.
Accordingly, in consideration of the complexity, costs, and high level of expertise required to reuse portions of ETL logic, it would be desirable to provide a convenient way to generate, manage, and reuse modular portions of dataflow logic. It is further desirable to provide a way for standalone execution such incomplete data flows. These and other deficiencies in the state of the art of data warehousing tools will become apparent from the description of the various exemplary non-limiting embodiments of the invention set forth below.