This invention relates to data processing tools for assisting users in decision making processes. Data Warehousing is a tool used for providing users with information needed in various decision making processes, for example, in various types of businesses and other types of organizational environments. In order to provide current information, the data warehouse must be continuously updated and consolidated to properly reflect a current state of a business or organizational entity. Data from multiple sources is extracted, transformed and loaded into the data warehouse at various times in a process. Typically, the transformation steps are modeled as data flows, which can be implemented as SQL scripts, shell scripts, application programs, and other types of command execution. Frequently, in an enterprise, for a complete end-to-end solution, such data flow executions are also interspersed with other existing processes, for example, legacy or third party processes.
Businesses and organizations often rely on a service-oriented approach to support their processes. A programming language that is used in this service-oriented approach is BPEL4WS (Business Process Execution Language for Web Services). This standard language is supported by a variety of runtime engines, and enables the orchestration of web services. Common data warehousing scenarios, such as the above mentioned extract, transform and load (ETL) scenario, are not well supported in BPEL-based processes, as the current BPEL language does not provide for a native definition of data integration processing elements.
A BPEL process is a flow of activities, where BPEL branching semantics indicate sequenced or parallel execution of the activities, including handling errors in activity executions. A BPEL process can have variables, which are used as either input or output parameters in the context of an activity's invocation. Several BPEL flow level and activity level attributes can exist. For example, groups of activities can be part of the same transaction. Other attributes include settings, such as whether transactional processing is managed by a BPEL engine, or whether individual activities manage their own transactions.
In contrast to BPEL processes, whose focus is the coordination of activities among service providers, data warehouse processes usually need to handle significant amounts of data. A data warehouse process includes a set of steps to extract, transform and load data, which are repeatedly performed depending on predefined control conditions. Data movement and transformation steps use database functionality and script files, as well as custom programs, to process the data. A data warehouse process can be scheduled (or triggered), and there is usually only a single instance of a data warehouse process running at any specific time. Error recovery in transformational processing is critical and is done using the transactional processing capability of SQL engines, or by running specific cleanup flows to “undo” partial processing.
Having a service-oriented approach, based on standards such as BPEL, is essential for enterprise component and process integration, especially when multiple parties are involved. With a service-oriented approach the distinctions between control processes (that is, BPEL processes) and data (that is, warehouse transformation) processes becomes less clear, especially from an end user's or administrator's perspective. Importantly, all the constituent activities, including the data transformation activities, must participate in a unified transactional context as well, to enable proper error recovery and restarts of failed processes, and so on. However, BPEL provides little support for data transformation activities. Hence, it would be advantageous to have a seamless integration of process steps and data integration steps to enable a combination of activities as flexibly as needed, while avoiding having to rely on multiple technology solutions for solving similar problems.
One solution for integrating a data flow into a BPEL-based process is to expose a data flow job through a web service. This approach, however, requires a separate web service interface for each data flow job, leading to a proliferation of web services in complex ETL scenarios and thus to extensive administration requirements. Furthermore, from an invocation point of view, web services are remote, which adds overhead for invocations of data transformations that have the potential of being handled locally.
Another possible solution is based on extensions to the BPEL-standard to support various high-level programming languages. For example, BPEL4Java systems enable BPEL process developers to embed Java™ code into BPEL, such as Java™ conditional expressions for evaluating branches or for invoking Java™ methods. However, these generic programming related solutions do not pertain to typical data transformation scenarios, especially since ETL jobs are usually executed on runtime (remote) systems.
Often, data transformations are performed using SQL. Thus, yet another possible solution for including data integration processing into a process is to make use of a native SQL-extension approach in the BPEL language. This solution has the advantage of integrating the data directly into the process, but also implies that already existing data transformation flows would need to be re-implemented as SQL embedded inside BPEL. There is also a dependency on a particular runtime environment as the BPEL engine must support this non-standard extension, and the data processing activities are limited to SQL. In typical situations, external programs and other non-SQL mechanisms are extensively used to complete even the most trivial of data warehouse processing. Also, the “dialect” of SQL used varies between different vendors, which makes it difficult to provide a standardized and useful extension to BPEL.