This disclosure relates to electronic data processing and, more particularly, to computer-implemented debugging of a data flow associated with an extract, transform and load (ETL) process.
Many organizations now generate vast amounts of data in various formats, for example because different locations produce data in different formats. Nevertheless, there may be a need to centralize the data so that a top-level or organization-wide evaluation of the data can be performed. For instance, a chain of retail outlets may need to centralize the sales data from its various outlets, e.g. to evaluate or determine business trends, so that appropriate business strategy decisions can be based on the aggregated data. Many other scenarios are of course well known.
However, amalgamating the data from the different outlets is usually not a trivial exercise, for instance because the data is not in the format required for storage in a large central database or data warehouse, because the data from different sources is in different formats, because the data from a source may contain spurious data entries that need filtering out, and so on.
To facilitate such data centralization, computer-implemented extract, transform and load (ETL) tools have been developed that automatically extract the data from the various sources, transform the data into user-specified format(s) and load the data into the desired target, e.g. a data warehouse. Such tools typically offer an end user a selection of transformation operations, from which the end user can define, in the form of one or more jobs, the appropriate transformation operations on the data from selected sources. In addition, where an ETL tool is capable of parallel processing of some of the ETL tasks, the user may be able to define the degree of parallelism in such a job, e.g. by defining a data partitioning level, the number of pipelines (in order to reduce input-output (I/O) to disk) and/or the nodes to be used by the ETL tool. Such a tool thus typically creates a connection between data sources and targets, in which the source data is manipulated at the transfer stage before it is forwarded to or stored in its target, e.g. a data warehouse.
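The extract, transform and load sequence described above can be illustrated with a minimal sketch. It is not taken from any particular ETL tool: the outlet names, field names, date formats and the functions `extract`, `transform`, `load` and `run_job` are all hypothetical, and an in-memory list stands in for the data warehouse.

```python
from datetime import datetime

# Hypothetical source data: two outlets reporting sales in different formats.
OUTLET_A = [
    {"date": "2023-01-05", "amount": "12.50"},
    {"date": "not-a-date", "amount": "3.00"},  # spurious entry to be filtered out
]
OUTLET_B = [{"day": "05/01/2023", "total_cents": 1250}]


def extract(sources):
    """Yield (source_name, raw_row) pairs from every source."""
    for name, rows in sources.items():
        for row in rows:
            yield name, row


def transform(name, row):
    """Convert a raw row to one common format; return None to drop the row."""
    try:
        if name == "outlet_a":
            day = datetime.strptime(row["date"], "%Y-%m-%d").date()
            cents = round(float(row["amount"]) * 100)
        else:
            day = datetime.strptime(row["day"], "%d/%m/%Y").date()
            cents = int(row["total_cents"])
    except (KeyError, ValueError):
        return None  # malformed or spurious entry
    return {"outlet": name, "day": day.isoformat(), "cents": cents}


def load(rows, warehouse):
    """Append the transformed rows to the target (a list standing in for a warehouse)."""
    warehouse.extend(rows)


def run_job(sources):
    """Run one job: extract from all sources, transform, filter, load."""
    warehouse = []
    cleaned = (transform(name, row) for name, row in extract(sources))
    load((row for row in cleaned if row is not None), warehouse)
    return warehouse


WAREHOUSE = run_job({"outlet_a": OUTLET_A, "outlet_b": OUTLET_B})
```

In this sketch the rows stream from source to target through generators, so nothing is stored between the transform and load steps. Parallelism, partitioning and pipelining, which a real tool may expose as job settings, are deliberately left out.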
Before the jobs can be routinely executed, it may be necessary to ensure that the ETL process behaves as intended. To this end, the user typically needs some form of debug functionality, e.g. to check some of the (intermediate) data generated in the ETL data flow. An ETL tool may allow a user to insert so-called data station operators into a data flow of an ETL process, in which each data station operator represents a staging point in the data flow. The staging stores intermediate processed data for purposes such as debugging. Although this approach gives the user debugging functionality, it is not particularly practical, especially in the case of large ETL jobs, because a user may have to wait for large parts of the job to complete before the staging point captures the intermediate processed data. It may also put pressure on intermediate data storage, e.g. disk space on the platform used to execute the ETL process, as large amounts of data may have to be stored temporarily to allow the user to check their accuracy before the data is moved to the target destination.
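The drawback of a staging point can be made concrete with a small sketch. It is an illustration under stated assumptions rather than the behavior of any specific ETL tool: the function `run_with_data_station` and the two stage functions are hypothetical, and an in-memory list stands in for the intermediate disk storage.

```python
def run_with_data_station(rows, upstream, downstream):
    """Run the upstream stage, stage all of its output, then run downstream.

    The `staged` list plays the role of a data station operator: it holds the
    complete intermediate data set, which the user can inspect only after the
    upstream stage has processed every input row.
    """
    staged = [upstream(row) for row in rows]  # whole intermediate set is stored
    target = [downstream(row) for row in staged]
    return staged, target


# Hypothetical stages: doubling stands in for a transformation, and adding
# one stands in for the remaining processing before the load.
ROWS = range(5)
STAGED, TARGET = run_with_data_station(
    ROWS,
    upstream=lambda row: row * 2,
    downstream=lambda row: row + 1,
)
```

Here `STAGED` holds one entry per input row, so the storage needed at the staging point grows with the size of the job, and no entry is available for checking until the upstream stage has finished.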