1. Field
Embodiments of the invention generally relate to data processing. More specifically, embodiments relate to techniques for key-break and record-loop processing during data transformation in a parallel data processing system.
2. Description of the Related Art
A common challenge for many modern computer environments is managing large volumes of data. Various systems and software applications exist today for managing and processing large amounts of data. These tools are very useful for data processing in a broad variety of fields, including web portals, medical applications, financial applications, and web applications, to name but a few examples.
For example, one such application has been created by International Business Machines (IBM) under the name InfoSphere® DataStage®. The DataStage® software application is an extract, transform and load (ETL) utility that is part of the IBM Information Server suite. The DataStage® application features a high performance parallel framework and supports the collection, integration and transformation of large volumes of data. Specifically, DataStage® software uses a pipeline model for processing one record of a data volume at a time, in each stage of the pipeline. Data flow in such a pipeline may be acyclic, allowing data to only flow in one direction. As each stage finishes processing a record, it may pass the record to the next stage in the pipeline for further processing. For example, one stage may be a transformer stage, through which users may modify, add or remove data in a record of a data volume, one record at a time, while another stage may be a data source stage, which reads the records in from a source data volume. In this example, the data source stage may read a record from the data volume, and then pass the record to the transformer stage for processing.
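The record-at-a-time pipeline described above can be illustrated with a minimal sketch. The sketch uses Python generators rather than any DataStage® interface; the names `source_stage`, `transformer_stage` and `run_pipeline`, and the `name` field, are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, object]


def source_stage(rows: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical data source stage: emits records one at a time."""
    for row in rows:
        yield dict(row)  # copy so that later stages do not mutate the source


def transformer_stage(records: Iterable[Record]) -> Iterator[Record]:
    """Hypothetical transformer stage: modifies each record as it arrives."""
    for record in records:
        record["name"] = str(record["name"]).upper()
        yield record  # pass the record on to the next stage


def run_pipeline(rows: Iterable[Record]) -> List[Record]:
    """Connect the stages in one direction only: source, then transformer."""
    return list(transformer_stage(source_stage(rows)))
```

Because each stage pulls a single record from the stage before it, the transformer in this sketch only ever holds the record it is currently processing.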
Additionally, the DataStage® application may use data-partitioned parallelism to increase performance in certain stages. That is, input data may be partitioned or re-partitioned as needed while flowing through the pipeline. If a stage is able to process records in parallel, input records to that stage are first partitioned, and then the data on each partition is processed by an instance of that stage. For example, data entering the transformer stage may be partitioned into four partitions, and then four instances of the transformer may process the partitioned data, with each instance processing a separate partition of input data. By taking advantage of data-partitioned parallelism, the DataStage® application is able to process multiple partitions of records concurrently.
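As a rough sketch of data-partitioned parallelism, the following Python example partitions records on a key and runs one transformer instance per partition. It is an illustration under stated assumptions, not the DataStage® implementation: the hash-based partitioning scheme, the thread pool, and the names `partition_records`, `transformer_instance` and `run_partitioned` are all hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_records(records, key, num_partitions):
    """Assign each record to a partition by hashing its key value (assumed scheme)."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        digest = zlib.crc32(str(record[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(record)
    return partitions


def transformer_instance(partition):
    """One instance of a hypothetical transformer; it sees only its own partition."""
    return [{**record, "amount": record["amount"] * 2} for record in partition]


def run_partitioned(records, key="group", num_partitions=4):
    """Partition the input, then run one transformer instance per partition."""
    partitions = partition_records(records, key, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        return list(pool.map(transformer_instance, partitions))
```

Hashing on the key keeps all records that share a key value in the same partition, which matters for any group-oriented processing performed by an instance.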
Although pipeline processing gives applications such as DataStage® increased performance, it also creates additional challenges, because these applications process the data volume one record at a time. For example, certain transformer operations may require information about subsequent records in the data volume. However, this information may be unavailable until those subsequent records are processed by the transformer. For instance, in a data volume containing multiple groups of records, it can be inefficient to calculate each record's percentage contribution to the record's group, because a user cannot determine whether a particular record is the last record in a group without reading the subsequent record. As such, users currently must use multiple operations in user programs to work around this deficiency.
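The dependence on the subsequent record can be made concrete with a short Python sketch. It assumes the input is already sorted so that records of the same group are adjacent; the function name `flag_last_in_group` is hypothetical. The point of the sketch is that a record can be released only after the next record has been read.

```python
def flag_last_in_group(records, key):
    """Yield (record, is_last_in_group) pairs using a one-record lookahead.

    Assumes records that share the same key value are adjacent in the input.
    """
    iterator = iter(records)
    try:
        current = next(iterator)
    except StopIteration:
        return  # empty input: nothing to yield
    for following in iterator:
        # The current record ends its group only if the next key differs.
        yield current, current[key] != following[key]
        current = following
    # The final record of the input always ends its group.
    yield current, True
```

A stage that sees only the current record has no equivalent of `following`, which is why this determination is awkward in a strictly record-at-a-time model.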
Likewise, some transformer operations may require information about previously processed records. For instance, in an aggregation function, a user may wish to calculate the sum of all the records in a particular group. However, because data in such a pipeline is processed one record at a time, the user may be unable to look “backwards” to previously processed records. Consequently, users currently must use multiple operators in a user program to solve even a simple aggregation problem. This leads to additional complexity and inefficiencies both in the user program and in the transformation stage.
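To illustrate why a group sum requires access to previously processed records, the following Python sketch buffers the records of the current group and releases them, annotated with the group total, only once the group is complete. It assumes records of the same group are adjacent in the input; the names `with_group_totals`, `_flush` and `group_total` are hypothetical and do not correspond to any DataStage® operator.

```python
def _flush(group, value):
    """Emit every buffered record of a finished group with the group's sum."""
    total = sum(record[value] for record in group)
    for record in group:
        yield {**record, "group_total": total}


def with_group_totals(records, key, value):
    """Annotate each record with the sum of `value` over its group.

    Assumes records that share the same key value are adjacent in the input.
    """
    buffered = []
    for record in records:
        # A change in key value means the buffered group is complete.
        if buffered and record[key] != buffered[0][key]:
            yield from _flush(buffered, value)
            buffered = []
        buffered.append(record)
    if buffered:
        yield from _flush(buffered, value)
```

The buffer is what provides the "backwards" view: without it, the total is not known until after the earlier records of the group have already left the stage.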