With the ever-increasing amount of data available to be processed and analyzed, and with the availability of systems having multiple processor cores (e.g., Central Processing Unit (CPU) cores), more and more data processing applications are being built to process data using parallel data processing techniques. A common technique for processing data in parallel is to partition the data into subsets and run many instances of the application's processes or threads so that they operate on the subsets simultaneously (i.e., in parallel).
Often, the data that was created in one parallel processing system is moved to, or through, one or more other parallel processing systems. In the current state of the art, these parallel processing systems typically do their parallel processing in a proprietary manner and interface with other parallel processing systems using a “lowest-common-denominator” type of approach. With such an approach, the first application partitions the data in a first manner (e.g., into 4 partitions); then the data is serialized and written to a sequential file on disk or passed through a sequential Application Programming Interface (API) to the second application; and the second application re-partitions the data in a second manner (e.g., into 2 partitions). This interaction between parallel processing systems introduces bottlenecks into the overall processing of the data, reducing the benefits gained by using these parallel processing systems.
Some products in the market have developed special functionality to improve on this. For example, one implementation works only if the data is moving from application A to application B, and different implementations are needed to move data from application B to application A or from application A to application C. Thus, when specific functionality is added to allow for parallel data exchange, each interaction is a one-off, custom implementation.
The following example is provided to clarify existing solutions. The example uses a set of records. In a data integration scenario, a job is designed to gather the two-letter option codes of all the records with the same key onto a single line. First, the records are sorted. Then, a task loops through each subgroup of records (grouped by the key), extracts the option code of each of the records, and concatenates the option codes into a single line.
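As a point of reference, the job described above can be sketched sequentially, with no partitioning at all. This is a minimal illustration only: the function name `concatenate_option_codes` is hypothetical, the records are the simplified sample data shown later in this section, and sorting by option code within a key is an assumption made to match the sorted output shown there.

```python
from itertools import groupby
from operator import itemgetter

# Simplified sample records as (key, option_code) pairs.
RECORDS = [
    (3, "TM"), (3, "ET"), (1, "CD"), (1, "EF"), (2, "DB"),
    (3, "FG"), (3, "LO"), (3, "PM"), (3, "WZ"), (4, "BW"),
    (5, "SV"), (3, "FV"), (3, "RH"), (1, "AB"), (6, "MU"),
]

def concatenate_option_codes(records):
    """Sort the records, then emit one (key, "code code ...") row per key."""
    ordered = sorted(records)  # orders by key, then by option code
    return [
        (key, " ".join(code for _, code in group))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

for key, codes in concatenate_option_codes(RECORDS):
    print(key, codes)
```

Because `groupby` only merges adjacent records, the sort step is what guarantees that each key ends up on exactly one line.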
When there are a large number of records, a first application (e.g., an Apache Hadoop® application or a JavaScript® Object Notation (JSON) Query Language (JAQL) application) may be used to perform the sort, while a second application (e.g., an Extract, Transform and Load (ETL) application) is used to perform the looping and option code extraction. (Apache Hadoop is a registered trademark of the Apache Software Foundation in the United States and/or other countries. JavaScript is a registered trademark of Oracle Corporation in the United States and/or other countries.) In processing, the output of the first application will be used as the input for the second application.
To process the records in parallel, each application partitions the records: the first application partitions the records in n ways, while the second application partitions the records in m ways. The second application therefore expects n streams of input records from the first application and assumes that the records are sorted. The second application does not rely on any constraint or correlation between the n streams. The second application runs m instances of the job (m ways of data partitioning) against the n streams of records generated by the first application.
Although a typical data set has a large number of records, the following example uses a small data set for illustration, which avoids the complexities of processing a large number of records. Each record includes a key, an option code, and other fields. For simplicity, only two fields, the key and the option code, are used to describe this example. The simplified sample data (without the large volume) looks like the following:
Key    Option Code
3      TM
3      ET
1      CD
1      EF
2      DB
3      FG
3      LO
3      PM
3      WZ
4      BW
5      SV
3      FV
3      RH
1      AB
6      MU
For this example, assume that the first application partitions the records in 4 ways (n=4) and that the second application partitions the records in 2 ways (m=2).
In phase 1 of the processing, the first application sorts the records. In the above example data, there is skew in the data (there are many rows with the key of 3), but the skew is not manifesting itself in an imbalance because the skewed data is spread over a number of reducers (in a map/reduce system). The following shows the output of the first application with each of the partitions of sorted records:
Partition 1
Key    Option Code
1      AB
1      CD
1      EF
2      DB

Partition 2
Key    Option Code
3      ET
3      FG
3      FV
3      LO

Partition 3
Key    Option Code
3      PM
3      RH
3      TM
3      WZ

Partition 4
Key    Option Code
4      BW
5      SV
6      MU
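The phase 1 output above can be reproduced with a short sketch. The text does not say how the first application chooses its partition boundaries, so the approach here (a global sort followed by a split into contiguous chunks of roughly equal size) is an assumption chosen because it yields the same four partitions. The function name `sort_and_partition` is hypothetical.

```python
import math

# Simplified sample records as (key, option_code) pairs.
RECORDS = [
    (3, "TM"), (3, "ET"), (1, "CD"), (1, "EF"), (2, "DB"),
    (3, "FG"), (3, "LO"), (3, "PM"), (3, "WZ"), (4, "BW"),
    (5, "SV"), (3, "FV"), (3, "RH"), (1, "AB"), (6, "MU"),
]

def sort_and_partition(records, n):
    """Sort all records, then cut the sorted list into contiguous chunks.

    Assumption: chunk boundaries ignore key boundaries, so one key can
    span several partitions. For some input sizes this simple split may
    produce fewer than n chunks.
    """
    ordered = sorted(records)
    size = math.ceil(len(ordered) / n)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

partitions = sort_and_partition(RECORDS, 4)
for number, part in enumerate(partitions, start=1):
    print("Partition", number, part)
```

Splitting by size rather than by key keeps the partitions balanced despite the skew on key 3, but it also means that the records for key 3 land in both partition 2 and partition 3.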
In phase 2 of the processing, the second application receives the partitioned records and loops through each sorted group of records to extract and concatenate the option codes. In this example, the second application runs in 2 ways, getting records output by the first application in two streams. Stream A gets partition 1 and partition 3 records, while stream B gets partition 2 and partition 4 records. Then, the second application generates the following results:
Key    All Option Codes
1      AB CD EF
2      DB
3      ET FG FV LO
3      PM RH TM WZ
4      BW
5      SV
6      MU
However, the result is not correct, because key 3 is spread across two rows. The correct result should not have one key spread across more than one row and should look like the following:
Key    All Option Codes
1      AB CD EF
2      DB
3      ET FG FV LO PM RH TM WZ
4      BW
5      SV
6      MU
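The difference between the two results can be reproduced with a small sketch of phase 2. The helper names `assign_round_robin` and `concatenate_per_stream` are hypothetical, and the round-robin assignment is an assumption that matches the example (stream A receives partitions 1 and 3, stream B receives partitions 2 and 4). The last step feeds all partitions through a single stream, purely to show that the correct rows appear once the two instances no longer work independently.

```python
from itertools import groupby
from operator import itemgetter

# Phase 1 output: four partitions of sorted (key, option_code) records.
PARTITIONS = [
    [(1, "AB"), (1, "CD"), (1, "EF"), (2, "DB")],
    [(3, "ET"), (3, "FG"), (3, "FV"), (3, "LO")],
    [(3, "PM"), (3, "RH"), (3, "TM"), (3, "WZ")],
    [(4, "BW"), (5, "SV"), (6, "MU")],
]

def assign_round_robin(partitions, m):
    """Append partition i to stream i % m (index 0 is stream A, 1 is stream B)."""
    streams = [[] for _ in range(m)]
    for i, part in enumerate(partitions):
        streams[i % m].extend(part)
    return streams

def concatenate_per_stream(stream):
    """Concatenate option codes of adjacent records that share a key."""
    return [
        (key, " ".join(code for _, code in group))
        for key, group in groupby(stream, key=itemgetter(0))
    ]

# Two independent instances, each seeing only its own stream.
streams = assign_round_robin(PARTITIONS, 2)
naive = sorted(row for s in streams for row in concatenate_per_stream(s))

# One sequential pass over all partitions in order.
merged = concatenate_per_stream([rec for part in PARTITIONS for rec in part])

print(naive)   # key 3 appears in two rows
print(merged)  # key 3 appears in one row
```

Each instance produces a locally valid row for key 3, and neither has any way of knowing that the other holds records for the same key.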
Because there is no sharing of information about data between applications, passing parallel data from one parallel processing application to another parallel processing application often results in serializing the data and forcing sequential processing by both applications.
If applications make an attempt to take advantage of existing data partitioning, assumptions may be made about the partitioning characteristics, which may lead to errors.