Present invention embodiments relate to information processing, and more specifically, to selectively compressing data fields in a parallel data flow.
Parallel data processing engines (e.g., Extract, Transform, and Load (ETL) engines; MapReduce engines; etc.) typically process large volumes of data (e.g., terabytes). The data is broken up into records, each of which is individually processed by different stages of the engine. Each record is broken into a set of fields, each with its own data type and space requirements. Depending on the processing engine used, data may be passed from one process to another, written to disk, and passed over a network within the same data flow. In order to reduce the overhead of transporting data through a parallel data flow, many processing engines (e.g., HADOOP® MapReduce) compress records or blocks of records as they are written to disk or sent over a network. However, data that is written or transmitted many times within a data flow may need to be compressed and decompressed at each such point, which can require a significant number of compressions and decompressions.
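By way of illustration only, the repeated compression described above may be sketched as follows. The sketch assumes a block of records serialized as JSON and compressed with zlib; the record layout and the names write_block, read_block, and run_flow are hypothetical and are not taken from any particular processing engine.

```python
import json
import zlib

# Hypothetical records: each record is a set of fields with different sizes.
records = [
    {"id": i, "name": f"user{i}", "comment": "lorem ipsum " * 20}
    for i in range(100)
]


def write_block(recs):
    """Serialize and compress a whole block of records, all fields together."""
    return zlib.compress(json.dumps(recs).encode("utf-8"))


def read_block(blob):
    """Decompress and deserialize a block written by write_block."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def run_flow(recs, num_hops):
    """Pass records across num_hops disk or network boundaries.

    Every boundary compresses and then decompresses the entire block,
    whether or not the next stage touches any of the large fields.
    """
    compressions = 0
    decompressions = 0
    for _ in range(num_hops):
        blob = write_block(recs)
        compressions += 1
        recs = read_block(blob)
        decompressions += 1
    return recs, compressions, decompressions


result, n_comp, n_decomp = run_flow(records, num_hops=5)
print(n_comp, n_decomp)
```

In this sketch, the number of compressions and decompressions grows with the number of boundaries crossed, and each operation processes every field of every record.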