The invention is directed to an approach for implementing parallel transformations of data records.
A “transformation” of data refers to the process of converting or manipulating data from one form or state to another. There are many types of transformations that may occur to data in computing systems.
For example, a common type of transformation is to calculate a “checksum” for a given set of data. A checksum is based upon any sort of algorithm that transforms a set of data into a value that can be used to verify the integrity of the data that it describes. In general, the checksum is based upon a numerical determination or a type of “summing” of a set or sequence of the bits that make up the data. If that data later becomes corrupt in some way, e.g., some of the bits are “flipped,” then the checksum of the corrupt data will not match the checksum of the original data.
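By way of illustration only, the following minimal sketch shows how a checksum mismatch reveals a flipped bit. It assumes CRC-32 (via Python's standard `zlib.crc32`) as the checksum algorithm; the text above does not prescribe any particular algorithm, and the variable names are hypothetical.

```python
import zlib

# Original data and the checksum recorded for it (CRC-32 is an assumed choice).
original = b"example data record"
stored_checksum = zlib.crc32(original)

# Simulate corruption by flipping the lowest bit of the first byte.
corrupted = bytearray(original)
corrupted[0] ^= 0x01
corrupted = bytes(corrupted)

# Recompute the checksum and compare it with the stored value.
original_ok = zlib.crc32(original) == stored_checksum
corrupted_ok = zlib.crc32(corrupted) == stored_checksum
```

In this sketch the comparison succeeds for the unmodified data and fails for the corrupted copy, since CRC-32 detects any single-bit error.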
Compression is another example of a commonly used transformation. Compression refers to the process of encoding information in a manner that reduces the bandwidth or storage requirements of that data.
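As a brief, non-limiting sketch, the following shows a compression transformation and its inverse. It assumes the DEFLATE-based `zlib` module from the Python standard library and a deliberately repetitive, made-up input; neither is taken from the text above.

```python
import zlib

# Highly repetitive sample data (hypothetical), which compresses well.
data = b"abc" * 1000

# Encode the data into a smaller representation.
compressed = zlib.compress(data)

# Decoding restores the original bytes exactly (lossless compression).
restored = zlib.decompress(compressed)

saved_bytes = len(data) - len(compressed)
```

For this input the compressed form occupies fewer bytes than the original, and decompression reproduces the original data unchanged.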
Yet another example of a common transformation is an encryption algorithm. Encryption refers to the process of converting one form of data into a non-open or cipher-based form of data. The ordinary goal of the encryption-type transformation is to prevent anyone other than the intended recipients of the encrypted data from being able to read or access the data.
While transformations are very useful, specific implementations of transformation algorithms can raise efficiency concerns. For example, consider the application of a transformation to a set of ordered records that are to be written to the same data unit. In this circumstance, performing the transformation sequentially upon the ordered records could result in severe performance bottlenecks.
To address this and other problems, the present invention provides an improved approach for implementing transformations of data records. According to some embodiments, parallelization of transformations is performed against the data records. For checksums, record generators compute the checksum for a newly generated record before copying it into shared memory. Subsequent generators may incrementally aggregate the integrity checksums for data records into checksums for data units. For incompletely aggregated data units, final aggregations may be performed before the data units are written to persistent storage. The checksum is stored at a well-known location with respect to the data unit: the checksum could be stored either outside the data unit or inside the data unit, e.g., in a block header.
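The parallel computation and incremental aggregation of checksums described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: it assumes a simple additive checksum (the sum of byte values modulo 2^32), chosen because per-record values can be combined by addition, and it uses a thread pool to stand in for independent record generators. The names `record_checksum`, `aggregate`, and `checksum_data_unit` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

MOD = 1 << 32  # assumed checksum width of 32 bits


def record_checksum(record: bytes) -> int:
    """Additive checksum of one record: sum of its byte values mod 2**32."""
    return sum(record) % MOD


def aggregate(unit_checksum: int, rec_checksum: int) -> int:
    """Fold one record checksum into the running data-unit checksum."""
    return (unit_checksum + rec_checksum) % MOD


def checksum_data_unit(records: list[bytes]) -> int:
    """Compute per-record checksums in parallel, then aggregate them."""
    # Each worker plays the role of a record generator that checksums
    # its own record independently of the others.
    with ThreadPoolExecutor() as pool:
        per_record = list(pool.map(record_checksum, records))

    # Incremental aggregation into the checksum for the whole data unit.
    unit = 0
    for value in per_record:
        unit = aggregate(unit, value)
    return unit


records = [b"record-1", b"record-2", b"record-3"]
parallel_result = checksum_data_unit(records)

# Reference value: checksum computed sequentially over the whole data unit.
sequential_result = record_checksum(b"".join(records))
```

Under this assumed additive scheme, the aggregated value equals the checksum computed sequentially over the concatenated records, so the expensive per-record work can proceed in parallel while only the cheap aggregation step is ordered. Other checksum algorithms would need their own combining step.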
Other and additional objects, features, and advantages of the invention are described in the detailed description, figures, and claims.