Embodiments of the present invention relate to operation pushdowns and, more specifically, to selective operation pushdowns from an analytics platform to bulk storage.
Analytics platforms or clusters, such as Apache™ Spark™, Apache Hadoop®, and others, are dedicated hardware or software environments for performing analysis on large amounts of data. The use of big data, which refers to large sets of unstructured or semi-structured data, has made such analytics platforms essential to the manipulation of data in modern data centers. Typically, to perform an analytics task, an analytics platform ingests data and processes it locally. In many cases, the bulk of the data is not permanently stored on the analytics platform. Rather, the data is stored in bulk storage, such as OpenStack® Swift, Amazon® Simple Storage Service (S3), or Ceph™, or on a file system or in a database. Thus, the data must be migrated onto the analytics platform for analysis.
Migrating data from bulk storage to an analytics platform is costly and is typically limited by the bandwidth between the bulk storage and the analytics platform. To relieve this bottleneck, some operations may be pushed from the analytics platform to the bulk storage, in an operation referred to as a pushdown or offloading, so that the bulk storage performs these pushed operations.
For example, if analysis is desired only on specific fields (e.g., addresses and phone numbers only), then it is likely not useful to migrate the other fields of data as well. In that case, a SELECT operation, which selects specific fields from the data, may be pushed down to the bulk storage, such that only the specific fields desired are migrated to the analytics platform. The analytics platform can then perform any further processing on the migrated data, which excludes unneeded fields. As a result of the pushdown, the amount of data migrated can be reduced, as compared to migrating all the fields of the data. Similarly, in the case of a FILTER operation, which identifies objects in the data that meet certain criteria, it may be more efficient to perform this operation at the bulk storage, thus delivering to the analytics platform only the records that meet the filtering criteria.
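The effect of pushing down SELECT and FILTER operations can be illustrated with a minimal sketch. The following Python example is an illustration only, under stated assumptions: the BulkStore class, its read method, the count_values helper, and the sample records are all hypothetical and do not correspond to the interface of any actual bulk storage system or analytics platform. It compares the number of field values migrated when no operation is pushed down with the number migrated when both operations are performed at the storage side.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]


class BulkStore:
    """Toy stand-in for bulk storage (hypothetical; not a real storage API)."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records

    def read(
        self,
        fields: Optional[List[str]] = None,
        predicate: Optional[Callable[[Record], bool]] = None,
    ) -> List[Record]:
        """Return records, applying any pushed-down FILTER and SELECT first."""
        migrated = []
        for rec in self._records:
            # Pushed-down FILTER: drop records that fail the criteria.
            if predicate is not None and not predicate(rec):
                continue
            # Pushed-down SELECT: keep only the requested fields.
            if fields is not None:
                rec = {name: rec[name] for name in fields}
            migrated.append(rec)
        return migrated


def count_values(records: List[Record]) -> int:
    """Rough proxy for migration cost: total field values transferred."""
    return sum(len(rec) for rec in records)


store = BulkStore([
    {"name": "Ann", "address": "1 Elm St", "phone": "555-0100", "state": "NY"},
    {"name": "Bob", "address": "2 Oak Ave", "phone": "555-0101", "state": "CA"},
    {"name": "Cy", "address": "3 Pine Rd", "phone": "555-0102", "state": "NY"},
])

# No pushdown: migrate everything, then select and filter on the platform.
no_pushdown = store.read()
local_result = [
    {"address": r["address"], "phone": r["phone"]}
    for r in no_pushdown
    if r["state"] == "NY"
]

# Pushdown: the storage side performs the FILTER and SELECT before migration.
pushed = store.read(
    fields=["address", "phone"],
    predicate=lambda r: r["state"] == "NY",
)

print(count_values(no_pushdown), count_values(pushed))
```

In this sketch both approaches yield the same final result, but the pushdown migrates fewer field values. The filter is applied before the projection so that the filtering criteria may refer to a field that is not itself migrated.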