Analytics frameworks are often used to process large volumes of data and generate insights that drive business decisions and research. Such frameworks provide storage and large-scale processing of datasets on clusters of servers in an analytics tier. Data in an analytics tier is generally stored in a distributed file system format on relatively expensive, fast media to support highly parallel, relatively efficient processing.
However, entities increasingly require long-term retention of data for compliance, preservation, research, and other business needs. Given the relatively large volume of data being stored and the longer retention periods, less frequently accessed or archival data is often stored on less costly storage media in an archive tier. Storage media in an archive tier might include tapes and/or spun-down disks, which are cheaper and require less energy than the storage media in an analytics tier. However, archive tier storage media are relatively slow and exhibit a high time to first byte when responding to data retrieval requests. For example, a tape library must load the media into an available tape drive before servicing data requests, and spun-down disks require time to power on.
Accordingly, multiple tiers of storage are generally used for retention of data, which introduces additional cost and inefficiency when archived data needs to be analyzed using batch analytics platforms executed in an analytics tier. Currently, to service a job, archive data is often copied from the archive tier to the analytics tier before the batch analytics platform operates on the data, an approach commonly referred to as ingest-then-compute. The ingest-then-compute technique introduces significant delay and administrative overhead, and undesirably requires significant additional storage space on the analytics tier to hold the copied data.
In other implementations, archive tier storage media can be network file system (NFS) mounted so that the analytics platform executed by the analytics tier can directly access the archive data. However, because of the high time to first byte of the archive tier storage media, these implementations generally do not provide significant improvement over the ingest-then-compute implementation. In yet other implementations, the analytics platform communicates the order in which archived files will be accessed, allowing the archive tier to prefetch files. However, the efficiency gains provided by prefetching are limited by the inefficiencies of the archive storage media. In particular, prefetching archived files according to the communicated order still often results in a relatively long time to first byte and seek delays, depending on the archive data layout.
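The dependence of prefetching on archive data layout can be illustrated with a minimal sketch. The model below is an assumption made for illustration only: a single tape drive, a hypothetical `ArchivedFile` record carrying a tape identifier and an on-tape offset in arbitrary units, and a head position that resets to zero on every load. None of these names or values come from any particular archive system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedFile:
    name: str
    tape: str    # hypothetical tape identifier
    offset: int  # hypothetical position on the tape, arbitrary units


def count_mounts(order):
    """Count tape loads when files are fetched strictly in the given order (one drive assumed)."""
    mounts = 0
    loaded = None
    for f in order:
        if f.tape != loaded:
            mounts += 1
            loaded = f.tape
    return mounts


def seek_distance(order):
    """Sum head movement within tapes, assuming the position resets to 0 on each load."""
    total = 0
    loaded = None
    pos = 0
    for f in order:
        if f.tape != loaded:
            loaded = f.tape
            pos = 0
        total += abs(f.offset - pos)
        pos = f.offset
    return total


# Order communicated by the analytics platform (illustrative values only).
access_order = [
    ArchivedFile("a", "T1", 100),
    ArchivedFile("b", "T2", 50),
    ArchivedFile("c", "T1", 10),
    ArchivedFile("d", "T2", 400),
]

# The same files grouped by tape and offset, shown only as a point of comparison.
by_layout = sorted(access_order, key=lambda f: (f.tape, f.offset))

print(count_mounts(access_order), seek_distance(access_order))
print(count_mounts(by_layout), seek_distance(by_layout))
```

Under these assumptions, fetching in the communicated order alternates between the two tapes and loads media for every file, whereas the grouped order loads each tape once. The model ignores file sizes, multiple drives, and spin-up time, so it indicates only the direction of the effect, not its magnitude.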