To reduce cost, big data infrastructure (e.g. such as Hadoop, etc.) is typically configured to run over large sets of spinning hard drives directly connected to a large number of hosts. As data typically cannot fit into a host memory and needs to be read and written to a disk, extensive I/O operations are required in order to fetch data associated with a job.
As random access reads and writes are an expensive process, it is preferable to perform sequential access to disks from both reading and writing purposes. Typically, focusing on pure sequential I/O prevents systems from performing fine grained operations such as performing updates to individual records as part of a batch process.
There is thus a need for addressing these and/or other issues associated with the prior art.