Embodiments of the invention relate in general to the field of prefetching data from a storage system.
Big data systems comprise applications that run on large batches of datasets, e.g., astronomical data repositories, video surveillance systems, medical data repositories, and financial transaction logs. In such systems, data that have not been accessed for long periods of time, also referred to as cold data, are typically stored on cheaper, energy-efficient storage media such as tapes. In systems where the amount of cold data is significant, the resulting energy and infrastructure cost savings can be substantial.
However, accessing data from such media is usually slower, which implies a considerable drop in the performance of applications running on data stored on these media.
Prefetching data to faster media can hide the latency and improve performance. Many current state-of-the-art systems employ data prefetching schemes that use data access history to predict data that will be accessed in the near future.
Consider, for example, a system where the accesses to each file in the system are recorded and then used to build a model that predicts which files will be accessed next, given that a file F is currently being accessed. In small systems, there is an implicit assumption that the accesses following, and preceding, the access to each file F in the system can be observed. If, for each file F, the statistics of which file is accessed next are known, then predicting accesses becomes feasible and prefetching is effective.
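By way of illustration only, such a history-based scheme can be sketched as a successor-count model. The sketch below is a minimal example under stated assumptions and is not a description of any particular system: the function names build_successor_model and predict_next, the min_observations parameter, and the sample file names are all hypothetical.

```python
from collections import Counter, defaultdict


def build_successor_model(access_log):
    """Count, for each file, which file was accessed immediately after it."""
    successors = defaultdict(Counter)
    # Pair every access with the access that directly follows it.
    for current, following in zip(access_log, access_log[1:]):
        successors[current][following] += 1
    return successors


def predict_next(successors, current_file, min_observations=1):
    """Return the most frequently observed successor of current_file.

    Returns None when fewer than min_observations (a hypothetical
    threshold) successors were recorded for current_file.
    """
    counts = successors.get(current_file)
    if not counts or sum(counts.values()) < min_observations:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical access history for a small system.
access_log = ["a.dat", "b.dat", "a.dat", "b.dat", "a.dat", "c.dat"]
model = build_successor_model(access_log)
prediction = predict_next(model, "a.dat")  # candidate file to prefetch
```

In this sample history, a.dat is followed by b.dat twice and by c.dat once, so b.dat would be the candidate to prefetch when a.dat is accessed. A file with no recorded successor, such as c.dat here, yields no prediction.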
However, in big data systems, due to their sheer size, reliable statistics of the accesses following each file may not be available. This is because each file may not be accessed a sufficient number of times to maintain reliable statistics. In fact, for most files, any given access is most likely the first access in the file's lifetime and, after subsequent accesses in the near future, the file may never be accessed again. This renders the above-discussed data prefetching schemes ineffective in such big data systems.