The invention relates in general to computerized methods and systems for managing digital datasets stored on a multi-tiered storage system. In particular, it concerns predictive caching methods, wherein datasets that have high probabilities of access are prefetched, e.g., selected in order to be moved across tiers of the storage system.
Multi-tiered storage systems, which comprise several tiers of storage, are known. Such systems typically assign different categories of data to different types of storage media, in order to reduce the overall storage cost while maintaining performance. A tiered storage system usually relies on policies that assign the most frequently accessed data to high-performance storage tiers, whereas rarely accessed data are stored on low-performance (cheaper and/or slower) storage tiers.
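By way of illustration only, a frequency-based placement policy of the kind described above may be sketched as follows. The function name `assign_tiers`, the tier labels "fast" and "slow", and the `hot_capacity` parameter are hypothetical and are introduced here solely for the purpose of the example; actual tiering policies may take many other factors into account.

```python
from collections import Counter


def assign_tiers(access_counts, hot_capacity):
    """Map each dataset to a tier based on how often it was accessed.

    access_counts: mapping from dataset identifier to number of accesses.
    hot_capacity:  number of datasets the high-performance tier can hold
                   (an assumption of this sketch; real capacity is in bytes).
    """
    # Rank by descending access count; break ties by identifier so the
    # result is deterministic.
    ranked = sorted(access_counts, key=lambda d: (-access_counts[d], d))
    hot = set(ranked[:hot_capacity])
    return {d: ("fast" if d in hot else "slow") for d in access_counts}


# Hypothetical access log: dataset "a" is accessed most frequently.
counts = Counter(["a", "b", "a", "c", "a", "b"])
placement = assign_tiers(counts, hot_capacity=1)
```

In this sketch, only the single most frequently accessed dataset is placed on the high-performance tier, and all other datasets remain on the low-performance tier.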
Consider, for example, a storage system wherein applications are run on large batches of datasets (e.g., astronomical data repositories, financial transaction logs, medical data repositories). Data that have not been accessed for long periods of time (also called “cold data”) are stored on cheaper, more energy-efficient media such as tapes. However, accessing data on such media is slower, which implies a substantial drop in performance for applications running on data stored on these media.
Storage systems that use data prefetching schemes, which may depend on the dataset access history, are also known. In such approaches, statistics on previously accessed datasets allow the next accesses to be predicted, which makes data prefetching more effective. However, in big data systems with large amounts of cold data, such statistics are often not available, at least not at the dataset level, so that prefetching cannot be performed efficiently.
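For the sake of illustration, one simple form of history-based prediction counts which dataset was accessed immediately after each other dataset and prefetches the most frequent successor. The sketch below is merely one possible example of such a scheme and is not taken from any particular known system; the names `build_successor_stats` and `predict_next` and the dataset identifiers are hypothetical.

```python
from collections import Counter, defaultdict


def build_successor_stats(history):
    """Count, for each dataset, which datasets were accessed right after it."""
    stats = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        stats[prev][nxt] += 1
    return stats


def predict_next(stats, current):
    """Return the most frequent successor of `current`, or None if unknown."""
    followers = stats.get(current)
    if not followers:
        # No access history for this dataset: nothing can be predicted.
        return None
    # Iterate in sorted order so that ties are broken deterministically.
    return max(sorted(followers), key=followers.__getitem__)


# Hypothetical access history over dataset identifiers.
history = ["d1", "d2", "d3", "d1", "d2", "d4", "d1", "d2"]
stats = build_successor_stats(history)
candidate = predict_next(stats, "d1")
cold_candidate = predict_next(stats, "d9")  # "d9" was never accessed
```

Here, a dataset that has never been accessed yields no prediction at all, which reflects the limitation noted above for cold data lacking dataset-level statistics.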