The creation and storage of digitized data have proliferated in recent years. Accordingly, techniques and mechanisms that facilitate efficient and cost-effective storage of large amounts of digital data are common today. For example, a cluster network environment of nodes may be implemented as a data storage system to facilitate the creation, storage, retrieval, and/or processing of digital data. Such a data storage system may be implemented using a variety of storage architectures, such as a network-attached storage (NAS) environment, a storage area network (SAN), a direct-attached storage environment, and combinations thereof. The foregoing data storage systems may comprise one or more data storage devices configured to store digital data within data volumes.
Digital data stored by data storage systems may be frequently migrated within a data storage system and/or between data storage systems during normal operation. For example, when one or more users desire to access or download files, a portion or even the entire contents of a data volume may be sent across the network to be accessed by the users. Such communications may take place between multiple nodes of the storage network, between a storage array and one or more nodes, between other communication devices on the network (e.g., switches and relays), etc.
One issue that has arisen with the creation of new data storage technologies is that various components are configured to store data on media having different storage characteristics, capacities, costs, and reliability. These storage media introduce new storage designs in which processing components may have speed capabilities much higher than those of the data storage components. To bridge this gap, caching is a common solution: a storage tier that is faster, smaller, and more expensive holds a copy (a "cache") of the most frequently reused parts of the working data set at any point in time, while the data set in its entirety resides on a slower, larger, cheaper storage tier. Due to cost considerations, it is prudent to manage these data caches wisely to maximize performance. Consequently, data is often cached ahead of time, before the read request for that data arrives.
Prefetching is a technique that fetches data ahead of time, before the input/output (I/O) request for that data arrives, so that the cost of a cache miss (e.g., having to fetch data from a slower component) can be avoided. Prefetching can be classified into two broad categories: (1) sequential prefetching and (2) history-based prefetching. Another dimension in prefetching is determining the amount of data to be prefetched, which leads to classifying prefetching schemes into (1) N-block lookahead and (2) adaptive prefetch. While sequential prefetching works well for workloads dominated by sequential accesses, history-based prefetching is useful for workloads that have random data accesses and repeating access patterns.
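The N-block lookahead policy described above can be sketched as follows. This is a minimal illustration only; the block identifiers, cache dictionary, backing store, and the fixed depth of four blocks are hypothetical stand-ins, not features of any particular storage system:

```python
N_LOOKAHEAD = 4  # fixed prefetch depth for N-block lookahead (illustrative)

def read_block(block_id, cache, backing_store):
    """Serve a read; on each access, prefetch the next N sequential blocks."""
    if block_id not in cache:
        # Cache miss: fetch the requested block on demand from slower storage.
        cache[block_id] = backing_store[block_id]
    # N-block lookahead: assume the client will continue reading sequentially.
    for next_id in range(block_id + 1, block_id + 1 + N_LOOKAHEAD):
        if next_id in backing_store and next_id not in cache:
            cache[next_id] = backing_store[next_id]
    return cache[block_id]
```

A subsequent request for any of the four prefetched blocks is then served from the cache, avoiding the miss penalty, at the cost of wasted work if the client's accesses are not in fact sequential.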
One common use of sequential prefetching is a sequential prefetching scheme with N-block lookahead in the file space, based on the assumption that a client's data requests are sequential within a file's context. While such read-ahead requests are file-based, these methods generally do not take into account inter-file block requests or data requests from multiple such clients interleaved with each other.
There are several history-based prefetching schemes that are file-based. For example, "Design and Implementation of a Predictive File Prefetching Algorithm" by Kroeger et al. uses a technique called Extended Partition Context Modelling, which takes into account the prefetch lead time. A similar technique proposed by Griffioen et al. in "Reducing File System Latency using a Predictive Approach" uses a "lookahead period" as the criterion for relating file accesses. These approaches are coarse-grained, however, because they are based on files rather than blocks, and they do not consider important factors such as prefetch wastage and the cost of a prefetch. Working at file granularity has several drawbacks: a) metadata prefetching is not possible if the file system's metadata is not stored in the form of files, and b) prefetching cost is increased through "prefetch wastage," since the entire file's data is brought into the cache even though some of the file's blocks may never appear in the client's request stream. Also, most history-based prefetching schemes are not adaptive: they do not dynamically vary how much data to prefetch or which data to prefetch, but instead use a coarse-grained fixed threshold to cut off what is brought into the cache.
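The fixed-threshold behavior criticized above can be illustrated with a simple first-order history predictor. This sketch is not the algorithm of Kroeger et al. or Griffioen et al.; it is a generic successor-frequency model with an assumed hard cutoff, written to show how a non-adaptive threshold decides what enters the cache:

```python
from collections import defaultdict

class HistoryPrefetcher:
    """First-order history-based prediction: remember which identifier tends
    to follow each accessed identifier, and prefetch a successor only if its
    observed relative frequency exceeds a fixed threshold."""

    def __init__(self, threshold=0.5):
        # successors[a][b] counts how often access b immediately followed a.
        self.successors = defaultdict(lambda: defaultdict(int))
        self.last = None
        self.threshold = threshold

    def record(self, access_id):
        """Update the history with one observed access."""
        if self.last is not None:
            self.successors[self.last][access_id] += 1
        self.last = access_id

    def predict(self, access_id):
        """Return candidates to prefetch after access_id."""
        counts = self.successors[access_id]
        total = sum(counts.values())
        if total == 0:
            return []
        # Fixed-threshold cutoff: the same bar is applied regardless of
        # workload, prefetch cost, or observed prefetch wastage.
        return [b for b, c in counts.items() if c / total >= self.threshold]
```

Because `threshold` never changes, the scheme cannot trade prefetch aggressiveness against wastage as conditions shift, which is the limitation the passage above describes.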
Another approach is discussed in "A Prefetching Scheme Exploiting Both Data Layout and Access History on the Disk" by Song et al. This approach operates at block granularity and also takes history information into account. However, while this system tries to minimize the penalty of a mis-prefetch, it does not adapt with a heuristic; it simply uses a fixed boundary on the mis-prefetch percentage to decide whether to continue or stop prefetching altogether. It is also not sensitive to the timing of a prefetch, e.g., when to initiate a prefetch such that cache hits are maximized without incurring wasted prefetches.
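The boundary-based stop condition just described reduces to a single comparison. The sketch below is an assumed paraphrase of such a policy, not the implementation of Song et al.; the function name, counters, and the 30% boundary are all illustrative:

```python
def should_prefetch(prefetched_total, prefetched_unused, boundary=0.30):
    """Continue prefetching only while the mis-prefetch fraction (prefetched
    blocks that were never read) stays below a hard, non-adaptive boundary."""
    if prefetched_total == 0:
        return True  # no history yet; prefetching proceeds by default
    mis_rate = prefetched_unused / prefetched_total
    # All-or-nothing decision: no gradual throttling, no cost heuristic,
    # and no consideration of when the prefetch should be initiated.
    return mis_rate < boundary
```

Crossing the boundary stops prefetching entirely rather than scaling it back, and nothing in the decision accounts for prefetch timing, which is precisely the insensitivity noted above.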