Storage Resource Management (SRM) focuses upon optimizing the efficiency and processing speed of a Storage Area Network's (SAN's) use of the available drive space. As organizations are faced with increased hardware and storage management costs, many have introduced automatic storage resource management, where storage virtualization in data centers is used to lower maintenance labor costs. Storage virtualization represents the separation of logical storage from physical storage, where data may be accessed without regard to the physical storage or heterogeneous structure. Particularly, through the use of automatic storage resource management, virtual disks are automatically rearranged and migrated, such that the performance of storage pools can meet specific IT policy requirements (e.g. performance load balancing and capacity planning).
Commercial software, such as VMware's® Storage Distributed Resource Scheduler (SDRS), has been deployed in modern data centers. However, in the era of public/hybrid cloud and big data analytics, traditional storage management schemes fail to respond to real-time I/O bursts in a public/hybrid cloud due to the large size of virtual disks. Such commercial software is typically incapable of performing real-time, policy-based storage management due to the high cost of migrating large virtual disks. More particularly, although traditional storage resource management schemes work well in a private data center that executes most of its jobs during the daytime (remaining idle at night), modern data centers usually host multi-tenant cloud platforms and run big data applications 24 hours a day, seven days a week (24/7). Unlike traditional server applications, big data and cloud workloads exhibit highly fluctuating and unpredictable I/O behaviors. First, any user/tenant on the cloud platform can submit jobs at any time, which introduces unexpected workload surges. Second, big data applications manifest distinct I/O behaviors across different execution phases. Since these workload surges occur within a couple of hours or even minutes, they can lead to unexpected storage load imbalance.
Specifically, due to the large size of virtual disks, virtual storage migration takes a long time and imposes high I/O traffic overhead on the system. Moving a virtual disk from one storage pool to another can take anywhere from several minutes to several hours, during which the workload behavior may have already changed. Worse, the current load-balancing interval (i.e. 8 to 16 hours) is too long for detecting and responding to workload surges. These limitations can lead to: 1) high average latency of the entire storage system; 2) extremely unbalanced storage resource utilization; 3) low quality of service (QoS); and 4) frequent breaking of the service level agreement (SLA).
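The mismatch between migration time and surge duration can be illustrated with simple arithmetic. The following is a minimal sketch only: the function names, the 200 MiB/s sustained copy throughput, and the disk sizes are hypothetical values chosen for illustration and are not taken from any cited system.

```python
def migration_seconds(disk_gib: float, throughput_mib_s: float) -> float:
    """Estimate the time to copy a virtual disk at a sustained throughput.

    Assumes the whole disk is copied once at a constant rate; real
    migrations also compete with foreground I/O, so this is a lower bound.
    """
    if throughput_mib_s <= 0:
        raise ValueError("throughput must be positive")
    return disk_gib * 1024.0 / throughput_mib_s


def finishes_before_surge_ends(disk_gib: float,
                               throughput_mib_s: float,
                               surge_seconds: float) -> bool:
    """Return True if the migration completes while the surge is still active."""
    return migration_seconds(disk_gib, throughput_mib_s) <= surge_seconds


if __name__ == "__main__":
    # Hypothetical disk sizes (GiB) and an assumed 200 MiB/s copy rate.
    for size_gib in (100, 500, 2048):
        minutes = migration_seconds(size_gib, 200.0) / 60.0
        print(f"{size_gib} GiB at 200 MiB/s: about {minutes:.1f} min")
```

Under these assumed numbers, a disk of several hundred gigabytes cannot be moved within a surge lasting only a few minutes, which is the situation described above.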
Current models of storage management systems mainly focus upon improving the physical device behavior [8, 19, 20, 21, 22, and 23]. As virtualization has been widely adopted in data centers, efforts to manage storage resources using virtual machines have emerged, as indicated above. Although the Singh reference [9] proposes a system entitled HARMONY, including a VectorDot algorithm that minimizes performance degradation, the VectorDot algorithm only considers the storage system utilization and ignores workload behaviors. Related works, Basil [3], Pesto [4], and Romano [5], consider both the device and workload behaviors; yet, they leverage workload and device characteristics reported by the virtual machine monitor and rearrange storage resources by migrating virtual disks across different storage pools, which is a lengthy process.
The Gulati reference [3] proposes the Basil system, having both workload and device models, which can automatically balance the I/O load across storage devices. Based on these models, storage latency can be predicted and the load-balancing algorithm is performed accordingly. However, the Basil system's storage model is built offline, which limits its usability in a real system.
To address this issue, the Gulati reference [4] proposes the Pesto system implemented in VMware's® SDRS, which incorporates an online storage model (L-Q model). This system implements a workload injector to proactively adjust the storage model online when the system is not busy. The Pesto system further includes congestion management and a cost benefit function. However, the Park reference [5] finds that both the Basil and Pesto systems make improper balance decisions due to the limitation of their models. Park proposes the Romano system, which makes multiple load-balancing decisions before actually migrating the virtual disks, where a simulated annealing algorithm is used to filter out the potentially incorrect decisions.
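The general idea of evaluating many candidate placements before migrating anything can be sketched with a generic simulated annealing loop. This is not the Romano algorithm: the cost function (spread between the most and least loaded pool), the per-disk load values, and all names and parameters below are assumptions made purely for illustration.

```python
import math
import random


def imbalance(placement, loads, n_pools):
    """Toy cost: spread between the most and least loaded storage pool."""
    totals = [0.0] * n_pools
    for disk, pool in enumerate(placement):
        totals[pool] += loads[disk]
    return max(totals) - min(totals)


def anneal(loads, n_pools, steps=5000, t0=10.0, cooling=0.999, seed=0):
    """Search disk-to-pool placements in memory; no disk is actually moved."""
    rng = random.Random(seed)
    current = [0] * len(loads)          # assumed start: every disk on pool 0
    cur_cost = imbalance(current, loads, n_pools)
    best, best_cost = list(current), cur_cost
    temp = t0
    for _ in range(steps):
        disk = rng.randrange(len(loads))
        new_pool = rng.randrange(n_pools)
        old_pool = current[disk]
        if new_pool == old_pool:
            continue                    # not a real move, try another one
        current[disk] = new_pool
        new_cost = imbalance(current, loads, n_pools)
        delta = new_cost - cur_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        else:
            current[disk] = old_pool    # reject: undo the tentative move
        temp *= cooling
    return best, best_cost


if __name__ == "__main__":
    disk_loads = [5, 3, 8, 2, 7, 4, 6, 1]   # hypothetical per-disk I/O loads
    placement, cost = anneal(disk_loads, n_pools=3)
    print("placement:", placement, "imbalance:", cost)
```

Only the best placement found would then be translated into actual migrations, so poor intermediate candidates are discarded without incurring any copy cost.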
In summary, all existing storage management schemes share several common drawbacks. The basic unit of these management schemes is a virtual disk, whose size can range from several hundred gigabytes (GBs) to several terabytes (TBs). Migrating such a large virtual disk often results in long migration time and high performance degradation. Although there have been efforts to improve the efficiency of storage migration, the cost of migrating large virtual disks is still significant. The lengthy migration process prevents current storage management schemes from being used in real time. Instead of tracking and migrating virtual disks frequently, existing systems usually monitor and collect performance characteristics during the entire daytime, using 95% of the sampled data to predict the average latency of the next day. The actual load balancing decisions and storage migrations are made at night, when no application is running. When a private data center has steady I/O behavior, these traditional methods can achieve a desirable load balancing effect.
Nevertheless, as indicated above, for modern data centers that host public cloud platforms (e.g. Amazon AWS [16], Microsoft Azure [17]) and run big data applications [18], workload I/O behavior can fluctuate heavily even within one day. Although the Basak reference [7] discloses a dynamic performance model for a multi-tenant cloud, no resource-scheduling algorithm is proposed. The Alvarez reference [30] presents an approach that selects cloud storage services from a cloud tenant's perspective. Yet, in a multi-tenant cloud environment, highly varying I/O behavior leads to frequent storage load imbalances under this approach, and these imbalances cannot be handled in a timely manner using existing storage management schemes.