The present invention relates to data storage systems. More specifically, the present invention relates to methods and apparatus for distributing data over a range of data storage devices.
Configuration and management of a data storage system can be a major undertaking. Planning for a medium-scale installation (e.g., a few terabytes) might take many months, representing a significant fiscal expenditure. High-end applications (e.g., OLTP or decision support systems) typically deal with many terabytes of data spread over a range of physical devices. The difficulties inherent in configuring and managing storage are compounded by the sheer scale of the systems. Additionally, these high-end applications tend to exhibit fairly complex behaviors. Thus, the question of how to distribute data over a range of storage devices while providing some performance guarantees is not trivial.
The configuration and management difficulties are further compounded because the configuration of a data storage system is dynamic. After a system is initially configured, the configuration is likely to change. Applications and databases are added, new devices are added, and obsolete or defective devices are removed and replaced by devices having different characteristics. Adding to the complexity of configuring a system is the use of network-attached storage devices, along with clients' desire to share the storage across multiple computer systems with nearly arbitrary interconnection topologies via storage fabrics such as Fibre Channel networks.
The complexity of configuration and management can lead to poor provisioning of the resources ("capacity planning"). Poor capacity planning, in turn, might result in the use of more data storage devices than needed, which can needlessly add to the cost of the data storage system.
Additional problems can flow from poor capacity planning. Poor allocation of data among different devices can reduce throughput. For example, two data sets (e.g., two database tables) that are stored on the same device might be accessed at the same time. Those two data sets could compete for the same throughput resources and potentially cause a bottleneck and queuing delays.
Queuing delays arise when a storage device is in the process of servicing a first request and receives additional requests. The additional requests are usually queued and will not be serviced until an outstanding request is completed by the device. Eventually, the storage device will service all of the requests that are queued; however, response time will suffer.
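By way of illustration only, the queuing behavior described above can be sketched for a single storage device that services requests one at a time in arrival order. The function name `fifo_response_times`, the fixed per-request service time, and the first-come, first-served discipline are assumptions made for this sketch and are not drawn from any particular device.

```python
def fifo_response_times(arrivals, service_time):
    """Return the response time of each request on one device.

    Assumptions (hypothetical, for illustration): the device services one
    request at a time, in arrival order, and every request takes the same
    service_time to complete.
    """
    device_free_at = 0.0   # time at which the device finishes its current request
    response_times = []
    for arrival in sorted(arrivals):
        # A request that arrives while the device is busy waits in the queue.
        start = max(arrival, device_free_at)
        device_free_at = start + service_time
        response_times.append(device_free_at - arrival)
    return response_times


# Requests spaced further apart than the service time never queue.
spaced = fifo_response_times([0, 10, 20], service_time=5)

# Requests arriving while the device is busy wait behind earlier ones.
bursty = fifo_response_times([0, 1, 2], service_time=5)
```

In this sketch every request is eventually serviced, but each request in the burst waits longer than the one before it, which is the response-time penalty described above.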
Analysis of application behavior, or "workload characterization," can be used to improve the capacity planning of data storage systems. For example, if two data sets are competing for the same throughput resources, it would be very useful to identify the degree to which these data sets are being used simultaneously. Once identified, the data sets can be re-allocated to avoid a bottleneck.
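One simple way to express the degree of simultaneous use, offered here only as a hedged sketch, is to divide time into fixed windows and measure the fraction of active windows in which both data sets were accessed. The function names, the window length, and the choice of this particular ratio are assumptions made for illustration; they are not a measure prescribed by the text above.

```python
def active_windows(timestamps, window):
    """Return the set of window indices in which at least one access occurred."""
    return {int(t // window) for t in timestamps}


def overlap_fraction(accesses_a, accesses_b, window=1.0):
    """Fraction of active windows in which both data sets were accessed.

    Hypothetical measure: 0.0 means the two data sets were never active in
    the same window; 1.0 means they were always active together.
    """
    a = active_windows(accesses_a, window)
    b = active_windows(accesses_b, window)
    active = a | b
    if not active:
        return 0.0
    return len(a & b) / len(active)


# Access timestamps (in seconds) for two data sets on the same device.
table_a = [0.1, 1.2, 2.5]
table_b = [1.7, 2.9, 5.0]
score = overlap_fraction(table_a, table_b, window=1.0)
```

Under these assumptions, a high score suggests that the two data sets are candidates for placement on different devices, while a low score suggests they can share a device with little contention.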
Therefore, it would be desirable to have a better understanding of workload characterization in order to better allocate workloads across the storage devices.